Thread: out of memory errors
Hi All,

I need some assistance with a particular out of memory issue I am
currently experiencing; your thoughts would be greatly appreciated.

Configuration:

[1] 3 x ESX VM's
    [a] 8 vCPU's each
    [b] 16GB memory each
[2] CentOS 6.5 64-bit on each
    [a] Kernel Rev: 2.6.32-431.17.1.el6.x86_64
[3] PostgreSQL from the official repository
    [a] Version 9.3.4
[4] Configured as a master-slave pacemaker/cman/pgsql cluster
    [a] Pacemaker version: 1.1.10-14
    [b] CMAN version: 3.0.12.1-59
    [c] pgsql RA version: taken from the clusterlabs git repo 3 months ago
        (can't find a version in the RA file)

I did not tune any OS IPC parameters, as I believe PostgreSQL 9.3 no longer
uses those (please correct me if I am wrong).

I have the following OS settings in place to try to get optimal use of
memory and smooth out fsync operations (comments may not be 100% accurate :)):

# Shrink FS cache before paging to swap
vm.swappiness = 0
# Don't hand out more memory than necessary
vm.overcommit_memory = 2
# Smooth out FS sync
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

I have the following memory related settings for PostgreSQL:

work_mem = 1MB
maintenance_work_mem = 128MB
effective_cache_size = 6GB
max_connections = 700
shared_buffers = 4GB
temp_buffers = 8MB
wal_buffers = 16MB
max_stack_depth = 2MB

Currently there are roughly 300 client connections active when this error
occurs.

What appears to have happened is that an autovacuum process attempts to
kick off and fails with an out of memory error; shortly after that, the
cluster resource agent attempts a connection to template1 to check whether
the database is up, and that connection fails with an out of memory error
as well, at which point the cluster fails the database over to another node.

Looking at system memory usage when this error occurs, there is roughly
4GB - 5GB of free physical memory, swap (21GB) is not in use at all, and
the page cache is roughly 3GB in size.

I have attached the two memory dump logs: the first error is related to
autovacuum and the second is the cluster RA connection attempt which fails
too. I do not know how to read that memory information to come up with any
ideas to correct this issue.

The OS default for stack depth is 10MB; shall I attempt to increase
max_stack_depth to 10MB too? The system does not appear to be running out
of memory, so I'm wondering if I have some issue with limits or some
memory related settings.

Any thoughts, tips or suggestions would be greatly appreciated. If you
need any additional info from me, please don't hesitate to ask.

Thanks
Bruce
[Attachment: memory dump logs]
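To inspect the OS-level limits the post above asks about (the stack limit
that max_stack_depth is checked against, and the commit accounting used by
vm.overcommit_memory = 2), a small sketch using standard Linux shell and
procfs interfaces, run as the user that starts the postmaster:

# Per-process stack size limit; the PostgreSQL docs recommend keeping
# max_stack_depth a little below this value
ulimit -s

# Overcommit policy and ratio currently in effect
cat /proc/sys/vm/overcommit_memory /proc/sys/vm/overcommit_ratio

# Commit ceiling vs. address space already committed
grep -E 'CommitLimit|Committed_AS' /proc/meminfo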
Hi,

On 2014-06-16 13:56:23 +0100, Bruce McAlister wrote:
> [1] 3 x ESX VM's
>     [a] 8 vCPU's each
>     [b] 16GB memory each

> # Don't hand out more memory than necessary
> vm.overcommit_memory = 2

So you haven't tuned overcommit_ratio at all? Can you show
/proc/meminfo's contents?

My guess is that the CommitLimit is too low...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
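For context: under vm.overcommit_memory = 2 the kernel refuses any
allocation that would push the total committed address space past a hard
ceiling, which Linux derives roughly as follows (ignoring the huge-page
adjustment, which does not apply here since no huge pages are reserved):

    CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100

This means a request can fail with "out of memory" even while plenty of
physical memory is free, because it is the commit accounting, not actual
RAM, that has been exhausted.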
Hi,

On 16/06/2014 14:15, Andres Freund wrote:
> Hi,
>
> On 2014-06-16 13:56:23 +0100, Bruce McAlister wrote:
>> [1] 3 x ESX VM's
>>     [a] 8 vCPU's each
>>     [b] 16GB memory each
>> # Don't hand out more memory than necessary
>> vm.overcommit_memory = 2
> So you haven't tuned overcommit_ratio at all? Can you show
> /proc/meminfo's contents?
> My guess is that the CommitLimit is too low...

No, I have not tuned overcommit_ratio.

Below is the /proc/meminfo contents. One note though: the database is
currently not running on this node, just in case I need to make some
changes that require a restart.

[root@bfievdb01 heartbeat]# cat /proc/meminfo
MemTotal:       16333652 kB
MemFree:         2928544 kB
Buffers:          197216 kB
Cached:          1884032 kB
SwapCached:            0 kB
Active:          4638780 kB
Inactive:        1403676 kB
Active(anon):    4006088 kB
Inactive(anon):     7120 kB
Active(file):     632692 kB
Inactive(file): 1396556 kB
Unevictable:       65004 kB
Mlocked:           56828 kB
SwapTotal:      22015984 kB
SwapFree:       22015984 kB
Dirty:              3616 kB
Writeback:             0 kB
AnonPages:       4026228 kB
Mapped:            82408 kB
Shmem:             45352 kB
Slab:             197052 kB
SReclaimable:     106804 kB
SUnreclaim:        90248 kB
KernelStack:        4000 kB
PageTables:        15172 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    30182808 kB
Committed_AS:    4342644 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     7004496 kB
VmallocChunk:   34352726816 kB
HardwareCorrupted:     0 kB
AnonHugePages:   3868672 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       10240 kB
DirectMap2M:    16766976 kB

Thanks
Bruce
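Plugging this box's numbers into the formula above (rough arithmetic,
assuming the default overcommit_ratio of 50):

    CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100
                = 22015984 kB + 16333652 kB * 50 / 100
                = 30182810 kB    (the kernel reports 30182808 kB)

So the ~28.8GB ceiling is consistent with the untouched default.
Committed_AS is only ~4.3GB in this snapshot, but the database was stopped
when it was taken; with 4GB of shared_buffers plus several hundred backends
it is plausible that commitments approach that ceiling under load even
while physical memory still looks free.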
I was reading into the parameter a little more and it appears that the
default for vm.overcommit_ratio is 50%. I am considering bumping this up
to 95%, so the sums look like this:

max memory allocation for processes = swap + ratio of physical memory
                                    = 21 + (16 * 0.95)
                                    = 36.2GB

This in theory should always leave me with roughly 1GB of free physical
memory, though swap may be blown :) (if my understanding of this parameter
is correct).

What I don't understand is that, even at its default, the overcommit ratio
is 50% of physical memory, which would make it 21GB + 8GB, ending up at
around 29GB (which looks about right in the meminfo output I posted
earlier). So, assuming my understanding is correct:

[1] How can an analyze process run out of memory on this setting if it is
    asking for, at most, maintenance_work_mem (plus some overhead), i.e.
    128MB?
[2] How can a new connection run out of memory? I presume work_mem plus
    some overhead, so I'm guessing around 2MB of memory?

I'm beginning to wonder if my issue is somewhere else now. Thanks for the
tip about looking at vm.overcommit_ratio, though; I obviously overlooked
this setting when setting vm.overcommit_memory = 2.

Any other pointers would be greatly appreciated :)

Reference:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-captun.html

Thanks
Bruce

On 16/06/2014 14:21, Bruce McAlister wrote:
> Hi,
>
> On 16/06/2014 14:15, Andres Freund wrote:
>> So you haven't tuned overcommit_ratio at all? Can you show
>> /proc/meminfo's contents?
>> My guess is that the CommitLimit is too low...
>
> No, I have not tuned overcommit_ratio.
>
> Below is the /proc/meminfo contents. One note though: the database is
> currently not running on this node, just in case I need to make some
> changes that require a restart.
>
> [...]
>
> Thanks
> Bruce
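If the plan is to test overcommit_ratio = 95, a minimal sketch for applying
and persisting it on CentOS 6 (standard sysctl interfaces; run as root):

# Apply immediately
sysctl -w vm.overcommit_ratio=95

# Persist across reboots
echo 'vm.overcommit_ratio = 95' >> /etc/sysctl.conf

# Confirm the new ceiling
grep -E 'CommitLimit|Committed_AS' /proc/meminfo

With this host's numbers that would raise CommitLimit to roughly
22015984 kB + 16333652 kB * 95 / 100 = ~37.5 million kB (about 35.8GiB),
broadly in line with the ~36GB estimate above.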