Thread: out of memory errors
Hi All,

I need some assistance with a particular out of memory issue I am
currently experiencing; your thoughts would be greatly appreciated.

Configuration:

[1] 3 x ESX VM's
    [a] 8 vCPU's each
    [b] 16GB memory each
[2] CentOS 6.5 64-bit on each
    [a] Kernel Rev: 2.6.32-431.17.1.el6.x86_64
[3] PostgreSQL from the official repository
    [a] Version 9.3.4
[4] Configured as a master-slave pacemaker/cman/pgsql cluster
    [a] Pacemaker version: 1.1.10-14
    [b] CMAN version: 3.0.12.1-59
    [c] pgsql RA version: taken from the clusterlabs git repo 3 months ago
        (can't find a version in the RA file)

I did not tune any OS IPC parameters, as I believe PostgreSQL 9.3 no longer
uses those (please correct me if I am wrong).

I have the following OS settings in place to try to get optimal use of
memory and smooth out fsync operations (comments may not be 100% accurate :)):

# Shrink FS cache before paging to swap
vm.swappiness = 0
# Don't hand out more memory than necessary
vm.overcommit_memory = 2
# Smooth out FS sync
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

I have the following memory related settings for PostgreSQL:

work_mem = 1MB
maintenance_work_mem = 128MB
effective_cache_size = 6GB
max_connections = 700
shared_buffers = 4GB
temp_buffers = 8MB
wal_buffers = 16MB
max_stack_depth = 2MB

Currently there are roughly 300 client connections active when this error
occurs.

What appears to have happened is that an autovacuum process attempts to
kick off and fails with an out of memory error; shortly after that, the
cluster resource agent attempts a connection to template1 to check whether
the database is up, and that connection fails with an out of memory error
as well, at which point the cluster fails the database over to another node.

Looking at system memory usage when this error occurs, there is roughly
4GB - 5GB of free physical memory, swap (21GB) is not in use at all, and
the page cache is roughly 3GB in size.

I have attached the two memory dump logs: the first error is related to
autovacuum and the second is the cluster RA connection attempt which fails
too. I do not know how to read that memory information to come up with any
ideas to correct this issue.

The OS default for stack depth is 10MB; shall I attempt to increase
max_stack_depth to 10MB too? The system does not appear to be running out
of memory, so I'm wondering if I have some issue with limits or some
memory related settings.

Any thoughts, tips or suggestions would be greatly appreciated. If you
need any additional info from me, please don't hesitate to ask.

Thanks
Bruce
[Attachment: memory dump logs]
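To inspect the OS-level limits the post above asks about (the stack limit
that max_stack_depth is checked against, and the commit accounting used by
vm.overcommit_memory = 2), a small sketch using standard Linux shell and
procfs interfaces, run as the user that starts the postmaster:

# Per-process stack size limit; the PostgreSQL docs recommend keeping
# max_stack_depth a little below this value
ulimit -s

# Overcommit policy and ratio currently in effect
cat /proc/sys/vm/overcommit_memory /proc/sys/vm/overcommit_ratio

# Commit ceiling vs. address space already committed
grep -E 'CommitLimit|Committed_AS' /proc/meminfo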
Hi,

On 2014-06-16 13:56:23 +0100, Bruce McAlister wrote:
> [1] 3 x ESX VM's
>     [a] 8 vCPU's each
>     [b] 16GB memory each

> # Don't hand out more memory than necessary
> vm.overcommit_memory = 2

So you haven't tuned overcommit_ratio at all? Can you show
/proc/meminfo's contents?

My guess is that the CommitLimit is too low...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
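For context: under vm.overcommit_memory = 2 the kernel refuses any
allocation that would push the total committed address space past a hard
ceiling, which Linux derives roughly as follows (ignoring the huge-page
adjustment, which does not apply here since no huge pages are reserved):

    CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100

This means a request can fail with "out of memory" even while plenty of
physical memory is free, because it is the commit accounting, not actual
RAM, that has been exhausted.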
Hi,

On 16/06/2014 14:15, Andres Freund wrote:
> Hi,
>
> On 2014-06-16 13:56:23 +0100, Bruce McAlister wrote:
>> [1] 3 x ESX VM's
>>     [a] 8 vCPU's each
>>     [b] 16GB memory each
>> # Don't hand out more memory than necessary
>> vm.overcommit_memory = 2
> So you haven't tuned overcommit_ratio at all? Can you show
> /proc/meminfo's contents?
> My guess is that the CommitLimit is too low...

No, I have not tuned overcommit_ratio.

Below is the /proc/meminfo contents. One note though: the database is
currently not running on this node, just in case I need to make some
changes that require a restart.

[root@bfievdb01 heartbeat]# cat /proc/meminfo
MemTotal:       16333652 kB
MemFree:         2928544 kB
Buffers:          197216 kB
Cached:          1884032 kB
SwapCached:            0 kB
Active:          4638780 kB
Inactive:        1403676 kB
Active(anon):    4006088 kB
Inactive(anon):     7120 kB
Active(file):     632692 kB
Inactive(file): 1396556 kB
Unevictable:       65004 kB
Mlocked:           56828 kB
SwapTotal:      22015984 kB
SwapFree:       22015984 kB
Dirty:              3616 kB
Writeback:             0 kB
AnonPages:       4026228 kB
Mapped:            82408 kB
Shmem:             45352 kB
Slab:             197052 kB
SReclaimable:     106804 kB
SUnreclaim:        90248 kB
KernelStack:        4000 kB
PageTables:        15172 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    30182808 kB
Committed_AS:    4342644 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     7004496 kB
VmallocChunk:   34352726816 kB
HardwareCorrupted:     0 kB
AnonHugePages:   3868672 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       10240 kB
DirectMap2M:    16766976 kB

Thanks
Bruce
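Plugging this box's numbers into the formula above (rough arithmetic,
assuming the default overcommit_ratio of 50):

    CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100
                = 22015984 kB + 16333652 kB * 50 / 100
                = 30182810 kB    (the kernel reports 30182808 kB)

So the ~28.8GB ceiling is consistent with the untouched default.
Committed_AS is only ~4.3GB in this snapshot, but the database was stopped
when it was taken; with 4GB of shared_buffers plus several hundred backends
it is plausible that commitments approach that ceiling under load even
while physical memory still looks free.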
I was reading into the parameter a little more and it appears that the
default for vm.overcommit_ratio is 50%. I am considering bumping this up
to 95%, so the sums look like this:

max memory allocation for processes = swap + ratio of physical memory
                                    = 21 + (16 * 0.95)
                                    = 36.2GB

This in theory should always leave me with roughly 1GB of free physical
memory, though swap may be blown :) (if my understanding of this parameter
is correct).

What I don't understand is that, even at its default, the overcommit ratio
is 50% of physical memory, which would make it 21GB + 8GB, ending up at
around 29GB (which looks about right in the meminfo output I posted
earlier). So, assuming my understanding is correct:

[1] How can an analyze process run out of memory on this setting if it is
    asking for, at most, maintenance_work_mem (plus some overhead), i.e.
    128MB?
[2] How can a new connection run out of memory? I presume work_mem plus
    some overhead, so I'm guessing around 2MB of memory?

I'm beginning to wonder if my issue is somewhere else now. Thanks for the
tip about looking at vm.overcommit_ratio, though; I obviously overlooked
this setting when setting vm.overcommit_memory = 2.

Any other pointers would be greatly appreciated :)

Reference:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-captun.html

Thanks
Bruce

On 16/06/2014 14:21, Bruce McAlister wrote:
> Hi,
>
> On 16/06/2014 14:15, Andres Freund wrote:
>> So you haven't tuned overcommit_ratio at all? Can you show
>> /proc/meminfo's contents?
>> My guess is that the CommitLimit is too low...
>
> No, I have not tuned overcommit_ratio.
>
> Below is the /proc/meminfo contents. One note though: the database is
> currently not running on this node, just in case I need to make some
> changes that require a restart.
>
> [...]
>
> Thanks
> Bruce
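If the plan is to test overcommit_ratio = 95, a minimal sketch for applying
and persisting it on CentOS 6 (standard sysctl interfaces; run as root):

# Apply immediately
sysctl -w vm.overcommit_ratio=95

# Persist across reboots
echo 'vm.overcommit_ratio = 95' >> /etc/sysctl.conf

# Confirm the new ceiling
grep -E 'CommitLimit|Committed_AS' /proc/meminfo

With this host's numbers that would raise CommitLimit to roughly
22015984 kB + 16333652 kB * 95 / 100 = ~37.5 million kB (about 35.8GiB),
broadly in line with the ~36GB estimate above.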