Hi All,
I need some assistance with an out-of-memory issue I am currently
experiencing; your thoughts would be greatly appreciated.
Configuration:
[1] 3 x ESX VMs
[a] 8 vCPUs each
[b] 16GB memory each
[2] CentOS 6.5 64-bit on each
[a] Kernel Rev: 2.6.32-431.17.1.el6.x86_64
[3] Postgresql from official repository
[a] Version 9.3.4
[4] Configured as a master-slave pacemaker/cman/pgsql cluster
[a] Pacemaker version: 1.1.10-14
[b] CMAN version: 3.0.12.1-59
[c] pgsql RA version: taken from the clusterlabs git repo 3
months ago (can't find a version in the RA file)
I did not tune any OS IPC parameters, as I believe PostgreSQL 9.3 no
longer uses System V shared memory for its main segment (please correct
me if I am wrong).
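If it is relevant, I can confirm what is actually allocated with the
following (just a sanity check, assuming stock 9.3 behaviour):
# System V segments still in use (9.3 should only need a small one)
ipcs -m
# Current kernel limits, in case they still apply
sysctl kernel.shmmax kernel.shmall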
I have the following OS settings in place to try to get optimal use of
memory and smooth out fsync operations (comments may not be 100%
accurate :) ):
# Shrink FS cache before paging to swap
vm.swappiness = 0
# Don't hand out more memory than necessary
vm.overcommit_memory = 2
# Smooth out FS Sync
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
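One thing I am unsure about: my understanding is that with
vm.overcommit_memory = 2 the kernel enforces a hard commit limit of
swap plus overcommit_ratio percent of RAM (50% by default), so
allocations can fail even while free shows spare physical memory. The
next time the error fires I plan to capture:
# Hard commit limit vs. memory currently committed
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
# Ratio used to compute CommitLimit (default 50)
cat /proc/sys/vm/overcommit_ratio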
I have the following memory related settings for Postgresql:
work_mem = 1MB
maintenance_work_mem = 128MB
effective_cache_size = 6GB
max_connections = 700
shared_buffers = 4GB
temp_buffers = 8MB
wal_buffers = 16MB
max_stack_depth = 2MB
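For what it is worth, my rough worst-case arithmetic (assuming a single
work_mem allocation per backend, which I know complex queries can
exceed):
# shared_buffers                                        4096 MB
# 700 connections * (1MB work_mem + 8MB temp_buffers)   6300 MB
# a few autovacuum workers * 128MB maintenance_work_mem ~384 MB
# total                                                 ~10.5 GB of 16 GB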
Roughly 300 client connections are active when this error occurs.
What appears to happen is the following: an autovacuum process attempts
to start and fails with an out-of-memory error. Shortly after that, the
cluster resource agent attempts a connection to template1 to check
whether the database is up; this connection also fails with an
out-of-memory error, at which point the cluster fails the database over
to another node.
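To reproduce the RA's check by hand the next time this happens, I
intend to run roughly the following as the OS user the cluster runs as
(the exact user and query are my guesses from skimming the RA script):
# Approximation of the RA's liveness check
sudo -u postgres psql -d template1 -c 'select now();'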
Looking at the system memory usage, there is roughly 4GB - 5GB of free
physical memory, swap (21GB) is not in use at all, and the page cache
is roughly 3GB in size when the error occurs.
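To rule out per-process limits (the postmaster started by the cluster
may inherit a different ulimit environment than a login shell), I can
also check:
# Limits the running postmaster actually inherited;
# pgrep -o picks the oldest postgres process, i.e. the postmaster
cat /proc/$(pgrep -o postgres)/limits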
I have attached the two memory context dump logs: the first error is
from the autovacuum process and the second is from the failed cluster
RA connection attempt. I do not know how to read that memory
information well enough to come up with any ideas to correct this
issue.
The OS default stack depth is 10MB; should I increase max_stack_depth
to 10MB too?
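For reference, this is how I compared the two (if I read the docs
correctly, max_stack_depth should stay a bit below the kernel limit
rather than match it):
# Kernel stack limit for the session, in KB (10240 KB = 10MB here)
ulimit -s
# Value the server is currently running with
psql -d template1 -c 'SHOW max_stack_depth;'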
The system does not appear to be running out of memory overall, so I'm
wondering if I have an issue with limits or some memory-related
setting.
Any thoughts, tips, suggestions would be greatly appreciated.
If you need any additional info from me, please don't hesitate to ask.
Thanks
Bruce