I have a client that experienced several out-of-memory errors a few
weeks ago (March 10 & 11), and I'd like to figure out the cause. The
logs show that they were getting out-of-memory errors for about half an
hour to an hour, after which one of the processes would crash and take
the whole database down. After they restarted the server it would
eventually start giving out-of-memory messages and crash again. This
happened a total of five times over a 24-hour period; after that we did
not see these errors again. They upgraded to 8.1.11 on the 14th and have
since moved some of the databases to different servers.
First, some background information:
Software (at the time of the memory errors): CentOS 4.5 (x86_64) running
the stock 2.6.9-55.ELsmp kernel, with PostgreSQL 8.1.9 (from the RPMs
provided on the PostgreSQL web site: postgresql-8.1.9-1PGDG.x86_64).
Hardware: 4 dual-core Opterons. 16GB physical RAM, 2GB swap.
Database: they use persistent connections and usually have around 1000
open database connections, the vast majority of which are idle at any
given time, though they do run a lot of queries. The total size of the
databases in this cluster is 36GB, with the largest database being 21GB
and the largest table being 2.5GB (about 20 million tuples).
Highlights of postgresql.conf settings:
max_connections = 2000
shared_buffers = 120000
work_mem = 4096
maintenance_work_mem = 268435
max_fsm_pages = 1500000
max_fsm_relations = 2000
checkpoint_segments = 8
effective_cache_size = 1400000
stats_start_collector = on
stats_command_string = off
stats_block_level = on
stats_row_level = on
stats_reset_on_server_start = off
autovacuum = on
lc_* = 'en_US.UTF-8'
Note that not all of these settings may have been in effect during the
first crash, since the server hadn't been restarted in quite some time.
However, these settings were in effect during the second and third crashes.
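For reference, here's the back-of-the-envelope memory math I've been
working from (this assumes 8kB buffer pages and kB units for the *_mem
settings, which I believe is how 8.1 interprets bare numbers):

# shared_buffers: 120000 pages * 8kB -> roughly 937MB of shared memory
echo $((120000 * 8 / 1024))    # 937 (MB)
# work_mem: 4096kB = 4MB per sort/hash step, per backend; with ~1000
# connections this can add up quickly if many of them sort at once
echo $((4096 / 1024))          # 4 (MB)
# maintenance_work_mem: 268435kB * 1024 = 274,877,440 bytes, which is almost
# exactly the 274877436-byte request that fails in the log excerpt below
echo $((268435 * 1024))        # 274877440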
ulimit settings:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
pending signals (-i) 1024
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 137216
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Miscellaneous facts:
Before the crashes there were a lot of "duplicate key violates unique
constraint ..." errors, as well as some syntax errors in some of the
queries. Many of these errors were fixed after the crashes. I don't
think they would be related to the problem unless there's a memory leak
somewhere.
This server does not have any other processes running that could consume
a significant amount of memory. Some of the other processes that run
are SSH, cupsd, snmpd, sendmail, portsentry, xinetd, and Dell
OpenManage. This server has no cronjobs.
These errors usually seemed to occur right after autovacuum started
vacuuming a database, although I'm not sure if autovacuum is the cause here.
Here are some of the relevant portions from the logs (some of the
database/relation names are obfuscated to protect confidentiality):
2008-03-10 18:42:58 EDT LOG: autovacuum: processing database
"database1" [this database is 4GB in size]
TopMemoryContext: 401632 total in 11 blocks; 12832 free (12 chunks);
388800 used
TopTransactionContext: 8192 total in 1 blocks; 7352 free (11 chunks);
840 used
Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512
used
Autovacuum context: 253952 total in 5 blocks; 131152 free (147 chunks);
122800 used
smgr relation table: 24576 total in 2 blocks; 15968 free (4 chunks);
8608 used
Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
PortalMemory: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Relcache by OID: 24576 total in 2 blocks; 7632 free (3 chunks); 16944 used
CacheMemoryContext: 2157568 total in 10 blocks; 941176 free (0 chunks);
1216392 used
some_pkey: 1024 total in 1 blocks; 312 free (0 chunks); 712 used [note:
this index is currently about 41MB but only holds 129 small tuples, so
it's very bloated]
pg_index_indrelid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_constraint_conrelid_index: 1024 total in 1 blocks; 312 free (0
chunks); 712 used
pg_attrdef_adrelid_adnum_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_autovacuum_vacrelid_index: 1024 total in 1 blocks; 312 free (0
chunks); 712 used
pg_database_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_type_typname_nsp_index: 1024 total in 1 blocks; 216 free (0 chunks);
808 used
pg_type_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_statistic_relid_att_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_auth_members_member_role_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_auth_members_role_member_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_proc_proname_args_nsp_index: 1024 total in 1 blocks; 208 free (0
chunks); 816 used
pg_proc_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 16 free (0
chunks); 1008 used
pg_operator_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_opclass_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 208 free (0
chunks); 816 used
pg_namespace_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_namespace_nspname_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_language_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_language_name_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_inherits_relid_seqno_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_index_indexrelid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_authid_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_authid_rolname_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_database_datname_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_conversion_oid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_conversion_name_nsp_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_conversion_default_index: 1024 total in 1 blocks; 16 free (0 chunks);
1008 used
pg_class_relname_nsp_index: 1024 total in 1 blocks; 216 free (0 chunks);
808 used
pg_class_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_cast_source_target_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_amproc_opc_proc_index: 1024 total in 1 blocks; 208 free (0 chunks);
816 used
pg_amop_opr_opc_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used
pg_amop_opc_strat_index: 1024 total in 1 blocks; 208 free (0 chunks);
816 used
pg_aggregate_fnoid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
Per-database table: 57344 total in 3 blocks; 26096 free (10 chunks);
31248 used
Per-database table: 57344 total in 3 blocks; 34320 free (10 chunks);
23024 used
Per-database table: 24576 total in 2 blocks; 5696 free (4 chunks); 18880
used
Per-database table: 122880 total in 4 blocks; 9344 free (13 chunks);
113536 used
Per-database table: 122880 total in 4 blocks; 21680 free (13 chunks);
101200 used
Per-database table: 57344 total in 3 blocks; 34320 free (10 chunks);
23024 used
Per-database table: 24576 total in 2 blocks; 5696 free (4 chunks); 18880
used
Per-database table: 122880 total in 4 blocks; 23744 free (14 chunks);
99136 used
Per-database table: 122880 total in 4 blocks; 21680 free (13 chunks);
101200 used
Per-database table: 24576 total in 2 blocks; 13920 free (4 chunks);
10656 used
Per-database table: 122880 total in 4 blocks; 21680 free (13 chunks);
101200 used
Databases hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
MdSmgr: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
LockTable (locallock hash): 24576 total in 2 blocks; 15984 free (5
chunks); 8592 used
Timezones: 52560 total in 2 blocks; 3744 free (0 chunks); 48816 used
Postmaster: 2328624 total in 3 blocks; 18048 free (86 chunks); 2310576 used
ErrorContext: 8192 total in 1 blocks; 8160 free (3 chunks); 32 used
2008-03-10 18:42:58 EDT ERROR: out of memory
2008-03-10 18:42:58 EDT DETAIL: Failed on request of size 274877436.
TopMemoryContext: 458976 total in 18 blocks; 12528 free (53 chunks);
446448 used
TopTransactionContext: 8192 total in 1 blocks; 7648 free (0 chunks); 544
used
SPI Plan: 15360 total in 4 blocks; 584 free (0 chunks); 14776 used
SPI Plan: 7168 total in 3 blocks; 952 free (0 chunks); 6216 used
SPI Plan: 3072 total in 2 blocks; 1112 free (0 chunks); 1960 used
SPI Plan: 3072 total in 2 blocks; 1112 free (0 chunks); 1960 used
SPI Plan: 7168 total in 3 blocks; 952 free (0 chunks); 6216 used
SPI Plan: 3072 total in 2 blocks; 1576 free (0 chunks); 1496 used
SPI Plan: 3072 total in 2 blocks; 1576 free (0 chunks); 1496 used
SPI Plan: 3072 total in 2 blocks; 1256 free (0 chunks); 1816 used
SPI Plan: 3072 total in 2 blocks; 1192 free (0 chunks); 1880 used
SPI Plan: 3072 total in 2 blocks; 1192 free (0 chunks); 1880 used
SPI Plan: 3072 total in 2 blocks; 1192 free (0 chunks); 1880 used
SPI Plan: 1024 total in 1 blocks; 72 free (0 chunks); 952 used
PL/PgSQL function context: 24576 total in 2 blocks; 15096 free (4
chunks); 9480 used
SPI Plan: 15360 total in 4 blocks; 776 free (0 chunks); 14584 used
SPI Plan: 3072 total in 2 blocks; 1112 free (0 chunks); 1960 used
SPI Plan: 3072 total in 2 blocks; 1192 free (0 chunks); 1880 used
SPI Plan: 3072 total in 2 blocks; 1192 free (0 chunks); 1880 used
SPI Plan: 3072 total in 2 blocks; 1192 free (0 chunks); 1880 used
PL/PgSQL function context: 24576 total in 2 blocks; 1024 free (4
chunks); 23552 used
SPI Plan: 3072 total in 2 blocks; 1448 free (0 chunks); 1624 used
SPI Plan: 7168 total in 3 blocks; 192 free (0 chunks); 6976 used
PL/PgSQL function context: 8192 total in 1 blocks; 6016 free (3 chunks);
2176 used
SPI Plan: 3072 total in 2 blocks; 1064 free (0 chunks); 2008 used
SPI Plan: 3072 total in 2 blocks; 1208 free (0 chunks); 1864 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1256 free (0 chunks); 1816 used
SPI Plan: 3072 total in 2 blocks; 1688 free (0 chunks); 1384 used
SPI Plan: 7168 total in 3 blocks; 1952 free (0 chunks); 5216 used
SPI Plan: 3072 total in 2 blocks; 1064 free (0 chunks); 2008 used
SPI Plan: 3072 total in 2 blocks; 1208 free (0 chunks); 1864 used
SPI Plan: 3072 total in 2 blocks; 1208 free (0 chunks); 1864 used
SPI Plan: 3072 total in 2 blocks; 1664 free (0 chunks); 1408 used
PL/PgSQL function context: 24576 total in 2 blocks; 13080 free (8
chunks); 11496 used
SPI Plan: 7168 total in 3 blocks; 4056 free (0 chunks); 3112 used
SPI Plan: 3072 total in 2 blocks; 952 free (0 chunks); 2120 used
SPI Plan: 3072 total in 2 blocks; 936 free (0 chunks); 2136 used
SPI Plan: 3072 total in 2 blocks; 952 free (0 chunks); 2120 used
SPI Plan: 3072 total in 2 blocks; 952 free (0 chunks); 2120 used
SPI Plan: 3072 total in 2 blocks; 952 free (0 chunks); 2120 used
SPI Plan: 7168 total in 3 blocks; 3864 free (0 chunks); 3304 used
SPI Plan: 3072 total in 2 blocks; 936 free (0 chunks); 2136 used
PL/PgSQL function context: 8192 total in 1 blocks; 896 free (4 chunks);
7296 used
Type information cache: 24576 total in 2 blocks; 11888 free (5 chunks);
12688 used
SPI Plan: 3072 total in 2 blocks; 352 free (0 chunks); 2720 used
PL/PgSQL function context: 8192 total in 1 blocks; 3552 free (6 chunks);
4640 used
PLpgSQL function cache: 24224 total in 2 blocks; 3744 free (0 chunks);
20480 used
Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512
used
MessageContext: 1063832 total in 8 blocks; 19088 free (6 chunks);
1044744 used
CFuncHash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
smgr relation table: 24576 total in 2 blocks; 3584 free (4 chunks);
20992 used
Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
PortalMemory: 8192 total in 1 blocks; 8160 free (1 chunks); 32 used
Relcache by OID: 24576 total in 2 blocks; 8672 free (3 chunks); 15904 used
CacheMemoryContext: 4267376 total in 11 blocks; 1835032 free (1 chunks);
2432344 used
[139 user indexes skipped - the biggest item was: 187392 total in 25
blocks; 8024 free (0 chunks); 179368 used]
pg_index_indrelid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_constraint_conrelid_index: 1024 total in 1 blocks; 312 free (0
chunks); 712 used
pg_attrdef_adrelid_adnum_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_type_typname_nsp_index: 1024 total in 1 blocks; 216 free (0 chunks);
808 used
pg_type_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_statistic_relid_att_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_auth_members_member_role_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_auth_members_role_member_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_proc_proname_args_nsp_index: 1024 total in 1 blocks; 208 free (0
chunks); 816 used
pg_proc_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 16 free (0
chunks); 1008 used
pg_operator_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_opclass_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 208 free (0
chunks); 816 used
pg_namespace_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_namespace_nspname_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_language_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_language_name_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_inherits_relid_seqno_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_index_indexrelid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_authid_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_authid_rolname_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_database_datname_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_conversion_oid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_conversion_name_nsp_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_conversion_default_index: 1024 total in 1 blocks; 16 free (0 chunks);
1008 used
pg_class_relname_nsp_index: 1024 total in 1 blocks; 216 free (0 chunks);
808 used
pg_class_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_cast_source_target_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_amproc_opc_proc_index: 1024 total in 1 blocks; 208 free (0 chunks);
816 used
pg_amop_opr_opc_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used
pg_amop_opc_strat_index: 1024 total in 1 blocks; 208 free (0 chunks);
816 used
pg_aggregate_fnoid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
MdSmgr: 8192 total in 1 blocks; 2208 free (0 chunks); 5984 used
LockTable (locallock hash): 24576 total in 2 blocks; 15984 free (5
chunks); 8592 used
Timezones: 52560 total in 2 blocks; 3744 free (0 chunks); 48816 used
ErrorContext: 8192 total in 1 blocks; 8160 free (3 chunks); 32 used
2008-03-10 18:43:30 EDT ERROR: out of memory
2008-03-10 18:43:30 EDT DETAIL: Failed on request of size 16.
[A bunch more memory errors skipped]
2008-03-10 18:43:33 EDT LOG: could not fork new process for connection:
Cannot allocate memory
2008-03-10 18:43:33 EDT LOG: could not fork new process for connection:
Cannot allocate memory
2008-03-10 18:43:33 EDT LOG: could not fork new process for connection:
Cannot allocate memory
2008-03-10 18:43:33 EDT LOG: could not fork new process for connection:
Cannot allocate memory
[...]
2008-03-10 18:51:08 EDT LOG: archive command "..." failed: return code -1
[...]
2008-03-10 19:20:39 EDT ERROR: out of memory
2008-03-10 19:20:39 EDT DETAIL: Failed on request of size 32.
2008-03-10 19:20:39 EDT LOG: server process (PID 5609) was terminated
by signal 11
2008-03-10 19:20:39 EDT LOG: terminating any other active server processes
2008-03-10 19:20:39 EDT WARNING: terminating connection because of
crash of another server process
2008-03-10 19:20:39 EDT DETAIL: The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-03-10 19:20:39 EDT HINT: In a moment you should be able to
reconnect to the database and repeat your command.
I also saw the following lines in the kernel logs during this 24-hour
period:
Mar 10 19:20:39 servername kernel: postmaster[5609]: segfault at
0000007fbffedfe8 rip 000000000049ea35 rsp 0000007fbffedff0 error 6
Mar 10 20:44:15 servername kernel: postmaster[26808]: segfault at
0000007fbffeafe8 rip 000000000049ea00 rsp 0000007fbffeb018 error 6
Mar 11 10:10:59 servername kernel: postmaster[19984]: segfault at
fffffffffffffff0 rip 00000000005e7b13 rsp 0000007fbfffe178 error 4
Mar 11 11:52:07 servername kernel: postmaster[27309]: segfault at
0000007fbffedfe8 rip 000000000049ea00 rsp 0000007fbffee018 error 6
Mar 11 12:39:18 servername kernel: postmaster[30301]: segfault at
fffffffffffffff0 rip 00000000005e7b13 rsp 0000007fbfffe158 error 4
Each of these five segfaults corresponds to one of the times the database crashed.
vm.overcommit_memory was set to 0, but I do not see any OOM-killer
messages in the kernel logs, so I don't believe the OOM killer was invoked.
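For what it's worth, this is roughly how I went looking for OOM-killer
activity and double-checked the overcommit settings after the fact (a
sketch; the exact log file names on this CentOS box may differ):

sysctl vm.overcommit_memory                              # returned 0 (heuristic overcommit)
sysctl vm.overcommit_ratio                               # only matters when overcommit_memory = 2
grep -iE 'oom-killer|out of memory' /var/log/messages*   # nothing relevant turned up
grep -i commit /proc/meminfo                             # CommitLimit / Committed_AS snapshot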
Current output of free:
             total       used       free     shared    buffers     cached
Mem:      16479276   16441000      38276          0      26612    8633316
-/+ buffers/cache:    7781072    8698204
Swap:      2031608       1180    2030428
The client uses Zenoss to monitor the servers, and it always reports
free memory as quite low, which tells me this setup doesn't subtract the
memory used by the OS buffers/page cache, which is unfortunate. I did
see free memory climb to about 1GB around each crash, presumably because
PostgreSQL was killed and released its memory. Free swap stayed at about
2GB before, during, and after the crashes, and I did not see any
abnormal change in load or CPU utilization. Note that these measurements
were taken only once every other hour, so they're not very granular.
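In other words, the figure I actually care about is the "-/+
buffers/cache" free column. Something along these lines is what I'd want
the monitoring to graph instead (just a sketch using the standard
/proc/meminfo fields):

# memory that is genuinely available: MemFree + Buffers + Cached, in kB
awk '/^MemFree:|^Buffers:|^Cached:/ {sum += $2} END {print sum " kB"}' /proc/meminfo
# on the snapshot above: 38276 + 26612 + 8633316 = 8698204 kB, which matches
# the "-/+ buffers/cache" free column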
Any ideas as to what could have caused this?
Thank you,
Alex