Thread: [GENERAL] Postgresql out-of-memory kill

[GENERAL] Postgresql out-of-memory kill

From
Israel Brewster
Date:
So just a bit ago I ran into a bit of excitement when the kernel decided to kill one of my postmaster processes due to an out-of-memory issue, which would have been fine, except that the problem was then compounded by Pacemaker attempting to restart postgresql, but only managing to get as far as stopping the primary and failing to promote the secondary, leaving me with nothing. Not fun for a mission-critical database, but luckily I was notified of the issue nearly immediately, and was able to get everything back up and running quickly (after a few moments of panic).

In any case the root problem here was the out-of-memory issue. The logs show this:

Feb  1 11:58:34 fai-dbs1 kernel: Out of memory: Kill process 26316 (postmaster) score 837 or sacrifice child
Feb  1 11:58:34 fai-dbs1 kernel: Killed process 26316, UID 26, (postmaster) total-vm:72328760kB, anon-rss:55470760kB, file-rss:4753180kB

So a single postmaster process was using over 72GB of ram. Obviously I have something in my config tuned too high. The question is "what"? The machine has 64 GB of RAM, and I want postgresql to be able to use as much of that as it wants (since the machine is dedicated to postgresql), but obviously it needs to be more limited than what I have.

From my config:

max_connections = 300
shared_buffers = 14GB  # Roughly 1/4 of 64GB, as per https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
work_mem = 75MB

Not sure what else in there would play a role. I was seeing some of this in the postgresql logs shortly before the issue arose:

2017-02-01 11:58:02.074 AKST > LOG:  checkpoints are occurring too frequently (12 seconds apart)
2017-02-01 11:58:02.074 AKST > HINT:  Consider increasing the configuration parameter "max_wal_size".

I was thinking perhaps the work_mem was the issue, but if my understanding and calculations are correct, that would mean I had more than 800 simultaneous sorts/queries going on at the time, which seems quite improbable, given the relatively light load on this database. At the moment, for example, a query on pg_stat_activity reveals only 11 entries.

What am I missing here?
-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------



Attachment

Re: [GENERAL] Postgresql out-of-memory kill

From
Tom Lane
Date:
Israel Brewster <israel@ravnalaska.net> writes:
> So just a bit ago I ran into a bit of excitement when the kernel decided
> to kill one of my postmaster processes due to an out-of-memory issue,

Fun :-(

> So a single postmaster process was using over 72GB of ram.

No, the kernel was blaming it for 72GB, which is an entirely different
statement.  The Linux OOM killer makes some assumptions that are
ludicrously wrong for Postgres: not only does it blame a parent process
for the total memory consumption of all its children, but if the children
share a large shared memory segment, *it counts the shared memory segment
over again for each child*.  At least this was true last I looked;
perhaps very recent kernels are a bit less insane about shared memory.
In any case, the core problem is blaming the parent process for the
sins of a child.

Now the PG postmaster itself consumes very little memory, and this is
quite unlikely to suddenly go wrong because it doesn't do very much.
A child backend process might go crazy, but what you want to happen then
is for the OOM killer to kill the child process not the postmaster.
That will still result in a database crash/restart scenario, but as long
as the postmaster is alive everything should recover automatically.

Your problem, then, is that the OOM killer is egregiously and with malice
aforethought killing the wrong process.

The usual fix for this is to configure things so that the postmaster is
excluded from OOM kill but its children aren't.  See
https://www.postgresql.org/docs/current/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
(but be sure to consult the page for your PG version, as we've changed
the support mechanism for that in the past.)

If you're using a vendor-supplied packaging of PG and it doesn't have some
easy way to turn on this behavior, complain to the vendor ...

            regards, tom lane


Re: [GENERAL] Postgresql out-of-memory kill

From
Israel Brewster
Date:
On Feb 1, 2017, at 1:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Israel Brewster <israel@ravnalaska.net> writes:
So just a bit ago I ran into a bit of excitement when the kernel decided
to kill one of my postmaster processes due to an out-of-memory issue,

Fun :-(

So a single postmaster process was using over 72GB of ram.

No, the kernel was blaming it for 72GB, which is an entirely different
statement.  The Linux OOM killer makes some assumptions that are
ludicrously wrong for Postgres: not only does it blame a parent process
for the total memory consumption of all its children, but if the children
share a large shared memory segment, *it counts the shared memory segment
over again for each child*.  At least this was true last I looked;
perhaps very recent kernels are a bit less insane about shared memory.
In any case, the core problem is blaming the parent process for the
sins of a child.

Now the PG postmaster itself consumes very little memory, and this is
quite unlikely to suddenly go wrong because it doesn't do very much.
A child backend process might go crazy, but what you want to happen then
is for the OOM killer to kill the child process not the postmaster.
That will still result in a database crash/restart scenario, but as long
as the postmaster is alive everything should recover automatically.

Your problem, then, is that the OOM killer is egregiously and with malice
aforethought killing the wrong process.

The usual fix for this is to configure things so that the postmaster is
excluded from OOM kill but its children aren't.  See
https://www.postgresql.org/docs/current/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
(but be sure to consult the page for your PG version, as we've changed
the support mechanism for that in the past.)

If you're using a vendor-supplied packaging of PG and it doesn't have some
easy way to turn on this behavior, complain to the vendor ...

regards, tom lane


Thanks for the explanation. This is a CentOS 6 box, kernel  2.6.32-642.11.1.el6.x86_64, running the PostgreSQL supplied Postgres 9.6.1, so hopefully the information on that page applies. I'll mess around with modifying the init.d script to exclude the postmaster process. Thanks again!

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general