Re: Properly handle OOM death? - Mailing list pgsql-general

From Joe Conway
Subject Re: Properly handle OOM death?
Date
Msg-id 522bcfcb-05d5-9f54-947e-ac4c7a8ba1da@joeconway.com
In response to Properly handle OOM death?  (Israel Brewster <ijbrewster@alaska.edu>)
Responses Re: Properly handle OOM death?  (Israel Brewster <ijbrewster@alaska.edu>)
List pgsql-general
On 3/13/23 13:21, Israel Brewster wrote:
> I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit 
> more memory constrained than I would like, such that every week or so 
> the various processes running on the machine will align badly and the 
> OOM killer will kick in, killing off postgresql, as per the following 
> journalctl output:
> 
> Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A 
> process of this unit has been killed by the OOM killer.
> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed 
> with result 'oom-kill'.
> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: 
> Consumed 5d 17h 48min 24.509s CPU time.
> 
> And the service is no longer running.
> 
> When this happens, I go in and restart the postgresql service, and 
> everything is happy again for the next week or two.
> 
> Obviously this is not a good situation. Which leads to two questions:
> 
> 1) is there some tweaking I can do in the postgresql config itself to 
> prevent the situation from occurring in the first place?
> 2) My first thought was to simply have systemd restart postgresql 
> whenever it is killed like this, which is easy enough. Then I looked at 
> the default unit file, and found these lines:
> 
> # prevent OOM killer from choosing the postmaster (individual backends will
> # reset the score to 0)
> OOMScoreAdjust=-900
> # restarting automatically will prevent "pg_ctlcluster ... stop" from working,
> # so we disable it here. Also, the postmaster will restart by itself on most
> # problems anyway, so it is questionable if one wants to enable external
> # automatic restarts.
> #Restart=on-failure
> 
> Which seems to imply that the OOM killer should only be killing off 
> individual backends, not the entire cluster to begin with - which should 
> be fine. And also that adding the restart=on-failure option is probably 
> not the greatest idea. Which makes me wonder what is really going on?

First, are you running with a cgroup memory limit set (e.g. in a container)?
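
To answer that question concretely, here is a minimal Python sketch (not taken from the PostgreSQL docs) that looks for a memory limit on the cgroup a process belongs to, including its parent cgroups. It assumes cgroups are mounted at /sys/fs/cgroup and handles both the v1 file memory.limit_in_bytes and the v2 file memory.max. The function names are purely illustrative.

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # usual mount point; an assumption


def parse_limit(raw):
    """Return the limit in bytes, or None when the value means 'unlimited'."""
    value = raw.strip()
    if value == "max":  # cgroup v2 spelling of "no limit"
        return None
    limit = int(value)
    # cgroup v1 reports "no limit" as a huge number close to 2**63.
    return None if limit >= 2**60 else limit


def memory_cgroup(proc_cgroup_text):
    """Parse /proc/<pid>/cgroup text into (cgroup_version, path), or None."""
    v2_path = None
    for line in proc_cgroup_text.splitlines():
        if line.count(":") < 2:
            continue
        hierarchy, controllers, path = line.split(":", 2)
        if "memory" in controllers.split(","):
            return 1, path  # v1 memory controller takes precedence
        if hierarchy == "0" and controllers == "":
            v2_path = path  # unified (v2) hierarchy entry, "0::<path>"
    return (2, v2_path) if v2_path is not None else None


def memory_limits(pid="self", root=CGROUP_ROOT):
    """List (cgroup_dir, limit_bytes) for every limit on the cgroup or its parents."""
    found = memory_cgroup(Path(f"/proc/{pid}/cgroup").read_text())
    if found is None:
        return []
    version, rel = found
    base = root / "memory" if version == 1 else root
    name = "memory.limit_in_bytes" if version == 1 else "memory.max"
    limits = []
    cgroup = base / rel.lstrip("/")
    while True:
        limit_file = cgroup / name
        if limit_file.exists():
            limit = parse_limit(limit_file.read_text())
            if limit is not None:
                limits.append((str(cgroup), limit))
        if cgroup == base or cgroup == cgroup.parent:
            break
        cgroup = cgroup.parent  # a limit on any ancestor also applies
    return limits
```

Calling memory_limits() with the postmaster's PID and getting an empty list back would suggest that no cgroup memory limit is in play, in which case the advice below applies.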

Assuming no, see:

https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

That will tell you:
1/ Turn off memory overcommit (vm.overcommit_memory=2): "Although this 
setting will not prevent the OOM killer from being invoked altogether, 
it will lower the chances significantly and will therefore lead to more 
robust system behavior."

2/ Set /proc/self/oom_score_adj to -1000 rather than -900 
(OOMScoreAdjust=-1000 in the unit file). The value -1000 is important 
because it is a "magic" value that prevents the process from being 
selected by the OOM killer at all (see: 
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/oom.h#L6), 
whereas -900 just makes it less likely.
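
As a quick way to verify what value a running postmaster actually has, the following Python sketch reads /proc/<pid>/oom_score_adj and classifies it. The two constants mirror OOM_SCORE_ADJ_MIN and OOM_SCORE_ADJ_MAX from the oom.h header linked above. The function names and the wording of the labels are only an illustration.

```python
from pathlib import Path

# Values of OOM_SCORE_ADJ_MIN / OOM_SCORE_ADJ_MAX in include/uapi/linux/oom.h
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def read_oom_score_adj(pid="self"):
    """Return the current oom_score_adj of a process as an int."""
    return int(Path(f"/proc/{pid}/oom_score_adj").read_text())


def describe_oom_score_adj(value):
    """Classify an oom_score_adj value (labels are illustrative)."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj out of range: {value}")
    if value == OOM_SCORE_ADJ_MIN:
        return "exempt"        # never selected by the OOM killer
    if value < 0:
        return "less likely"   # still a candidate, just a lower score
    if value == 0:
        return "default"
    return "more likely"
```

With the stock unit file, describe_oom_score_adj(read_oom_score_adj(postmaster_pid)) would be expected to report "less likely" for -900, and "exempt" once OOMScoreAdjust=-1000 is in effect. Here postmaster_pid is a placeholder for the real PID.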

All that said, even if only an individual backend gets killed, the 
postmaster will still terminate the remaining backends and go into crash 
recovery. So while technically postgres does not restart, the effect is 
much the same. So see #1 above as your best protection.
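
To check where a machine stands on #1, something like the following Python sketch can help. It reads vm.overcommit_memory from /proc and estimates the commit limit the kernel would enforce in mode 2, using the formula from the kernel documentation (RAM minus huge pages, times vm.overcommit_ratio percent, plus swap). The function names are illustrative, and the estimate ignores vm.overcommit_kbytes, which overrides the ratio when it is set.

```python
from pathlib import Path

# Meaning of /proc/sys/vm/overcommit_memory per the kernel documentation
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting (overcommit off)",
}


def read_overcommit_mode():
    """Return the current vm.overcommit_memory setting as an int."""
    return int(Path("/proc/sys/vm/overcommit_memory").read_text())


def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio=50, hugetlb_kb=0):
    """Estimate CommitLimit in kB for vm.overcommit_memory=2.

    CommitLimit = (RAM - huge pages) * overcommit_ratio / 100 + swap
    """
    return (mem_total_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_total_kb
```

For example, commit_limit_kb(8 * 1024**2, 2 * 1024**2) gives 6291456 kB, i.e. 6 GiB on a machine with 8 GiB of RAM, 2 GiB of swap and the default ratio of 50. The mode itself is switched with "sysctl -w vm.overcommit_memory=2". On a VM with little swap, vm.overcommit_ratio may need raising so the limit does not end up far below physical RAM.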

HTH,

Joe

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



