Thread: Properly handle OOM death?

Properly handle OOM death?

From
Israel Brewster
Date:
I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit more memory constrained than I would like, such that every week or so the various processes running on the machine will align badly and the OOM killer will kick in, killing off postgresql, as per the following journalctl output:

Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A process of this unit has been killed by the OOM killer.
Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed with result 'oom-kill'.
Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Consumed 5d 17h 48min 24.509s CPU time.

And the service is no longer running.

When this happens, I go in and restart the postgresql service, and everything is happy again for the next week or two.

Obviously this is not a good situation. Which leads to two questions:

1) is there some tweaking I can do in the postgresql config itself to prevent the situation from occurring in the first place?
2) My first thought was to simply have systemd restart postgresql whenever it is killed like this, which is easy enough. Then I looked at the default unit file, and found these lines:

# prevent OOM killer from choosing the postmaster (individual backends will
# reset the score to 0)
OOMScoreAdjust=-900
# restarting automatically will prevent "pg_ctlcluster ... stop" from working,
# so we disable it here. Also, the postmaster will restart by itself on most
# problems anyway, so it is questionable if one wants to enable external
# automatic restarts.
#Restart=on-failure

Which seems to imply that the OOM killer should only be killing off individual backends, not the entire cluster to begin with - which should be fine. And also that adding the restart=on-failure option is probably not the greatest idea. Which makes me wonder what is really going on?

Thanks.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

Re: Properly handle OOM death?

From
Adrian Klaver
Date:
On 3/13/23 10:21 AM, Israel Brewster wrote:
> I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit 
> more memory constrained than I would like, such that every week or so 
> the various processes running on the machine will align badly and the 
> OOM killer will kick in, killing off postgresql, as per the following 
> journalctl output:
> 
> Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A 
> process of this unit has been killed by the OOM killer.
> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed 
> with result 'oom-kill'.
> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: 
> Consumed 5d 17h 48min 24.509s CPU time.
> 
> And the service is no longer running.
> 
> When this happens, I go in and restart the postgresql service, and 
> everything is happy again for the next week or two.
> 
> Obviously this is not a good situation. Which leads to two questions:
> 
> 1) is there some tweaking I can do in the postgresql config itself to 
> prevent the situation from occurring in the first place?
> 2) My first thought was to simply have systemd restart postgresql 
> whenever it is killed like this, which is easy enough. Then I looked at 
> the default unit file, and found these lines:
> 
> # prevent OOM killer from choosing the postmaster (individual backends will
> # reset the score to 0)
> OOMScoreAdjust=-900
> # restarting automatically will prevent "pg_ctlcluster ... stop" from 
> working,
> # so we disable it here. Also, the postmaster will restart by itself on most
> # problems anyway, so it is questionable if one wants to enable external
> # automatic restarts.
> #Restart=on-failure
> 
> Which seems to imply that the OOM killer should only be killing off 
> individual backends, not the entire cluster to begin with - which should 
> be fine. And also that adding the restart=on-failure option is probably 
> not the greatest idea. Which makes me wonder what is really going on?

You might want to read:

https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

> 
> Thanks.
> 
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Properly handle OOM death?

From
Israel Brewster
Date:
On Mar 13, 2023, at 9:28 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 3/13/23 10:21 AM, Israel Brewster wrote:
I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit more memory constrained than I would like, such that every week or so the various processes running on the machine will align badly and the OOM killer will kick in, killing off postgresql, as per the following journalctl output:
Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A process of this unit has been killed by the OOM killer.
Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed with result 'oom-kill'.
Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Consumed 5d 17h 48min 24.509s CPU time.
And the service is no longer running.
When this happens, I go in and restart the postgresql service, and everything is happy again for the next week or two.
Obviously this is not a good situation. Which leads to two questions:
1) is there some tweaking I can do in the postgresql config itself to prevent the situation from occurring in the first place?
2) My first thought was to simply have systemd restart postgresql whenever it is killed like this, which is easy enough. Then I looked at the default unit file, and found these lines:
# prevent OOM killer from choosing the postmaster (individual backends will
# reset the score to 0)
OOMScoreAdjust=-900
# restarting automatically will prevent "pg_ctlcluster ... stop" from working,
# so we disable it here. Also, the postmaster will restart by itself on most
# problems anyway, so it is questionable if one wants to enable external
# automatic restarts.
#Restart=on-failure
Which seems to imply that the OOM killer should only be killing off individual backends, not the entire cluster to begin with - which should be fine. And also that adding the restart=on-failure option is probably not the greatest idea. Which makes me wonder what is really going on?

You might want to read:

https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

Good information, thanks. One thing there confuses me though. It says:

Another approach, which can be used with or without altering vm.overcommit_memory, is to set the process-specific OOM score adjustment value for the postmaster process to -1000, thereby guaranteeing it will not be targeted by the OOM killer

Isn’t that exactly what the “OOMScoreAdjust=-900” line in the unit file does though (except with a score of -900 rather than -1000)?

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


Re: Properly handle OOM death?

From
Joe Conway
Date:
On 3/13/23 13:21, Israel Brewster wrote:
> I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit 
> more memory constrained than I would like, such that every week or so 
> the various processes running on the machine will align badly and the 
> OOM killer will kick in, killing off postgresql, as per the following 
> journalctl output:
> 
> Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A 
> process of this unit has been killed by the OOM killer.
> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed 
> with result 'oom-kill'.
> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: 
> Consumed 5d 17h 48min 24.509s CPU time.
> 
> And the service is no longer running.
> 
> When this happens, I go in and restart the postgresql service, and 
> everything is happy again for the next week or two.
> 
> Obviously this is not a good situation. Which leads to two questions:
> 
> 1) is there some tweaking I can do in the postgresql config itself to 
> prevent the situation from occurring in the first place?
> 2) My first thought was to simply have systemd restart postgresql 
> whenever it is killed like this, which is easy enough. Then I looked at 
> the default unit file, and found these lines:
> 
> # prevent OOM killer from choosing the postmaster (individual backends will
> # reset the score to 0)
> OOMScoreAdjust=-900
> # restarting automatically will prevent "pg_ctlcluster ... stop" from 
> working,
> # so we disable it here. Also, the postmaster will restart by itself on most
> # problems anyway, so it is questionable if one wants to enable external
> # automatic restarts.
> #Restart=on-failure
> 
> Which seems to imply that the OOM killer should only be killing off 
> individual backends, not the entire cluster to begin with - which should 
> be fine. And also that adding the restart=on-failure option is probably 
> not the greatest idea. Which makes me wonder what is really going on?

First, are you running with a cgroup memory.limit set (e.g. in a container)?

Assuming no, see:

https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

That will tell you:
1/ Turn off memory overcommit: "Although this setting will not prevent 
the OOM killer from being invoked altogether, it will lower the chances 
significantly and will therefore lead to more robust system behavior."

2/ set /proc/self/oom_score_adj to -1000 rather than -900 
(OOMScoreAdjust=-1000): the value -1000 is important as it is a "magic" 
value which prevents the process from being selected by the OOM killer 
(see: 
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/oom.h#L6) 
whereas -900 just makes it less likely.
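
For example, both of the above could look something like this -- a sketch, 
not tested on your system, and the sysctl.d / drop-in file names are 
arbitrary choices:

8<-------------------
# 1/ strict overcommit accounting (ratio is illustrative; tune it for
#    your RAM/swap mix), persisted across reboots
cat > /etc/sysctl.d/90-overcommit.conf <<EOF
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
EOF
sysctl --system

# 2/ raise the postmaster's protection to the "never kill" value via a
#    drop-in, so the packaged unit file stays untouched
mkdir -p /etc/systemd/system/postgresql@13-main.service.d
cat > /etc/systemd/system/postgresql@13-main.service.d/override.conf <<EOF
[Service]
OOMScoreAdjust=-1000
EOF
systemctl daemon-reload
systemctl restart postgresql@13-main
8<-------------------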

All that said, even if the individual backend gets killed, the 
postmaster will still go into crash recovery. So while technically 
postgres does not restart, the effect is much the same. So see #1 above 
as your best protection.

HTH,

Joe

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Properly handle OOM death?

From
Israel Brewster
Date:
On Mar 13, 2023, at 9:36 AM, Joe Conway <mail@joeconway.com> wrote:

On 3/13/23 13:21, Israel Brewster wrote:
I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit more memory constrained than I would like, such that every week or so the various processes running on the machine will align badly and the OOM killer will kick in, killing off postgresql, as per the following journalctl output:
Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A process of this unit has been killed by the OOM killer.
Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed with result 'oom-kill'.
Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Consumed 5d 17h 48min 24.509s CPU time.
And the service is no longer running.
When this happens, I go in and restart the postgresql service, and everything is happy again for the next week or two.
Obviously this is not a good situation. Which leads to two questions:
1) is there some tweaking I can do in the postgresql config itself to prevent the situation from occurring in the first place?
2) My first thought was to simply have systemd restart postgresql whenever it is killed like this, which is easy enough. Then I looked at the default unit file, and found these lines:
# prevent OOM killer from choosing the postmaster (individual backends will
# reset the score to 0)
OOMScoreAdjust=-900
# restarting automatically will prevent "pg_ctlcluster ... stop" from working,
# so we disable it here. Also, the postmaster will restart by itself on most
# problems anyway, so it is questionable if one wants to enable external
# automatic restarts.
#Restart=on-failure
Which seems to imply that the OOM killer should only be killing off individual backends, not the entire cluster to begin with - which should be fine. And also that adding the restart=on-failure option is probably not the greatest idea. Which makes me wonder what is really going on?

First, are you running with a cgroup memory.limit set (e.g. in a container)?

Not sure, actually. I *think* I had set it up as a full VM though, not a container. I’ll have to double-check that.

Assuming no, see:

https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

That will tell you:
1/ Turn off memory overcommit: "Although this setting will not prevent the OOM killer from being invoked altogether, it will lower the chances significantly and will therefore lead to more robust system behavior."

2/ set /proc/self/oom_score_adj to -1000 rather than -900 (OOMScoreAdjust=-1000): the value -1000 is important as it is a "magic" value which prevents the process from being selected by the OOM killer (see: https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/oom.h#L6) whereas -900 just makes it less likely.

..and that answers the question I just sent about the above linked page 😄 Thanks!


All that said, even if the individual backend gets killed, the postmaster will still go into crash recovery. So while technically postgres does not restart, the effect is much the same. So see #1 above as your best protection.

Interesting. Makes sense though. Thanks!


---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


HTH,

Joe

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Properly handle OOM death?

From
"Peter J. Holzer"
Date:
On 2023-03-13 09:21:18 -0800, Israel Brewster wrote:
> I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit more
> memory constrained than I would like, such that every week or so the various
> processes running on the machine will align badly and the OOM killer will kick
> in, killing off postgresql, as per the following journalctl output:
>
> Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A process of
> this unit has been killed by the OOM killer.
> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed with
> result 'oom-kill'.
> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Consumed 5d
> 17h 48min 24.509s CPU time.
>
> And the service is no longer running.

I might be misreading this, but it looks to me that systemd detects that
*some* process in the group was killed by the oom killer and stops the
service.

Can you check which process was actually killed? If it's not the
postmaster, setting OOMScoreAdjust is probably useless.

(I tried searching the web for the error messages and didn't find
anything useful)
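
Something like this should narrow it down (the data directory path is
the Debian/Ubuntu default -- adjust if yours differs):

# the kernel's view of the kill, including the victim's pid and name
journalctl -k | grep -E 'Out of memory|oom-kill|Killed process'
# the current postmaster pid, for comparison before the next kill
head -1 /var/lib/postgresql/13/main/postmaster.pid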


> 2) My first thought was to simply have systemd restart postgresql whenever it
> is killed like this, which is easy enough. Then I looked at the default unit
> file, and found these lines:
>
> # prevent OOM killer from choosing the postmaster (individual backends will
> # reset the score to 0)
> OOMScoreAdjust=-900
> # restarting automatically will prevent "pg_ctlcluster ... stop" from working,
> # so we disable it here.

I never call pg_ctlcluster directly, so that probably wouldn't be a good
reason for me.

> Also, the postmaster will restart by itself on most
> # problems anyway, so it is questionable if one wants to enable external
> # automatic restarts.
> #Restart=on-failure

So I'd try this despite the comment.
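
If you do, a drop-in keeps the packaged unit file intact -- something
like this (RestartSec is just a guess at a sensible value):

# systemctl edit postgresql@13-main
[Service]
Restart=on-failure
RestartSec=5s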

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: Properly handle OOM death?

From
Israel Brewster
Date:
> On Mar 13, 2023, at 9:43 AM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
>
> On 2023-03-13 09:21:18 -0800, Israel Brewster wrote:
>> I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit more
>> memory constrained than I would like, such that every week or so the various
>> processes running on the machine will align badly and the OOM killer will kick
>> in, killing off postgresql, as per the following journalctl output:
>>
>> Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A process of
>> this unit has been killed by the OOM killer.
>> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed with
>> result 'oom-kill'.
>> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Consumed 5d
>> 17h 48min 24.509s CPU time.
>>
>> And the service is no longer running.
>
> I might be misreading this, but it looks to me that systemd detects that
> *some* process in the group was killed by the oom killer and stops the
> service.
>
> Can you check which process was actually killed? If it's not the
> postmaster, setting OOMScoreAdjust is probably useless.
>
> (I tried searching the web for the error messages and didn't find
> anything useful)

Your guess is as good as (if not better than) mine. I can find the PID of the killed process in the system log, but
without knowing what the PID of postmaster and the child processes were prior to the kill, I’m not sure that helps much.
Though for what it’s worth, I do note the following about all the kill logs:

1) They reference a “Memory cgroup out of memory”, which refers back to the opening comment on Joe Conway’s message -
this would imply to me that I *AM* running with a cgroup memory.limit set. Not sure how that changes things?
2) All the entries contain the line “oom_score_adj:0”, which would seem to imply that the postmaster, with its -900
score, is not being directly targeted by the OOM killer.

>
>> 2) My first thought was to simply have systemd restart postgresql whenever it
>> is killed like this, which is easy enough. Then I looked at the default unit
>> file, and found these lines:
>>
>> # prevent OOM killer from choosing the postmaster (individual backends will
>> # reset the score to 0)
>> OOMScoreAdjust=-900
>> # restarting automatically will prevent "pg_ctlcluster ... stop" from working,
>> # so we disable it here.
>
> I never call pg_ctlcluster directly, so that probably wouldn't be a good
> reason for me.

Valid point, unless something under-the-hood needs to call it?

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

>
>> Also, the postmaster will restart by itself on most
>> # problems anyway, so it is questionable if one wants to enable external
>> # automatic restarts.
>> #Restart=on-failure
>
> So I'd try this despite the comment.
>
>        hp
>
> --
>   _  | Peter J. Holzer    | Story must make more sense than reality.
> |_|_) |                    |
> | |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
> __/   | http://www.hjp.at/ |       challenge!"




Re: Properly handle OOM death?

From
Joe Conway
Date:
On 3/13/23 13:55, Israel Brewster wrote:
> 1) They reference a “Memory cgroup out of memory”, which refers back
> to the opening comment on Joe Conway’s message - this would imply to
> me that I *AM* running with a cgroup memory.limit set. Not sure how
> that changes things?

cgroup memory limit is enforced regardless of the actual host level 
memory pressure. As an example, if your host VM has 128 GB of memory, 
but your cgroup memory limit is 512MB, you will get an OOM kill when the 
sum memory usage of all of your postgres processes (and anything else 
sharing the same cgroup) exceeds 512 MB, even if the host VM has nothing 
else going on consuming memory.

You can check if a memory limit is set by reading the corresponding 
virtual file, e.g.:

8<-------------------
# cat /sys/fs/cgroup/memory/system.slice/postgresql.service/memory.limit_in_bytes
9223372036854710272
8<-------------------

A few notes:
1/ The specific path to memory.limit_in_bytes might vary, but this 
example is the default for the RHEL 8 postgresql 10 RPM.

2/ The value above, 9223372036854710272 basically means "no limit" has 
been set.

3/ The example assumes cgroup v1. There are very few distro's that 
enable cgroup v2 by default, and generally I have not seen much cgroup 
v2 usage in the wild (although I strongly recommend it), but if you are 
using cgroup v2 the names have changed. You can check by doing:

8<--cgroupv2 enabled-----------------
# stat -fc %T /sys/fs/cgroup/
cgroup2fs
8<--cgroupv1 enabled-----------------
# stat -fc %T /sys/fs/cgroup/
tmpfs
8<-------------------

> 2) All the entries contain the line "oom_score_adj:0”, which would
> seem to imply that the postmaster, with its -900 score is not being
> directly targeted by the OOM killer.

Sounds correct

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Properly handle OOM death?

From
Israel Brewster
Date:
> On Mar 13, 2023, at 10:37 AM, Joe Conway <mail@joeconway.com> wrote:
>
> On 3/13/23 13:55, Israel Brewster wrote:
>> 1) They reference a “Memory cgroup out of memory”, which refers back
>> to the opening comment on Joe Conway’s message - this would imply to
>> me that I *AM* running with a cgroup memory.limit set. Not sure how
>> that changes things?
>
> cgroup memory limit is enforced regardless of the actual host level memory pressure. As an example, if your host VM
> has 128 GB of memory, but your cgroup memory limit is 512MB, you will get an OOM kill when the sum memory usage of all
> of your postgres processes (and anything else sharing the same cgroup) exceeds 512 MB, even if the host VM has nothing
> else going on consuming memory.
>
> You can check if a memory is set by reading the corresponding virtual file, e.g:
>
> 8<-------------------
> # cat /sys/fs/cgroup/memory/system.slice/postgresql.service/memory.limit_in_bytes
> 9223372036854710272
> 8<-------------------
>
> A few notes:
> 1/ The specific path to memory.limit_in_bytes might vary, but this example is the default for the RHEL 8 postgresql
> 10 RPM.

Not finding that file specifically (this is probably too much info, but…):

root@novarupta:~# ls /sys/fs/cgroup/system.slice/
 -.mount  cgroup.threads  dev-hugepages.mount  memory.events.local  memory.swap.events  proc-diskstats.mount  ssh.service  system-postgresql.slice  systemd-resolved.service
 accounts-daemon.service  cgroup.type  dev-lxc-console.mount  memory.high  memory.swap.high  proc-loadavg.mount  sys-devices-system-cpu-online.mount  systemd-initctl.socket  systemd-sysctl.service
 cgroup.controllers  console-getty.service  dev-lxc-tty1.mount  memory.low  memory.swap.max  proc-meminfo.mount  sys-devices-virtual-net.mount  systemd-journal-flush.service  systemd-sysusers.service
 cgroup.events  console-setup.service  dev-lxc-tty2.mount  memory.max  networkd-dispatcher.service  proc-stat.mount  sys-fs-fuse-connections.mount  systemd-journald-audit.socket  systemd-tmpfiles-setup-dev.service
 cgroup.freeze  cpu.pressure  dev-mqueue.mount  memory.min  pids.current  proc-swaps.mount  sys-kernel-debug.mount  systemd-journald-dev-log.socket  systemd-tmpfiles-setup.service
 cgroup.max.depth  cpu.stat  dev-ptmx.mount  memory.numa_stat  pids.events  proc-sys-kernel-random-boot_id.mount  syslog.socket  systemd-journald.service  systemd-update-utmp.service
 cgroup.max.descendants  cron.service  io.pressure  memory.oom.group  pids.max  proc-sys-net.mount  sysstat.service  systemd-journald.socket  systemd-user-sessions.service
 cgroup.procs  data.mount  keyboard-setup.service  memory.pressure  pool.mount  'proc-sysrq\x2dtrigger.mount'  'system-container\x2dgetty.slice'  systemd-logind.service  ufw.service
 cgroup.stat  dbus.service  memory.current  memory.stat  postfix.service  proc-uptime.mount  system-modprobe.slice  systemd-networkd.service  uuidd.socket
 cgroup.subtree_control  dbus.socket  memory.events  memory.swap.current  proc-cpuinfo.mount  rsyslog.service  system-postfix.slice  systemd-remount-fs.service

root@novarupta:~# ls /sys/fs/cgroup/system.slice/system-postgresql.slice/
cgroup.controllers  cgroup.max.depth  cgroup.stat  cgroup.type  io.pressure  memory.events.local  memory.max  memory.oom.group  memory.swap.current  memory.swap.max  pids.max
cgroup.events  cgroup.max.descendants  cgroup.subtree_control  cpu.pressure  memory.current  memory.high  memory.min  memory.pressure  memory.swap.events  pids.current  postgresql@13-main.service
cgroup.freeze  cgroup.procs  cgroup.threads  cpu.stat  memory.events  memory.low  memory.numa_stat  memory.stat  memory.swap.high  pids.events

root@novarupta:~# ls /sys/fs/cgroup/system.slice/system-postgresql.slice/postgresql@13-main.service/
cgroup.controllers  cgroup.max.depth  cgroup.stat  cgroup.type  io.pressure  memory.events.local  memory.max  memory.oom.group  memory.swap.current  memory.swap.max  pids.max
cgroup.events  cgroup.max.descendants  cgroup.subtree_control  cpu.pressure  memory.current  memory.high  memory.min  memory.pressure  memory.swap.events  pids.current
cgroup.freeze  cgroup.procs  cgroup.threads  cpu.stat  memory.events  memory.low  memory.numa_stat  memory.stat  memory.swap.high  pids.events

>
> 2/ The value above, 9223372036854710272 basically means "no limit" has been set.
>
> 3/ The example assumes cgroup v1. There are very few distro's that enable cgroup v2 by default, and generally I have
> not seen much cgroup v2 usage in the wild (although I strongly recommend it), but if you are using cgroup v2 the names
> have changed. You can check by doing:
>
> 8<--cgroupv2 enabled-----------------
> # stat -fc %T /sys/fs/cgroup/
> cgroup2fs
> 8<--cgroupv1 enabled-----------------
> # stat -fc %T /sys/fs/cgroup/
> tmpfs
> 8<-------------------

Looks like V2:

root@novarupta:~# stat -fc %T /sys/fs/cgroup/
cgroup2fs
root@novarupta:~# lsb_release -a
No LSB modules are available.
Distributor ID:    Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:    20.04
Codename:    focal

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

>
>> 2) All the entries contain the line "oom_score_adj:0”, which would
>> seem to imply that the postmaster, with its -900 score is not being
>> directly targeted by the OOM killer.
>
> Sounds correct
>
> --
> Joe Conway
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>




Re: Properly handle OOM death?

From
Joe Conway
Date:
On 3/13/23 14:50, Israel Brewster wrote:
> Looks like V2:
> 
> root@novarupta:~# stat -fc %T /sys/fs/cgroup/
> cgroup2fs

Interesting -- it does indeed look like you are using cgroup v2

So the file you want to look at in that case is:
8<-----------
cat /sys/fs/cgroup/system.slice/system-postgresql.slice/postgresql@14.service/memory.max
4294967296

cat /sys/fs/cgroup/system.slice/system-postgresql.slice/postgresql@14.service/memory.high
3221225472
8<-----------
If the value comes back as "max" it means no limit is set.

In this example (on my Linux Mint machine with a custom systemd unit 
file) I have memory.max set to 4G and memory.high set to 3G.

The value of memory.max determines when the OOM killer will strike. The 
value of memory.high will determine when the kernel goes into aggressive 
memory reclaim (trying to avoid memory.max and thus an OOM kill).

The corresponding/relevant systemd unit file parameters are:
8<-----------
MemoryAccounting=yes
MemoryHigh=3G
MemoryMax=4G
8<-----------

There are other ways that memory.max may get set, but it seems most 
likely that the systemd unit file is doing it (if it is in fact set).
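
You can also ask systemd directly what (if anything) it is applying, e.g.:
8<-----------
systemctl show -p MemoryAccounting,MemoryHigh,MemoryMax postgresql@13-main.service
8<-----------
Expect "infinity" for MemoryHigh/MemoryMax if no limit is configured.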

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Properly handle OOM death?

From
Israel Brewster
Date:
On Mar 13, 2023, at 11:10 AM, Joe Conway <mail@joeconway.com> wrote:
>
> On 3/13/23 14:50, Israel Brewster wrote:
>> Looks like V2:
>> root@novarupta:~# stat -fc %T /sys/fs/cgroup/
>> cgroup2fs
>
> Interesting -- it does indeed look like you are using cgroup v2
>
> So the file you want to look at in that case is:
> 8<-----------
> cat /sys/fs/cgroup/system.slice/system-postgresql.slice/postgresql@14.service/memory.max
> 4294967296
>
> cat /sys/fs/cgroup/system.slice/system-postgresql.slice/postgresql@14.service/memory.high
> 3221225472
> 8<-----------
> If the value comes back as "max" it means no limit is set.

This does, in fact, appear to be the case here:

root@novarupta:~# cat /sys/fs/cgroup/system.slice/system-postgresql.slice/postgresql@13-main.service/memory.max
max
root@novarupta:~# cat /sys/fs/cgroup/system.slice/system-postgresql.slice/postgresql@13-main.service/memory.high
max
root@novarupta:~#

which would presumably indicate that it’s a system level limit being exceeded, rather than a postgresql specific one?
The syslog specifically says “Memory cgroup out of memory”, if that means something (this is my first exposure to
cgroups, if you couldn’t tell).
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


>
> In this example (on my Linux Mint machine with a custom systemd unit file) I have memory.max set to 4G and
> memory.high set to 3G.
>
> The value of memory.max determines when the OOM killer will strike. The value of memory.high will determine when the
> kernel goes into aggressive memory reclaim (trying to avoid memory.max and thus an OOM kill).
>
> The corresponding/relevant systemd unit file parameters are:
> 8<-----------
> MemoryAccounting=yes
> MemoryHigh=3G
> MemoryMax=4G
> 8<-----------
>
> There are other ways that memory.max may get set, but it seems most likely that the systemd unit file is doing it (if
> it is in fact set).
>
> --
> Joe Conway
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>




Re: Properly handle OOM death?

From
Jeffrey Walton
Date:
On Mon, Mar 13, 2023 at 1:21 PM Israel Brewster <ijbrewster@alaska.edu> wrote:
>
> I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit more memory constrained than I would like,
> such that every week or so the various processes running on the machine will align badly and the OOM killer will kick
> in, killing off postgresql, as per the following journalctl output:
>
> Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A process of this unit has been killed by the OOM killer.
> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed with result 'oom-kill'.
> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Consumed 5d 17h 48min 24.509s CPU time.
>
> And the service is no longer running.
>
> When this happens, I go in and restart the postgresql service, and everything is happy again for the next week or two.
>
> Obviously this is not a good situation. Which leads to two questions:
>
> 1) is there some tweaking I can do in the postgresql config itself to prevent the situation from occurring in the
> first place?
> 2) My first thought was to simply have systemd restart postgresql whenever it is killed like this, which is easy
> enough. Then I looked at the default unit file, and found these lines:
>
> # prevent OOM killer from choosing the postmaster (individual backends will
> # reset the score to 0)
> OOMScoreAdjust=-900
> # restarting automatically will prevent "pg_ctlcluster ... stop" from working,
> # so we disable it here. Also, the postmaster will restart by itself on most
> # problems anyway, so it is questionable if one wants to enable external
> # automatic restarts.
> #Restart=on-failure
>
> Which seems to imply that the OOM killer should only be killing off individual backends, not the entire cluster to
> begin with - which should be fine. And also that adding the restart=on-failure option is probably not the greatest idea.
> Which makes me wonder what is really going on?
>

Related, we (a FOSS project) used to have a Linux server with a LAMP
stack on GoDaddy. The machine provided a website and wiki. It was very
low-end. I think it had 512MB or 1 GB RAM and no swap file. And no way
to enable a swap file (part of an upsell). We paid about $2 a month
for it.

MySQL was killed several times a week. It corrupted the database on a
regular basis. We had to run the database repair tools daily. We
eventually switched to Ionos for hosting. We got a VM with more memory
and a swap file for about $5 a month. No more OOM kills.

If possible, you might want to add more memory (or a swap file) to the
machine. It will help sidestep the OOM problem.

You can also add vm.overcommit_memory = 2 to stop Linux from
oversubscribing memory. The machine will act like a Solaris box rather
than a Linux box (which takes some getting used to). Also see
https://serverfault.com/questions/606185/how-does-vm-overcommit-memory-work
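
If you do go that route, you can see the effect of the strict
accounting in /proc/meminfo:

# CommitLimit is roughly swap + overcommit_ratio% of RAM; with
# vm.overcommit_memory=2, allocations fail once Committed_AS reaches it
sysctl vm.overcommit_memory vm.overcommit_ratio
grep -E 'CommitLimit|Committed_AS' /proc/meminfo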

Jeff



Re: Properly handle OOM death?

From
Joe Conway
Date:
On 3/13/23 15:18, Israel Brewster wrote:
> root@novarupta:~# cat /sys/fs/cgroup/system.slice/system-postgresql.slice/postgresql@13-main.service/memory.max
> max
> root@novarupta:~# cat /sys/fs/cgroup/system.slice/system-postgresql.slice/postgresql@13-main.service/memory.high
> max
> root@novarupta:~#
> 
> which would presumably indicate that it’s a system level limit being
> exceeded, rather than a postgresql specific one?

Yep

> The syslog specifically says "Memory cgroup out of memory”, if that means
> something (this is my first exposure to cgroups, if you couldn’t
> tell).

I am not entirely sure, but without actually testing it I suspect that 
since memory.max = high (that is, the limit is whatever the host has 
available) the OOM kill is technically a cgroup OOM kill even though it 
is effectively a host level memory pressure event.

Did you try setting "vm.overcommit_memory=2"?

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Properly handle OOM death?

From
"Peter J. Holzer"
Date:
On 2023-03-13 09:55:50 -0800, Israel Brewster wrote:
> On Mar 13, 2023, at 9:43 AM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> > On 2023-03-13 09:21:18 -0800, Israel Brewster wrote:
> >> I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit more
> >> memory constrained than I would like, such that every week or so the various
> >> processes running on the machine will align badly and the OOM killer will kick
> >> in, killing off postgresql, as per the following journalctl output:
> >>
> >> Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A process of
> >> this unit has been killed by the OOM killer.
> >> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed with
> >> result 'oom-kill'.
> >> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Consumed 5d
> >> 17h 48min 24.509s CPU time.
> >>
> >> And the service is no longer running.
> >
> > I might be misreading this, but it looks to me that systemd detects that
> > *some* process in the group was killed by the oom killer and stops the
> > service.
> >
> > Can you check which process was actually killed? If it's not the
> > postmaster, setting OOMScoreAdjust is probably useless.
> >
> > (I tried searching the web for the error messages and didn't find
> > anything useful)
>
> Your guess is as good as (if not better than) mine. I can find the PID
> of the killed process in the system log, but without knowing what the
> PID of postmaster and the child processes were prior to the kill, I’m
> not sure that helps much.

The syslog should contain a list of all tasks prior to the kill. For
example, I just provoked an OOM kill on my laptop and the syslog
contains (among lots of others) these lines:

Mar 13 21:00:36 trintignant kernel: [112024.084117] [   2721]   126  2721    54563     2042   163840      555          -900 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084123] [   2873]   126  2873    18211       85   114688      594             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084128] [   2941]   126  2941    54592     1231   147456      565             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084134] [   2942]   126  2942    54563      535   143360      550             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084139] [   2943]   126  2943    54563     1243   139264      548             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084145] [   2944]   126  2944    54798      561   147456      545             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084150] [   2945]   126  2945    54563      215   131072      551             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084156] [   2956]   126  2956    18718      506   122880      553             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084161] [   2957]   126  2957    54672      269   139264      546             0 postgres

That's less helpful than it could be since all the postgres processes
are just listed as "postgres" without arguments. However, it is very
likely that the first one is actually the postmaster, because it has the
lowest pid (and the other pids follow closely) and it has an OOM score
of -900 as set in the systemd service file.

So I could compare the PID of the killed process with this list (in my
case the killed process wasn't one of them but a test program which just
allocates lots of memory).

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: Properly handle OOM death?

From
Israel Brewster
Date:
> On Mar 13, 2023, at 11:42 AM, Joe Conway <mail@joeconway.com> wrote:
>
> On 3/13/23 15:18, Israel Brewster wrote:
>> The syslog specifically says "Memory cgroup out of memory”, if that means
>> something (this is my first exposure to cgroups, if you couldn’t
>> tell).
>
> I am not entirely sure, but without actually testing it I suspect that since memory.max = high (that is, the limit is
> whatever the host has available) the OOM kill is technically a cgroup OOM kill even though it is effectively a host
> level memory pressure event.

That would make sense.

>
> Did you try setting "vm.overcommit_memory=2"?

Yeah:

root@novarupta:~# sysctl -w vm.overcommit_memory=2
sysctl: setting key "vm.overcommit_memory", ignoring: Read-only file system

I’m thinking I wound up with a container rather than a full VM after all - and as such, the best solution may be to
migrate to a full VM with some swap space available to avoid the issue in the first place. I’ll have to get in touch
with the sys admin for that though.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

>
> --
> Joe Conway
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>




Re: Properly handle OOM death?

From
Joe Conway
Date:
On 3/13/23 16:18, Israel Brewster wrote:
>> On Mar 13, 2023, at 11:42 AM, Joe Conway <mail@joeconway.com> wrote:
>> I am not entirely sure, but without actually testing it I suspect
>> that since memory.max = high (that is, the limit is whatever the
>> host has available) the OOM kill is technically a cgroup OOM kill
>> even though it is effectively a host level memory pressure event.

Sorry, actually meant "memory.max = max" here


>> Did you try setting "vm.overcommit_memory=2"?

> root@novarupta:~# sysctl -w vm.overcommit_memory=2
> sysctl: setting key "vm.overcommit_memory", ignoring: Read-only file system

> I’m thinking I wound up with a container rather than a full VM after
> all - and as such, the best solution may be to migrate to a full VM
> with some swap space available to avoid the issue in the first place.
> I’ll have to get in touch with the sys admin for that though.

Hmm, well big +1 for having swap turned on, but I recommend setting 
"vm.overcommit_memory=2" even so.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Properly handle OOM death?

From
Israel Brewster
Date:


On Mar 13, 2023, at 12:16 PM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:

On 2023-03-13 09:55:50 -0800, Israel Brewster wrote:
On Mar 13, 2023, at 9:43 AM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
The syslog should contain a list of all tasks prior to the kill. For
example, I just provoked an OOM kill on my laptop and the syslog
contains (among lots of others) these lines:

Mar 13 21:00:36 trintignant kernel: [112024.084117] [   2721]   126  2721    54563     2042   163840      555          -900 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084123] [   2873]   126  2873    18211       85   114688      594             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084128] [   2941]   126  2941    54592     1231   147456      565             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084134] [   2942]   126  2942    54563      535   143360      550             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084139] [   2943]   126  2943    54563     1243   139264      548             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084145] [   2944]   126  2944    54798      561   147456      545             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084150] [   2945]   126  2945    54563      215   131072      551             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084156] [   2956]   126  2956    18718      506   122880      553             0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084161] [   2957]   126  2957    54672      269   139264      546             0 postgres

That's less helpful than it could be since all the postgres processes
are just listed as "postgres" without arguments. However, it is very
likely that the first one is actually the postmaster, because it has the
lowest pid (and the other pids follow closely) and it has an OOM score
of -900 as set in the systemd service file.

So I could compare the PID of the killed process with this list (in my
case the killed process wasn't one of them but a test program which just
allocates lots of memory).

Oh, interesting. I had just grepped for ‘Killed process’, so I didn’t see those preceding lines 😛 Looking at that, I see two things:
1) The entries in my syslog all refer to an R process, not a postgresql process at all
2) The ‘Killed process’ entry *does* actually have the process name in it - it’s just that since the process name was “R”, I wasn’t making the connection 😄
 

       hp

-- 
  _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Re: Properly handle OOM death?

From
Israel Brewster
Date:
> On Mar 13, 2023, at 12:25 PM, Joe Conway <mail@joeconway.com> wrote:
>
> On 3/13/23 16:18, Israel Brewster wrote:
>>> Did you try setting "vm.overcommit_memory=2"?
>
>> root@novarupta:~# sysctl -w vm.overcommit_memory=2
>> sysctl: setting key "vm.overcommit_memory", ignoring: Read-only file system
>
>> I’m thinking I wound up with a container rather than a full VM after
>> all - and as such, the best solution may be to migrate to a full VM
>> with some swap space available to avoid the issue in the first place.
>> I’ll have to get in touch with the sys admin for that though.
>
> Hmm, well big +1 for having swap turned on, but I recommend setting "vm.overcommit_memory=2" even so.

Makes sense. Presumably with a full VM I won’t get the “Read-only file system” error when trying to do so.

Thanks!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145
>
> --
> Joe Conway
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>




Re: Properly handle OOM death?

From
Tomas Pospisek
Date:
On 13.03.23 21:25, Joe Conway wrote:

> Hmm, well big +1 for having swap turned on, but I recommend setting 
> "vm.overcommit_memory=2" even so.

I've snipped out the context here, since my advice is very unspecific: 
do use swap only as a safety net. Once your system starts swapping 
performance goes down the toilet.
*t




Re: Properly handle OOM death?

From
Jeffrey Walton
Date:
On Sat, Mar 18, 2023 at 6:02 PM Tomas Pospisek <tpo2@sourcepole.ch> wrote:
>
> On 13.03.23 21:25, Joe Conway wrote:
>
> > Hmm, well big +1 for having swap turned on, but I recommend setting
> > "vm.overcommit_memory=2" even so.
>
> I've snipped out the context here, since my advice is very unspecific:
> do use swap only as a safety net. Once your system starts swapping
> performance goes down the toilet.

To use swap as a safety net, set swappiness to a low value, like 2.
Two will keep most data in RAM and reduce (but not eliminate) spilling
to the file system.

I have a bunch of old ARM dev boards that are resource constrained.
They use SDcards, which have a limited lifetime based on writes. I
give the boards a 1 GB swap file to avoid OOM kills when running the
compiler on C++ programs. And I configure them with a swappiness of 2
to reduce swapping.
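
For the archives, the usual setup is something like the following
(illustrative values; the swap file part only works on a real VM, not
in an unprivileged container):

# keep swappiness low, persisted across reboots
echo 'vm.swappiness = 2' > /etc/sysctl.d/90-swappiness.conf
sysctl --system

# add a 1 GB swap file
fallocate -l 1G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab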

Jeff



Re: Properly handle OOM death?

From
Joe Conway
Date:
On 3/18/23 18:02, Tomas Pospisek wrote:
> On 13.03.23 21:25, Joe Conway wrote:
> 
>> Hmm, well big +1 for having swap turned on, but I recommend setting 
>> "vm.overcommit_memory=2" even so.
> 
> I've snipped out the context here, since my advice is very unspecific:
> do use swap only as a safety net. Once your system starts swapping
> performance goes down the toilet.


While I agree with this statement in principle, it is exactly the notion 
that "once your system starts swapping performance goes down the toilet" 
that leads people to conclude that having lots of memory and disabling 
swap will solve all their problems.

Because of how the Linux kernel works, you should, IMHO, always have 
some swap available. For more on why, see:

https://chrisdown.name/2018/01/02/in-defence-of-swap.html

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Properly handle OOM death?

From
Justin Pryzby
Date:
On Mon, Mar 13, 2023 at 06:43:01PM +0100, Peter J. Holzer wrote:
> On 2023-03-13 09:21:18 -0800, Israel Brewster wrote:
> > I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit more
> > memory constrained than I would like, such that every week or so the various
> > processes running on the machine will align badly and the OOM killer will kick
> > in, killing off postgresql, as per the following journalctl output:
> > 
> > Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A process of this unit has been killed by the OOM killer.
> > Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed with result 'oom-kill'.
> > Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Consumed 5d 17h 48min 24.509s CPU time.
> > 
> > And the service is no longer running.
> 
> I might be misreading this, but it looks to me that systemd detects that
> *some* process in the group was killed by the oom killer and stops the
> service.

Yeah.

I found this old message on Google.  I'm surprised there aren't more
similar complaints about this.  It's as Peter said: it (sometimes)
causes systemd to actively *stop* the cluster after OOM, when it
would've come back online on its own if the init (supervisor) process
hadn't interfered.

My solution was to set, in /usr/lib/systemd/system/postgresql@.service:
OOMPolicy=continue

I suggest that the default unit files should do likewise.
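
A drop-in works too, if you would rather not touch the packaged file,
e.g.:

# systemctl edit postgresql@13-main
[Service]
OOMPolicy=continue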

-- 
Justin