Thread: Getting out ahead of OOM

Getting out ahead of OOM

From: Joseph Hammerman
Date: Mar 7, 2025, 2:07 PM
Hello admins,

We run Postgres in a Kubernetes environment, and we have not to date been able to convince our Compute team to create a class of Kubernetes hosts that have memory overcommit disabled.

Has anyone had success tracking all the Postgres memory allocation configurables and using that to administratively prevent OOMing?
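To make that concrete, the kind of bookkeeping I have in mind is roughly the arithmetic below (a sketch only; the values are placeholders, and since work_mem can be allocated several times per query this gives a soft estimate, not a hard bound):

# Rough worst-case estimate from the usual Postgres memory knobs.
# The numbers are placeholders; in practice they would come from
# pg_settings (SELECT name, setting, unit FROM pg_settings).
def worst_case_mb(s: dict) -> float:
    backends = s["max_connections"] + s["autovacuum_max_workers"]
    return (
        s["shared_buffers_mb"]
        + s["wal_buffers_mb"]
        + backends * s["work_mem_mb"]  # each sort/hash node may claim this much
        + s["autovacuum_max_workers"] * s["maintenance_work_mem_mb"]
    )

example = {
    "shared_buffers_mb": 8192,
    "wal_buffers_mb": 16,
    "work_mem_mb": 64,
    "maintenance_work_mem_mb": 512,
    "max_connections": 200,
    "autovacuum_max_workers": 3,
}
print(f"rough worst case: {worst_case_mb(example):.0f} MB")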

Alternatively, has anyone had success implementing an extension or periodic process to monitor the memory consumption of the Postgres children and killing them before the OOM event occurs?
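By a periodic reaper process I mean something along these lines (a minimal sketch assuming the psutil package; the 2 GiB ceiling is a made-up example, and note that RSS includes every shared_buffers page a backend has touched, so it overcounts):

# Sketch of the "watchdog" idea: scan Postgres backends and SIGTERM any
# that exceed a per-process ceiling.
import time
import psutil

RSS_LIMIT_BYTES = 2 * 1024**3  # made-up example threshold

def reap_fat_backends() -> None:
    for proc in psutil.process_iter(attrs=["name", "memory_info"]):
        try:
            if proc.info["name"] != "postgres":
                continue
            parent = proc.parent()
            # Skip the postmaster itself: its parent is not a postgres process.
            if parent is None or parent.name() != "postgres":
                continue
            if proc.info["memory_info"].rss > RSS_LIMIT_BYTES:
                proc.terminate()  # SIGTERM; Postgres terminates that backend
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue

if __name__ == "__main__":
    while True:
        reap_fat_backends()
        time.sleep(5)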

If there are adjacent ideas or approaches that I have not considered, please feel free to share those with me as well.

Thanks in advance for any assistance anyone can provide,
Joseph Hammerman

Re: Getting out ahead of OOM

From: Tom Lane
Date: Mar 7, 2025, 2:26 PM
Joseph Hammerman <joe.hammerman@datadoghq.com> writes:
> We run Postgres in a Kubernetes environment, and we have not to date been
> able to convince our Compute team to create a class of Kubernetes hosts
> that have memory overcommit disabled.

:-(

> Has anyone had success tracking all the Postgres memory allocation
> configurables and using that to administratively prevent OOMing?

I doubt anyone has tried that.  I would look into whether running
the postmaster under a suitable ulimit helps.  I seem to recall
discussions that in Linux, "ulimit -v" works better than the other
likely-looking options.  But that might be stale information.
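For reference, "ulimit -v" corresponds to RLIMIT_AS (the virtual address space limit), which is inherited by every backend the postmaster forks. A minimal wrapper sketch, with a made-up limit and a placeholder data directory:

import os
import resource

# "ulimit -v N" takes kilobytes; setrlimit takes bytes. 16 GiB here is an
# arbitrary example, not a recommendation, and the -D path is a placeholder.
limit_bytes = 16 * 1024**3
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

# rlimits survive exec and are inherited by forked backends.
os.execvp("postgres", ["postgres", "-D", "/var/lib/postgresql/data"])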

> Alternatively, has anyone had success implementing an extension or periodic
> process to monitor the memory consumption of the Postgres children and
> killing them before the OOM event occurs?

That's not going to be noticeably nicer than the kernel-induced
OOM, I think.  The one thing it might do for you is ensure that
the kill happens to a child process and not the postmaster; but
you can already use PG_OOM_ADJUST_VALUE and PG_OOM_ADJUST_FILE
to manage that if it's a problem.  (Recent kernels are alleged
to usually do the right thing without that, though.)
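For completeness, the pattern from the "Linux Memory Overcommit" section of the docs looks roughly like the launcher sketch below (the data directory is a placeholder, and lowering oom_score_adj requires root or CAP_SYS_RESOURCE):

import os

# Shield the postmaster from the OOM killer, but have each backend reset
# itself to a normal score so a runaway backend gets killed instead.
with open("/proc/self/oom_score_adj", "w") as f:
    f.write("-1000")

env = dict(
    os.environ,
    PG_OOM_ADJUST_FILE="/proc/self/oom_score_adj",
    PG_OOM_ADJUST_VALUE="0",   # value each child writes at startup
)
os.execvpe("postgres", ["postgres", "-D", "/var/lib/postgresql/data"], env)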

            regards, tom lane



Re: Getting out ahead of OOM

From: Rui DeSousa
Date:


On Mar 7, 2025, at 2:07 PM, Joseph Hammerman <joe.hammerman@datadoghq.com> wrote:

> Has anyone had success tracking all the Postgres memory allocation configurables and using that to administratively prevent OOMing?

Don't use memory limits in Kubernetes. We also run Postgres on dedicated Kubernetes clusters.

Shared memory gets counted multiple times: as each login session maps in the shared buffers, those pages get wrongly counted again as memory used by that session (it is shared memory!).

I have instances running on Kubernetes that only use 6GB of memory; however, Kubernetes wrongly reports 50GB used due to the number of active sessions. Our Postgres pods used to get terminated for "exceeding" the limit when they really weren't, until we removed the memory limits in Kubernetes.
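One way to see the double counting is to compare RSS with PSS for a single backend; PSS divides shared pages among the processes that map them, so it is the fairer number. A rough sketch (Linux only, assumes /proc is readable):

import re
import sys

# Print RSS (counts touched shared_buffers pages in full for every backend)
# versus PSS (shares those pages out) for one backend PID given on argv.
def rollup(pid: int) -> dict:
    stats = {}
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            m = re.match(r"(\w+):\s+(\d+) kB", line)
            if m:
                stats[m.group(1)] = int(m.group(2))
    return stats

s = rollup(int(sys.argv[1]))
shared = s.get("Shared_Clean", 0) + s.get("Shared_Dirty", 0)
print(f"Rss {s['Rss']} kB, Pss {s['Pss']} kB, shared {shared} kB")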




Re: Getting out ahead of OOM

From: Joe Conway
Date: Mar 9, 2025, 1:37 PM
On 3/7/25 14:26, Tom Lane wrote:
> Joseph Hammerman <joe.hammerman@datadoghq.com> writes:
>> We run Postgres in a Kubernetes environment, and we have not to date been
>> able to convince our Compute team to create a class of Kubernetes hosts
>> that have memory overcommit disabled.
> 
> :-(
> 
>> Has anyone had success tracking all the Postgres memory allocation
>> configurables and using that to administratively prevent OOMing?
> 
> I doubt anyone has tried that.  I would look into whether running
> the postmaster under a suitable ulimit helps.  I seem to recall
> discussions that in Linux, "ulimit -v" works better than the other
> likely-looking options.  But that might be stale information.

Problem with ulimit is that it is per process, but within a Kubernetes 
pod the memory accounting is for all the pod's processes.

>> Alternatively, has anyone had success implementing an extension or periodic
>> process to monitor the memory consumption of the Postgres children and
>> killing them before the OOM event occurs?
> 
> That's not going to be noticeably nicer than the kernel-induced
> OOM, I think.  The one thing it might do for you is ensure that
> the kill happens to a child process and not the postmaster; but
> you can already use PG_OOM_ADJUST_VALUE and PG_OOM_ADJUST_FILE
> to manage that if it's a problem.  (Recent kernels are alleged
> to usually do the right thing without that, though.)

Actually the problem here is likely that the Kubernetes Postgres pod was 
started with a memory limit. Disabling memory overcommit at the host 
level will not help you if there is a memory limit set for the pod 
because that in turn sets memory.limit for the cgroup related to the pod 
and the oom killer will strike when memory.usage_in_bytes exceeds that 
value irrespective of the free memory at the host level. In these cases 
the oom_score_adj values don't end up mattering much.
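For anyone who wants to watch how close the pod cgroup is to that limit from inside the container, something like this is a starting point (a sketch only; the path assumes the conventional cgroup-v1 mount and can differ by container runtime):

# cgroup-v1 memory accounting for the current cgroup. If no limit is set,
# memory.limit_in_bytes reads back as an enormous sentinel value.
CG = "/sys/fs/cgroup/memory"

def read_int(name: str) -> int:
    with open(f"{CG}/{name}") as f:
        return int(f.read().strip())

limit = read_int("memory.limit_in_bytes")
usage = read_int("memory.usage_in_bytes")
print(f"usage {usage >> 20} MiB / limit {limit >> 20} MiB ({100 * usage / limit:.1f}%)")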

This is a fairly complex topic -- I wrote a blog a few years ago which 
may or may not be out of date at this point:

https://www.crunchydata.com/blog/deep-postgresql-thoughts-the-linux-assassin

Additionally Jeremy Schneider wrote a more recent one that you might 
find helpful:

https://ardentperf.com/2024/09/22/kubernetes-requests-and-limits-for-postgres/

My quick and dirty recommendations:
1. Use cgroup v2 on the host if at all possible
2. Do not under any circumstances disable swap on the host. This is an
    anti-pattern that was unfortunately still followed widely the last
    time I looked.
3. If nothing else, avoid setting a memory.limit on the cgroup. That
    will at least get you back to not getting whacked unless there is
    host level memory pressure. The blogs discuss how to do that with
    Kube pod settings; a sketch of what that looks like in a pod spec
    follows below.
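As a concrete illustration of point 3, here is a sketch using the official kubernetes Python client (names, image, and sizes are placeholders); note that setting a memory request without a limit makes the pod Burstable rather than Guaranteed QoS:

from kubernetes import client

# Memory request but deliberately no memory limit, so kube never sets a
# hard memory cap (memory.limit_in_bytes / memory.max) on the pod cgroup.
resources = client.V1ResourceRequirements(
    requests={"cpu": "4", "memory": "16Gi"},   # used for scheduling
    # limits intentionally omitted
)

container = client.V1Container(
    name="postgres",
    image="postgres:17",
    resources=resources,
)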

HTH,

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Getting out ahead of OOM

From: Joseph Hammerman
Date: Mar 12, 2025, 6:21 PM
Joe, Tom, thanks for your detailed and interesting responses. Tom, thank you for all of your contributions to Postgres and to the F/OSS community!

Joe, can you expand on your recommendation to use cgroup-v2? We're trying to collect our complete rationale for our request to our internal team that is tasked with rolling out this configuration change.

Thanks in advance,
Joe


Re: Getting out ahead of OOM

From: Joe Conway
Date:
On 3/12/25 18:21, Joseph Hammerman wrote:
> Joe, can you expand on your recommendation to use cgroup-v2? We're 
> trying to collect our complete rationale for our request to our internal 
> team that is tasked with rolling out this configuration change.

cgroup-v2 has a much better measure of memory pressure (see PSI[1]), 
better ability to reclaim memory pages[4], safer delegation, and other 
advantages. When I last looked the kube support for it was still brand 
new, but it appears to be well supported now [2][3]. In particular, this 
statement from [3] is important:

   Memory QoS uses memory.high to throttle workload approaching its
   memory limit, ensuring that the system is not overwhelmed by
   instantaneous memory allocation.

With cgroup-v1 a kube memory limit would set memory.limit and usage of 
the pod (sum across all processes in the pod cgroup) was tracked with 
memory.usage_in_bytes. Whenever the latter exceeds the former, the OOM 
killer will whack the process in the pod cgroup with the highest 
oom_score, irrespective of how much free memory may be available at the 
host level.

With cgroup-v2 it appears that kube uses memory.high[5], which is more 
of a throttle/soft limit. In cgroup-v2 there is also a new memory.max[6] 
which is essentially the same as what memory.limit was in v1. Exceeding 
memory.max would invoke the OOM killer, but since kubernetes limits the 
pod memory with memory.high, the OOM killer should be avoided.
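If you want to verify which knobs kube actually set for a given pod, you can read the v2 interface files for the current cgroup; a small sketch (assumes a pure unified hierarchy mounted at /sys/fs/cgroup, which is the common case but can differ by runtime):

from pathlib import Path

# Resolve this process's cgroup and print its memory knobs; inside a pod
# the relative path is often just "/" because of cgroup namespaces.
def cgv2(name: str) -> str:
    rel = Path("/proc/self/cgroup").read_text().strip().split("::", 1)[1]
    return Path(f"/sys/fs/cgroup{rel}/{name}").read_text().strip()

for knob in ("memory.current", "memory.high", "memory.max"):
    print(knob, "=", cgv2(knob))   # "max" means the knob is effectively unset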

Note that I cannot claim a bunch of hands on experience with this 
(cgroup-v2 with kubernetes), so please do your own testing and YMMV, etc.

[1] https://docs.kernel.org/accounting/psi.html#psi
[2] https://kubernetes.io/docs/concepts/architecture/cgroups/
[3] 
https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#memory-qos-with-cgroup-v2
[4] 
https://docs.kernel.org/admin-guide/cgroup-v2.html#:~:text=memory.reclaim
[5] https://docs.kernel.org/admin-guide/cgroup-v2.html#:~:text=memory.high
[6] https://docs.kernel.org/admin-guide/cgroup-v2.html#:~:text=memory.max


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com