Thread: Getting out ahead of OOM
Hello admins,
We run Postgres in a Kubernetes environment, and we have not to date been able to convince our Compute team to create a class of Kubernetes hosts that have memory overcommit disabled.
Has anyone had success tracking all the Postgres memory allocation configurables and using that to administratively prevent OOMing?
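For concreteness, the kind of accounting I have in mind is a worst-case budget computed from the server's own GUCs, something like the sketch below (the multipliers are naive; hash joins, parallel workers, temp buffers, etc. can push a backend well past one work_mem):

    # A sketch only: pessimistic memory budget from the main GUCs, via psql.
    import subprocess

    def guc(name):
        # Value normalized to bytes when the setting has a size unit,
        # raw value otherwise (e.g. max_connections has no unit).
        sql = ("SELECT setting::bigint * CASE unit"
               " WHEN '8kB' THEN 8192 WHEN 'kB' THEN 1024"
               " WHEN 'MB' THEN 1048576 ELSE 1 END"
               f" FROM pg_settings WHERE name = '{name}'")
        return int(subprocess.check_output(["psql", "-Atc", sql]))

    budget = (guc("shared_buffers")
              + guc("wal_buffers")
              + guc("maintenance_work_mem") * guc("autovacuum_max_workers")
              + guc("work_mem") * guc("max_connections"))
    print(f"pessimistic budget: {budget / 2**30:.1f} GiB")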
Alternatively, has anyone had success implementing an extension or periodic process to monitor the memory consumption of the Postgres children and kill them before the OOM event occurs?
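What I'm picturing is a sidecar roughly like the sketch below; the ceiling is invented, and a real version would have to spare the postmaster and discount shared memory from the RSS figure:

    # A sketch only: SIGTERM oversized postgres backends before the kernel
    # OOM killer fires.
    import os, signal, time

    CEILING = 48 * 2**30  # hypothetical per-backend RSS ceiling, in bytes
    PAGE = os.sysconf("SC_PAGE_SIZE")

    def rss(pid):
        with open(f"/proc/{pid}/statm") as f:
            return int(f.read().split()[1]) * PAGE  # field 2 = resident pages

    def postgres_pids():
        for p in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{p}/comm") as f:
                    if f.read().strip() == "postgres":
                        yield int(p)
            except OSError:
                pass  # raced with a process exit

    while True:
        for pid in postgres_pids():
            try:
                if rss(pid) > CEILING:
                    os.kill(pid, signal.SIGTERM)  # backend exits cleanly,
                                                  # like pg_terminate_backend()
            except OSError:
                pass  # process already gone
        time.sleep(5)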
If there are adjacent ideas or approaches that I have not considered, please feel free to share those with me as well.
Thanks in advance for any assistance anyone can provide,
Joseph Hammerman
Joseph Hammerman <joe.hammerman@datadoghq.com> writes:
> We run Postgres in a Kubernetes environment, and we have not to date been
> able to convince our Compute team to create a class of Kubernetes hosts
> that have memory overcommit disabled.

:-(

> Has anyone had success tracking all the Postgres memory allocation
> configurables and using that to administratively prevent OOMing?

I doubt anyone has tried that.  I would look into whether running
the postmaster under a suitable ulimit helps.  I seem to recall
discussions that in Linux, "ulimit -v" works better than the other
likely-looking options.  But that might be stale information.

> Alternatively, has anyone had success implementing an extension or periodic
> process to monitor the memory consumption of the Postgres children and
> kill them before the OOM event occurs?

That's not going to be noticeably nicer than the kernel-induced
OOM, I think.  The one thing it might do for you is ensure that
the kill happens to a child process and not the postmaster; but
you can already use PG_OOM_ADJUST_VALUE and PG_OOM_ADJUST_FILE
to manage that if it's a problem.  (Recent kernels are alleged
to usually do the right thing without that, though.)

            regards, tom lane
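For reference, "ulimit -v" corresponds to RLIMIT_AS, so a wrapper along these lines is one way to try it. A sketch only: the 8 GiB cap and the data directory are placeholders, and it assumes the service script has already lowered the postmaster's own oom_score_adj:

    import os, resource

    cap = 8 * 2**30
    resource.setrlimit(resource.RLIMIT_AS, (cap, cap))  # survives the exec below

    env = dict(os.environ,
               # documented PG knobs: children write this value into this file
               # at startup, undoing the postmaster's OOM protection for
               # themselves so a backend gets killed before the postmaster
               PG_OOM_ADJUST_FILE="/proc/self/oom_score_adj",
               PG_OOM_ADJUST_VALUE="0")
    os.execvpe("postgres", ["postgres", "-D", "/var/lib/postgresql/data"], env)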
On Mar 7, 2025, at 2:07 PM, Joseph Hammerman <joe.hammerman@datadoghq.com> wrote:
> Has anyone had success tracking all the Postgres memory allocation configurables and using that to administratively prevent OOMing?
Don't use memory limits in Kubernetes. We also run Postgres on dedicated Kubernetes clusters.

Shared memory will get counted multiple times: each login session, as it maps in the shared buffers, is wrongly charged for that memory as if it were private (it is shared memory!).

I have instances running on Kubernetes that only use 6GB of memory; however, Kubernetes was wrongly reporting 50GB used due to the number of active sessions. Our Postgres pods used to get terminated for "exceeding" the limit when they were not actually exceeding it, until we removed the use of memory limits in Kubernetes.
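You can see the over-counting from /proc: summing Rss across the backends charges the shared buffers once per session, while Pss apportions each shared page among the processes that map it. A rough sketch (assumes /proc/<pid>/smaps_rollup, i.e. kernel 4.14 or newer, and permission to read it):

    import os

    def rollup_kb(pid, field):
        # smaps_rollup pre-sums the per-mapping values; entries are in kB
        with open(f"/proc/{pid}/smaps_rollup") as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(line.split()[1])
        return 0

    def postgres_pids():
        for p in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{p}/comm") as f:
                    if f.read().strip() == "postgres":
                        yield int(p)
            except OSError:
                pass  # raced with a process exit

    pids = list(postgres_pids())
    rss = sum(rollup_kb(p, "Rss") for p in pids)
    pss = sum(rollup_kb(p, "Pss") for p in pids)
    print(f"Rss sum: {rss // 1024} MiB  (naive; shared buffers counted per backend)")
    print(f"Pss sum: {pss // 1024} MiB  (shared pages apportioned; closer to reality)")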
On 3/7/25 14:26, Tom Lane wrote:
> Joseph Hammerman <joe.hammerman@datadoghq.com> writes:
>> We run Postgres in a Kubernetes environment, and we have not to date been
>> able to convince our Compute team to create a class of Kubernetes hosts
>> that have memory overcommit disabled.
>
> :-(
>
>> Has anyone had success tracking all the Postgres memory allocation
>> configurables and using that to administratively prevent OOMing?
>
> I doubt anyone has tried that.  I would look into whether running
> the postmaster under a suitable ulimit helps.  I seem to recall
> discussions that in Linux, "ulimit -v" works better than the other
> likely-looking options.  But that might be stale information.

Problem with ulimit is that it is per process, but within a Kubernetes
pod the memory accounting is for all the pod's processes.

>> Alternatively, has anyone had success implementing an extension or periodic
>> process to monitor the memory consumption of the Postgres children and
>> kill them before the OOM event occurs?
>
> That's not going to be noticeably nicer than the kernel-induced
> OOM, I think.  The one thing it might do for you is ensure that
> the kill happens to a child process and not the postmaster; but
> you can already use PG_OOM_ADJUST_VALUE and PG_OOM_ADJUST_FILE
> to manage that if it's a problem.  (Recent kernels are alleged
> to usually do the right thing without that, though.)

Actually the problem here is likely that the Kubernetes Postgres pod was
started with a memory limit. Disabling memory overcommit at the host
level will not help you if there is a memory limit set for the pod,
because that in turn sets memory.limit for the cgroup related to the pod,
and the OOM killer will strike when memory.usage_in_bytes exceeds that
value irrespective of the free memory at the host level. In these cases
the oom_score_adj values don't end up mattering much.

This is a fairly complex topic -- I wrote a blog a few years ago which
may or may not be out of date at this point:

https://www.crunchydata.com/blog/deep-postgresql-thoughts-the-linux-assassin

Additionally Jeremy Schneider wrote a more recent one that you might
find helpful:

https://ardentperf.com/2024/09/22/kubernetes-requests-and-limits-for-postgres/

My quick and dirty recommendations:

1. Use cgroup v2 on the host if at all possible.
2. Do not under any circumstances disable swap on the host. This is an
   anti-pattern unfortunately followed widely the last time I looked.
3. If nothing else, avoid setting a memory.limit on the cgroup. That
   will at least get you back to not getting whacked unless there is
   host-level memory pressure. The blogs discuss how to do that with
   Kube pod settings.

HTH,

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
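A quick way to check from inside the pod whether such a limit is in force: both cgroup versions expose it as a file. A sketch, assuming the standard mount points ("max" means unlimited on v2):

    from pathlib import Path

    V2 = Path("/sys/fs/cgroup/memory.max")                    # cgroup v2 hard limit
    V1 = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")  # cgroup v1 limit

    if V2.exists():
        raw = V2.read_text().strip()
        print("v2 memory.max:",
              "unlimited" if raw == "max" else f"{int(raw) / 2**30:.1f} GiB")
    elif V1.exists():
        limit = int(V1.read_text())
        # v1 defaults to an enormous sentinel value when no limit is set
        print("v1 memory.limit_in_bytes:",
              "effectively unlimited" if limit > 2**60 else f"{limit / 2**30:.1f} GiB")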
Joe, Tom, thanks for your detailed and interesting responses. Tom, thank you for all of your contributions to Postgres and to the F/OSS community!
Joe, can you expand on your recommendation to use cgroup-v2? We're trying to collect our complete rationale for our request to our internal team that is tasked with rolling out this configuration change.
Thanks in advance,
Joe
On 3/12/25 18:21, Joseph Hammerman wrote:
> Joe, can you expand on your recommendation to use cgroup-v2? We're
> trying to collect our complete rationale for our request to our internal
> team that is tasked with rolling out this configuration change.

cgroup-v2 has a much better measure of memory pressure (see PSI[1]),
better ability to reclaim memory pages[4], safer delegation, and other
advantages. When I last looked the kube support for it was still brand
new, but it appears to be well supported now [2][3].

In particular, this statement from [3] is important:

    Memory QoS uses memory.high to throttle workload approaching its
    memory limit, ensuring that the system is not overwhelmed by
    instantaneous memory allocation.

With cgroup-v1 a kube memory limit would set memory.limit, and usage of
the pod (the sum across all processes in the pod cgroup) was tracked
with memory.usage_in_bytes. Whenever the latter exceeds the former, the
OOM killer will whack the process in the pod cgroup with the highest
oom_score, irrespective of how much free memory may be available at the
host level.

With cgroup-v2 it appears that kube uses memory.high[5], which is more
of a throttle/soft limit. In cgroup-v2 there is also a new memory.max[6]
which is essentially the same as what memory.limit was in v1. Exceeding
memory.max would invoke the OOM killer, but since kubernetes limits the
pod memory with memory.high, the OOM killer should be avoided.

Note that I cannot claim a bunch of hands-on experience with this
(cgroup-v2 with kubernetes), so please do your own testing and YMMV, etc.

[1] https://docs.kernel.org/accounting/psi.html#psi
[2] https://kubernetes.io/docs/concepts/architecture/cgroups/
[3] https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#memory-qos-with-cgroup-v2
[4] https://docs.kernel.org/admin-guide/cgroup-v2.html#:~:text=memory.reclaim
[5] https://docs.kernel.org/admin-guide/cgroup-v2.html#:~:text=memory.high
[6] https://docs.kernel.org/admin-guide/cgroup-v2.html#:~:text=memory.max

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
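To make the v2 difference concrete: the soft limit, current usage, and the PSI pressure numbers are all readable from inside the pod. A sketch, assuming the unified cgroup mount, a cgroup namespace, and a kernel built with CONFIG_PSI:

    import time
    from pathlib import Path

    CG = Path("/sys/fs/cgroup")  # the pod's own cgroup under a cgroup namespace

    while True:
        high = (CG / "memory.high").read_text().strip()  # "max" = no throttle point
        current = int((CG / "memory.current").read_text())
        psi = (CG / "memory.pressure").read_text()       # "some avg10=... / full avg10=..."
        print(f"current={current / 2**30:.2f} GiB  high={high}")
        print(psi.strip())
        time.sleep(10)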