Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet - Mailing list pgsql-performance
From: Frits Hoogland
Subject: Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet
Msg-id: 1A770E71-8F3C-4D92-816A-44C63AC1AFA7@gmail.com
In response to: Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet (Joe Conway <mail@joeconway.com>)
List: pgsql-performance
Joe, I am trying to help, and make people think about things correctly.
The Linux kernel is in fact constantly changing, sometimes subtly and sometimes less so, and there is a general lack of clear statistics exposing the more nuanced memory operations, as well as of documentation about them.
And: there are a lot of myths about memory management. Some are myths because they describe a situation that was once true but, given changes in the kernel code, is not true anymore; others are simply myths.
The best technical description of recent memory management that I could find is: https://lpc.events/event/11/contributions/896/attachments/793/1493/slides-r2.pdf
On 6 Aug 2025 at 18:33, Joe Conway <mail@joeconway.com> wrote:
* Swap is what is used when anonymous memory must be reclaimed to allow for an allocation of anonymous memory.
Correct. Swapped out pages are anonymous memory pages exclusively.
It's the result of memory reclaim of anonymous pages, which, unlike (non-dirty, non-pinned) file pages, cannot simply be discarded: file pages don't need their contents saved, anonymous pages do.
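The distinction between anonymous and file-backed pages is directly visible in /proc/meminfo; a quick way to look at it on a Linux box (a read-only sketch, Linux only):

```shell
# Compare anonymous vs. file-backed memory as the kernel reports it.
# AnonPages must be written to swap to be reclaimed; Cached (file pages)
# can simply be dropped when clean and unpinned.
grep -E '^(AnonPages|Cached|Dirty|SwapTotal|SwapFree):' /proc/meminfo
```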
* The Linux kernel will aggressively use all available memory for
file buffers, pushing usage against the limits.
It's an explicit design choice of the Linux kernel not to reclaim file pages when they are unpinned/no longer used, leaving them as cached pages.
(Anonymous pages are freed explicitly when released by their owner and put on the free list.)
There is no aggressive push; file pages are simply left in place after use, so there is no pushing of usage against the limits.
It's the swapper ('page daemon') that eventually frees file pages based on LRU, once free memory in a zone drops below the 'low' watermark (vm.min_free_kbytes * 2); when free memory drops to the 'min' watermark (vm.min_free_kbytes * 1, 'pages min'), tasks are forced to free memory themselves ('direct reclaim').
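These watermarks can be inspected directly; a minimal read-only sketch (the zoneinfo values are per zone, in pages):

```shell
# vm.min_free_kbytes drives the per-zone 'min' and 'low' watermarks
cat /proc/sys/vm/min_free_kbytes
# Show each zone header together with its min/low/high watermark lines
grep -E '^Node|^ +(min|low|high) ' /proc/zoneinfo
```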
* Especially in the older 4 series kernels, file buffers often
cannot be reclaimed fast enough
I am not sure what is described here, and whether this is about the swapper or direct reclaim.
There is no need to do this 'fast enough'; see the slide deck above.
This is probably aimed at the swapper not reclaiming 'fast enough'. However, that is not how this works: if memory requests make free memory drop to 'pages min', a task will perform 'direct reclaim' itself.
* With no swap and a large-ish anonymous memory request, it is
easy to push over the limit to cause the OOM killer to strike.
I am afraid that this is not a correct representation of the actual mechanism, again: look at the slide deck and explanations above.
The swapper frees memory, which is then used by a task requesting pages at page fault; for that it doesn't matter whether the request is for anonymous memory or file memory.
If memory gets down to pages min, the swapper did not reclaim memory fast enough, and a task will perform direct reclaim.
In the case of direct reclaim, the choice of what memory type to reclaim is between file memory and anonymous memory.
If there is no swap, the option of reclaiming anonymous memory is not available, because anonymous pages cannot be discarded the way non-dirty, unpinned file pages can; their contents have to be preserved.
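Whether direct reclaim is actually happening can be observed in /proc/vmstat; a quick check (exact counter names vary slightly between kernel versions):

```shell
# kswapd (the 'swapper') vs. direct reclaim activity:
# pgscan_direct*/pgsteal_direct* and allocstall* increase when tasks
# have to reclaim memory themselves instead of the swapper doing it.
grep -E '^(pgscan_|pgsteal_|allocstall)' /proc/vmstat
```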
If swappiness is set to 0 but swap is available, some documentation suggests anonymous memory will never be used; however, I have found this not to be true: Linux might still choose anonymous pages to reclaim. Obviously, the lower swappiness is, the less likely reclaim is to choose anonymous memory pages.
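For reference, the setting itself is trivial to inspect and change; a sketch (the value 10 below is just an example, not a recommendation):

```shell
# Current swappiness (the default is typically 60)
cat /proc/sys/vm/swappiness
# To bias reclaim away from anonymous pages (needs root):
#   sysctl vm.swappiness=10
```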
What you seem to suggest is that with no swap, and thus without the option of using anonymous pages for reclaim, the reclaim mechanism depends on the speed of (file) reclaim, possibly by the swapper. I hope it's clear that this is not true.
Obviously, when there is swap, the total number of pages potentially available for reclaim becomes higher, because anonymous pages can be reclaimed up to the size of swap.
But if that amount is set low (as suggested: 'You don't need a huge amount'), the actual increase in pages available for reclaim is negligible, and so is the benefit it provides against running out of memory.
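As a rough illustration with hypothetical numbers: on a 512 GiB host, 16 GiB of swap grows the reclaimable pool by only about 3%:

```shell
# 16 GiB of swap on a 512 GiB host: negligible extra reclaim headroom
awk 'BEGIN { ram = 512; swap = 16; printf "%.1f%%\n", swap / ram * 100 }'
# prints 3.1%
```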
* On the other hand, with swap enabled anon memory can be
reclaimed giving the kernel more time to deal with file buffer
reclamation.
See the explanation in the previous comments. Time is not a component in reclaim failing to find pages for a task that page faults to add memory, because a task will perform direct reclaim itself if it exhausts the free memory provided by the swapper.
At least that is what I have observed.
The kernel code for direct reclaim shows that when direct reclaim has finished scanning memory pages (only file pages if there is no swap, or both file and anonymous pages if there is) and was not able to satisfy the request for the pages it needs, it will trigger the kernel out-of-memory (OOM) killer, because it has run out of available pages.
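When that happens it is logged; a quick way to check for past OOM-killer activity (reading the ring buffer may require root on some systems):

```shell
# Look for OOM killer events in the kernel ring buffer
dmesg 2>/dev/null | grep -i 'out of memory' \
  || echo "no OOM events in ring buffer"
```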
Again, as I mentioned in the beginning, there are lots and lots of nuances and mechanisms in play; this is a reasonable basic explanation of the mechanism, based on the above slide deck and on reading the kernel code.
One thing that can very easily be misleading is that memory is not one general, system-wide pool, but is instead separated into zones. This can lead to a situation where there is still memory available for reclaim system-wide, but not in the zone the process is scanning; the process may then seem to run out of memory and trigger the OOM killer while there still is memory, which can be very confusing if you're not aware of these details.
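The per-zone accounting is visible in /proc/zoneinfo; a minimal read-only sketch:

```shell
# Free pages are tracked per zone, not as one system-wide pool;
# one zone can be exhausted while another still has free pages.
grep -E '^Node|nr_free_pages' /proc/zoneinfo
```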
I have read, experimented, searched, tested and diagnosed a lot of issues, and this is what I have come up with; it fits the kernel code and the documentation that I trust.
Based on these mechanisms, and especially for database systems, removing swap takes away a mechanism that has no benefit for database systems on modern, high-memory machines.
That does not mean it's not beneficial in other cases. If memory usage is very dynamic, memory is more constrained, and the workload is less latency-sensitive, it might be a good idea to have an overflow, with all the downsides that it brings.
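Since the thread is about vm.overcommit_ratio: the current overcommit policy and its accounting can be read directly; a sketch:

```shell
# 0 = heuristic, 1 = always overcommit, 2 = strict accounting
# governed by overcommit_ratio (or overcommit_kbytes)
cat /proc/sys/vm/overcommit_memory
cat /proc/sys/vm/overcommit_ratio
# CommitLimit is derived from overcommit_ratio in mode 2;
# Committed_AS is the currently committed address space
grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo
```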
Frits Hoogland
On 7 Aug 2025, at 03:12, Joe Conway <mail@joeconway.com> wrote:
On 8/6/25 17:14, Frits Hoogland wrote:
As I said, do not disable swap. You don't need a huge amount, but maybe 16 GB or so would do it.
Joe, please, can you state a technical reason for saying this?
All you are saying is ‘don’t do this’.
I’ve stated my reasons for why this doesn’t make sense, and you don’t give any reason.
What do you call the below?
On 6 Aug 2025 at 18:33, Joe Conway <mail@joeconway.com> wrote:
* Swap is what is used when anonymous memory must be reclaimed to allow for an allocation of anonymous memory.
* The Linux kernel will aggressively use all available memory for
file buffers, pushing usage against the limits.
* Especially in the older 4 series kernels, file buffers often
cannot be reclaimed fast enough
* With no swap and a large-ish anonymous memory request, it is
easy to push over the limit to cause the OOM killer to strike.
* On the other hand, with swap enabled anon memory can be
reclaimed giving the kernel more time to deal with file buffer
reclamation.
At least that is what I have observed.
If you don't think that is adequate technical reason, feel free to ignore my advice.
--
Joe Conway
PostgreSQL Contributors Team
Amazon Web Services: https://aws.amazon.com