Re: PGC_SIGHUP shared_buffers? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: PGC_SIGHUP shared_buffers?
Date
Msg-id 20240216190851.wvrjuchidenbg3si@awork3.anarazel.de
Whole thread Raw
In response to PGC_SIGHUP shared_buffers?  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: PGC_SIGHUP shared_buffers?
List pgsql-hackers
Hi,

On 2024-02-16 09:58:43 +0530, Robert Haas wrote:
> I remember Magnus making a comment many years ago to the effect that
> every setting that is PGC_POSTMASTER is a bug, but some of those bugs
> are very difficult to fix. Perhaps the use of the word bug is
> arguable, but I think the sentiment is apt, especially with regard to
> shared_buffers. Changing without a server restart would be really
> nice, but it's hard to figure out how to do it. I can think of a few
> basic approaches, and I'd like to know (a) which ones people think are
> good and which ones people think suck (maybe they all suck) and (b) if
> anybody's got any other ideas not mentioned here.

IMO the ability to *shrink* shared_buffers dynamically and cheaply is more
important than growing it in a way, except that they are related of
course. Idling hardware is expensive, thus overcommitting hardware is very
attractive (I count "serverless" as part of that). To be able to overcommit
effectively, unused long-lived memory has to be released. I.e. shared buffers
needs to be shrinkable.



Perhaps worth noting that there are two things limiting the size of shared
buffers: 1) the available buffer space 2) the available buffer *mapping*
space. I think making the buffer mapping resizable is considerably harder than
the buffers themselves. Of course pre-reserving memory for a buffer mapping
suitable for a huge shared_buffers is more feasible than pre-allocating all
that memory for the buffers themselves. But it' still mean youd have a maximum
set at server start.


> 1. Complicate the Buffer->pointer mapping. Right now, BufferGetBlock()
> is basically just BufferBlocks + (buffer - 1) * BLCKSZ, which means
> that we're expecting to find all of the buffers in a single giant
> array. Years ago, somebody proposed changing the implementation to
> essentially WhereIsTheBuffer[buffer], which was heavily criticized on
> performance grounds, because it requires an extra memory access. A
> gentler version of this might be something like
> WhereIsTheChunkOfBuffers[buffer/CHUNK_SIZE]+(buffer%CHUNK_SIZE)*BLCKSZ;
> i.e. instead of allowing every single buffer to be at some random
> address, manage chunks of the buffer pool. This makes the lookup array
> potentially quite a lot smaller, which might mitigate performance
> concerns. For example, if you had one chunk per GB of shared_buffers,
> your mapping array would need only a handful of cache lines, or a few
> handfuls on really big systems.

Such a scheme still leaves you with a dependend memory read for a quite
frequent operation. It could turn out to nto matter hugely if the mapping
array is cache resident, but I don't know if we can realistically bank on
that.

I'm also somewhat concerned about the coarse granularity being problematic. It
seems like it'd lead to a desire to make the granule small, causing slowness.


One big advantage of a scheme like this is that it'd be a step towards a NUMA
aware buffer mapping and replacement. Practically everything beyond the size
of a small consumer device these days has NUMA characteristics, even if not
"officially visible". We could make clock sweeps (or a better victim buffer
selection algorithm) happen within each "chunk", with some additional
infrastructure to choose which of the chunks to search a buffer in. Using a
chunk on the current numa node, except when there is a lot of imbalance
between buffer usage or replacement rate between chunks.



> 2. Make a Buffer just a disguised pointer. Imagine something like
> typedef struct { Page bp; } *buffer. WIth this approach,
> BufferGetBlock() becomes trivial.

You also additionally need something that allows for efficient iteration over
all shared buffers. Making buffer replacement and checkpointing more expensive
isn't great.


> 3. Reserve lots of address space and then only use some of it. I hear
> rumors that some forks of PG have implemented something like this. The
> idea is that you convince the OS to give you a whole bunch of address
> space, but you try to avoid having all of it be backed by physical
> memory. If you later want to increase shared_buffers, you then get the
> OS to back more of it by physical memory, and if you later want to
> decrease shared_buffers, you hopefully have some way of giving the OS
> the memory back. As compared with the previous two approaches, this
> seems less likely to be noticeable to most PG code.

Another advantage is that you can shrink shared buffers fairly granularly and
cheaply with that approach, compared to having to move buffes entirely out of
a larger mapping to be able to unmap it.


> Problems include (1) you have to somehow figure out how much address space
> to reserve, and that forms an upper bound on how big shared_buffers can grow
> at runtime and

Presumably you'd normally not want to reserve more than the physical amount of
memory on the system. Sure, memory can be hot added, but IME that's quite
rare.


> (2) you have to figure out ways to reserve address space and
> back more or less of it with physical memory that will work on all of the
> platforms that we currently support or might want to support in the future.

We also could decide to only implement 2) on platforms with suitable APIs.


A third issue is that it can confuse administrators inspecting the system with
OS tools. "Postgres uses many terabytes of memory on my system!" due to VIRT
being huge etc.


> 4. Give up on actually changing the size of shared_buffer per se, but
> stick some kind of resizable secondary cache in front of it. Data that
> is going to be manipulated gets brought into a (perhaps small?) "real"
> shared_buffers that behaves just like today, but you have some larger
> data structure which is designed to be easier to resize and maybe
> simpler in some other ways that sits between shared_buffers and the OS
> cache. This doesn't seem super-appealing because it requires a lot of
> data copying, but maybe it's worth considering as a last resort.

Yea, that seems quite unappealing. Needing buffer replacement to be able to
pin a buffer would be ... unattractive.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: table inheritance versus column compression and storage settings
Next
From: Magnus Hagander
Date:
Subject: Re: System username in pg_stat_activity