Thread: PGC_SIGHUP shared_buffers?
Hi, I remember Magnus making a comment many years ago to the effect that every setting that is PGC_POSTMASTER is a bug, but some of those bugs are very difficult to fix. Perhaps the use of the word bug is arguable, but I think the sentiment is apt, especially with regard to shared_buffers. Changing it without a server restart would be really nice, but it's hard to figure out how to do it. I can think of a few basic approaches, and I'd like to know (a) which ones people think are good and which ones people think suck (maybe they all suck) and (b) if anybody's got any other ideas not mentioned here.

1. Complicate the Buffer->pointer mapping. Right now, BufferGetBlock() is basically just BufferBlocks + (buffer - 1) * BLCKSZ, which means that we're expecting to find all of the buffers in a single giant array. Years ago, somebody proposed changing the implementation to essentially WhereIsTheBuffer[buffer], which was heavily criticized on performance grounds, because it requires an extra memory access. A gentler version of this might be something like WhereIsTheChunkOfBuffers[buffer/CHUNK_SIZE]+(buffer%CHUNK_SIZE)*BLCKSZ; i.e. instead of allowing every single buffer to be at some random address, manage chunks of the buffer pool. This makes the lookup array potentially quite a lot smaller, which might mitigate performance concerns. For example, if you had one chunk per GB of shared_buffers, your mapping array would need only a handful of cache lines, or a few handfuls on really big systems. (I am here ignoring the difficulties of how to orchestrate addition of or removal of chunks as a SMOP[1]. Feel free to criticize that hand-waving, but as of this writing, I feel like moderate determination would suffice.)

2. Make a Buffer just a disguised pointer. Imagine something like typedef struct { Page bp; } *buffer. With this approach, BufferGetBlock() becomes trivial. The tricky part with this approach is that you still need a cheap way of finding the buffer header. What I imagine might work here is to again have some kind of chunked representation of shared_buffers, where each chunk contains a bunch of buffer headers at, say, the beginning, followed by a bunch of buffers. Theoretically, if the chunks are sufficiently strongly aligned, you can figure out what offset you're at within the chunk without any additional information and the whole process of locating the buffer header is just math, with no memory access. But in practice, getting the chunks to be sufficiently strongly aligned sounds hard, and this also makes a Buffer 64 bits rather than the current 32. A variant on this concept might be to make the Buffer even wider and include two pointers in it, i.e. typedef struct { Page bp; BufferDesc *bd; } Buffer.

3. Reserve lots of address space and then only use some of it. I hear rumors that some forks of PG have implemented something like this. The idea is that you convince the OS to give you a whole bunch of address space, but you try to avoid having all of it be backed by physical memory. If you later want to increase shared_buffers, you then get the OS to back more of it by physical memory, and if you later want to decrease shared_buffers, you hopefully have some way of giving the OS the memory back. As compared with the previous two approaches, this seems less likely to be noticeable to most PG code.
Problems include (1) you have to somehow figure out how much address space to reserve, and that forms an upper bound on how big shared_buffers can grow at runtime, and (2) you have to figure out ways to reserve address space and back more or less of it with physical memory that will work on all of the platforms that we currently support or might want to support in the future.

4. Give up on actually changing the size of shared_buffers per se, but stick some kind of resizable secondary cache in front of it. Data that is going to be manipulated gets brought into a (perhaps small?) "real" shared_buffers that behaves just like today, but you have some larger data structure which is designed to be easier to resize and maybe simpler in some other ways that sits between shared_buffers and the OS cache. This doesn't seem super-appealing because it requires a lot of data copying, but maybe it's worth considering as a last resort.

Thoughts?

-- Robert Haas EDB: http://www.enterprisedb.com

[1] https://en.wikipedia.org/wiki/Small_matter_of_programming
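For concreteness, a minimal sketch of the chunked translation in approach 1. The names (WhereIsTheChunkOfBuffers, CHUNK_SIZE, BufferGetBlockChunked) come from the email or are made up here, not from existing PostgreSQL code, and the fixed-size top-level array stands in for whatever structure would manage chunk addition and removal:

```c
/*
 * Hypothetical chunked Buffer -> pointer translation (approach 1).
 * With one chunk per GB of shared_buffers, CHUNK_SIZE is 131072 buffers
 * and the lookup array stays small enough to be cache resident.
 */
#include <stddef.h>

#define BLCKSZ      8192
#define CHUNK_SIZE  ((1024 * 1024 * 1024) / BLCKSZ)   /* buffers per 1GB chunk */

typedef int Buffer;                 /* 1-based, as in PostgreSQL today */
typedef char *Block;

/* One base address per chunk; entries could be added/removed at runtime. */
static Block WhereIsTheChunkOfBuffers[1024];          /* enough for 1TB here */

static inline Block
BufferGetBlockChunked(Buffer buffer)
{
    int idx = buffer - 1;           /* Buffer numbers start at 1 */

    return WhereIsTheChunkOfBuffers[idx / CHUNK_SIZE] +
        (size_t) (idx % CHUNK_SIZE) * BLCKSZ;
}
```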
Hi, On 2024-02-16 09:58:43 +0530, Robert Haas wrote: > I remember Magnus making a comment many years ago to the effect that > every setting that is PGC_POSTMASTER is a bug, but some of those bugs > are very difficult to fix. Perhaps the use of the word bug is > arguable, but I think the sentiment is apt, especially with regard to > shared_buffers. Changing it without a server restart would be really > nice, but it's hard to figure out how to do it. I can think of a few > basic approaches, and I'd like to know (a) which ones people think are > good and which ones people think suck (maybe they all suck) and (b) if > anybody's got any other ideas not mentioned here. IMO the ability to *shrink* shared_buffers dynamically and cheaply is more important than growing it in a way, except that they are related of course. Idling hardware is expensive, thus overcommitting hardware is very attractive (I count "serverless" as part of that). To be able to overcommit effectively, unused long-lived memory has to be released. I.e. shared buffers needs to be shrinkable. Perhaps worth noting that there are two things limiting the size of shared buffers: 1) the available buffer space 2) the available buffer *mapping* space. I think making the buffer mapping resizable is considerably harder than the buffers themselves. Of course pre-reserving memory for a buffer mapping suitable for a huge shared_buffers is more feasible than pre-allocating all that memory for the buffers themselves. But it'd still mean you'd have a maximum set at server start. > 1. Complicate the Buffer->pointer mapping. Right now, BufferGetBlock() > is basically just BufferBlocks + (buffer - 1) * BLCKSZ, which means > that we're expecting to find all of the buffers in a single giant > array. Years ago, somebody proposed changing the implementation to > essentially WhereIsTheBuffer[buffer], which was heavily criticized on > performance grounds, because it requires an extra memory access. A > gentler version of this might be something like > WhereIsTheChunkOfBuffers[buffer/CHUNK_SIZE]+(buffer%CHUNK_SIZE)*BLCKSZ; > i.e. instead of allowing every single buffer to be at some random > address, manage chunks of the buffer pool. This makes the lookup array > potentially quite a lot smaller, which might mitigate performance > concerns. For example, if you had one chunk per GB of shared_buffers, > your mapping array would need only a handful of cache lines, or a few > handfuls on really big systems. Such a scheme still leaves you with a dependent memory read for a quite frequent operation. It could turn out to not matter hugely if the mapping array is cache resident, but I don't know if we can realistically bank on that. I'm also somewhat concerned about the coarse granularity being problematic. It seems like it'd lead to a desire to make the granule small, causing slowness. One big advantage of a scheme like this is that it'd be a step towards a NUMA aware buffer mapping and replacement. Practically everything beyond the size of a small consumer device these days has NUMA characteristics, even if not "officially visible". We could make clock sweeps (or a better victim buffer selection algorithm) happen within each "chunk", with some additional infrastructure to choose which of the chunks to search a buffer in. Using a chunk on the current NUMA node, except when there is a lot of imbalance between buffer usage or replacement rate between chunks. > 2. Make a Buffer just a disguised pointer. Imagine something like > typedef struct { Page bp; } *buffer.
With this approach, > BufferGetBlock() becomes trivial. You additionally need something that allows for efficient iteration over all shared buffers. Making buffer replacement and checkpointing more expensive isn't great. > 3. Reserve lots of address space and then only use some of it. I hear > rumors that some forks of PG have implemented something like this. The > idea is that you convince the OS to give you a whole bunch of address > space, but you try to avoid having all of it be backed by physical > memory. If you later want to increase shared_buffers, you then get the > OS to back more of it by physical memory, and if you later want to > decrease shared_buffers, you hopefully have some way of giving the OS > the memory back. As compared with the previous two approaches, this > seems less likely to be noticeable to most PG code. Another advantage is that you can shrink shared buffers fairly granularly and cheaply with that approach, compared to having to move buffers entirely out of a larger mapping to be able to unmap it. > Problems include (1) you have to somehow figure out how much address space > to reserve, and that forms an upper bound on how big shared_buffers can grow > at runtime, and Presumably you'd normally not want to reserve more than the physical amount of memory on the system. Sure, memory can be hot added, but IME that's quite rare. > (2) you have to figure out ways to reserve address space and > back more or less of it with physical memory that will work on all of the > platforms that we currently support or might want to support in the future. We also could decide to only implement 2) on platforms with suitable APIs. A third issue is that it can confuse administrators inspecting the system with OS tools. "Postgres uses many terabytes of memory on my system!" due to VIRT being huge etc. > 4. Give up on actually changing the size of shared_buffers per se, but > stick some kind of resizable secondary cache in front of it. Data that > is going to be manipulated gets brought into a (perhaps small?) "real" > shared_buffers that behaves just like today, but you have some larger > data structure which is designed to be easier to resize and maybe > simpler in some other ways that sits between shared_buffers and the OS > cache. This doesn't seem super-appealing because it requires a lot of > data copying, but maybe it's worth considering as a last resort. Yea, that seems quite unappealing. Needing buffer replacement to be able to pin a buffer would be ... unattractive. Greetings, Andres Freund
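To make approach 2 concrete, here is one way the "just math" lookup of the buffer header could work, assuming each chunk sits at a power-of-two-aligned address and is laid out as descriptors followed by pages. All names and sizes are invented for illustration; this is not existing PostgreSQL code, and the alignment requirement is exactly the part flagged above as hard:

```c
/*
 * Hypothetical layout for approach 2: each 16MB-aligned chunk holds
 * BUFFERS_PER_CHUNK buffer descriptors followed by the same number of 8kB
 * pages, so the descriptor can be recovered from the page pointer by
 * arithmetic alone, with no extra memory access.
 */
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ            8192
#define BUFFERS_PER_CHUNK 1024
#define DESC_SIZE         64                   /* padded per-buffer descriptor */
#define CHUNK_BYTES       (16u * 1024 * 1024)  /* power of two, holds descs + pages */

typedef struct { uint32_t state; } FakeBufferDesc;
typedef struct { char *bp; } FakeBuffer;       /* the "disguised pointer" */

static inline FakeBufferDesc *
GetBufferDescriptorFromPage(FakeBuffer buf)
{
    uintptr_t page  = (uintptr_t) buf.bp;
    uintptr_t chunk = page & ~((uintptr_t) CHUNK_BYTES - 1);  /* chunk start */
    uintptr_t pages = chunk + BUFFERS_PER_CHUNK * DESC_SIZE;  /* first page */
    size_t    n     = (page - pages) / BLCKSZ;                /* index within chunk */

    return (FakeBufferDesc *) (chunk + n * DESC_SIZE);
}
```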
On 16/02/2024 06:28, Robert Haas wrote: > 3. Reserve lots of address space and then only use some of it. I hear > rumors that some forks of PG have implemented something like this. The > idea is that you convince the OS to give you a whole bunch of address > space, but you try to avoid having all of it be backed by physical > memory. If you later want to increase shared_buffers, you then get the > OS to back more of it by physical memory, and if you later want to > decrease shared_buffers, you hopefully have some way of giving the OS > the memory back. As compared with the previous two approaches, this > seems less likely to be noticeable to most PG code. Problems include > (1) you have to somehow figure out how much address space to reserve, > and that forms an upper bound on how big shared_buffers can grow at > runtime and (2) you have to figure out ways to reserve address space > and back more or less of it with physical memory that will work on all > of the platforms that we currently support or might want to support in > the future. A variant of this approach: 5. Re-map the shared_buffers when needed. Between transactions, a backend should not hold any buffer pins. When there are no pins, you can munmap() the shared_buffers and mmap() it at a different address. -- Heikki Linnakangas Neon (https://neon.tech)
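As a rough illustration of the backend-side remap this implies (and anticipating Andres's point below that an anonymous mapping cannot be re-obtained by an already-running process, so a named POSIX shm segment is assumed): the segment name and helper are made up, error handling is minimal, and growing or shrinking the underlying segment itself is left out.

```c
/*
 * Hypothetical backend-side remap of a file-backed shared buffers segment.
 * Must only run at a point where this backend holds no buffer pins, since
 * every pointer into the old mapping becomes invalid.
 */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void  *buffers = NULL;      /* this backend's current mapping */
static size_t mapped_size = 0;

static int
remap_shared_buffers(size_t new_size)
{
    int fd = shm_open("/pg_shared_buffers", O_RDWR, 0);  /* hypothetical name */

    if (fd < 0)
        return -1;

    if (buffers != NULL)
        munmap(buffers, mapped_size);

    /* Let the kernel choose a new address; nothing relies on the old one. */
    buffers = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    if (buffers == MAP_FAILED)
    {
        buffers = NULL;            /* fatal for this backend in practice */
        return -1;
    }
    mapped_size = new_size;
    return 0;
}
```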
On Fri, Feb 16, 2024 at 5:29 PM Robert Haas <robertmhaas@gmail.com> wrote: > 3. Reserve lots of address space and then only use some of it. I hear > rumors that some forks of PG have implemented something like this. The > idea is that you convince the OS to give you a whole bunch of address > space, but you try to avoid having all of it be backed by physical > memory. If you later want to increase shared_buffers, you then get the > OS to back more of it by physical memory, and if you later want to > decrease shared_buffers, you hopefully have some way of giving the OS > the memory back. As compared with the previous two approaches, this > seems less likely to be noticeable to most PG code. Problems include > (1) you have to somehow figure out how much address space to reserve, > and that forms an upper bound on how big shared_buffers can grow at > runtime and (2) you have to figure out ways to reserve address space > and back more or less of it with physical memory that will work on all > of the platforms that we currently support or might want to support in > the future. FTR I'm aware of a working experimental prototype along these lines, that will be presented in Vancouver: https://www.pgevents.ca/events/pgconfdev2024/sessions/session/31-enhancing-postgresql-plasticity-new-frontiers-in-memory-management/
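In case it helps to visualize approach 3, a Linux-specific sketch with made-up helpers (page-aligned sizes assumed): reserve the maximum range up front with PROT_NONE and MAP_NORESERVE so nothing is committed, then back or un-back pieces of it by mapping a shared segment over the range with MAP_FIXED. Releasing the backing store of the segment itself, and the portability questions raised above, are out of scope.

```c
/*
 * Hypothetical reserve-then-back scheme for approach 3 (Linux semantics).
 */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

static char  *region = NULL;       /* start of the reserved address range */
static size_t reserved = 0;

/* Reserve address space only; no physical memory or swap is committed. */
static void *
reserve_buffer_space(size_t max_size)
{
    region = mmap(NULL, max_size, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (region == MAP_FAILED)
        return NULL;
    reserved = max_size;
    return region;
}

/* Grow or shrink the backed part of the reservation to new_size bytes. */
static int
resize_buffer_space(int shm_fd, size_t old_size, size_t new_size)
{
    if (new_size > old_size)
    {
        /* Back [old_size, new_size) with the shared segment. */
        if (mmap(region + old_size, new_size - old_size,
                 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
                 shm_fd, (off_t) old_size) == MAP_FAILED)
            return -1;
    }
    else if (new_size < old_size)
    {
        /* Re-install the inaccessible reservation over the unused tail. */
        if (mmap(region + new_size, old_size - new_size, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED,
                 -1, 0) == MAP_FAILED)
            return -1;
    }
    return 0;
}
```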
On Fri, 16 Feb 2024 at 21:24, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 16/02/2024 06:28, Robert Haas wrote: > > 3. Reserve lots of address space and then only use some of it. I hear > > rumors that some forks of PG have implemented something like this. The > > idea is that you convince the OS to give you a whole bunch of address > > space, but you try to avoid having all of it be backed by physical > > memory. If you later want to increase shared_buffers, you then get the > > OS to back more of it by physical memory, and if you later want to > > decrease shared_buffers, you hopefully have some way of giving the OS > > the memory back. As compared with the previous two approaches, this > > seems less likely to be noticeable to most PG code. Problems include > > (1) you have to somehow figure out how much address space to reserve, > > and that forms an upper bound on how big shared_buffers can grow at > > runtime and (2) you have to figure out ways to reserve address space > > and back more or less of it with physical memory that will work on all > > of the platforms that we currently support or might want to support in > > the future. > > A variant of this approach: > > 5. Re-map the shared_buffers when needed. > > Between transactions, a backend should not hold any buffer pins. When > there are no pins, you can munmap() the shared_buffers and mmap() it at > a different address. This can quite realistically fail to find an unused memory region of sufficient size when the heap is sufficiently fragmented, e.g. through ASLR, which would make it difficult to use this dynamic single-allocation shared_buffers in security-hardened environments. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Hi, On 2024-02-17 23:40:51 +0100, Matthias van de Meent wrote: > > 5. Re-map the shared_buffers when needed. > > > > Between transactions, a backend should not hold any buffer pins. When > > there are no pins, you can munmap() the shared_buffers and mmap() it at > > a different address. I hadn't quite realized that we don't seem to rely on shared_buffers having a specific address across processes. That does seem to make it more viable to remap mappings in backends. However, I don't think this works with mmap(MAP_ANONYMOUS) - as long as we are using the process model. To my knowledge there is no way to get the same mapping in multiple already existing processes. Even mmap()ing /dev/zero after sharing file descriptors across processes doesn't work, if I recall correctly. We would have to use sysv/posix shared memory or such (or mmap() of files in tmpfs) for the shared buffers allocation. > This can quite realistically fail to find an unused memory region of > sufficient size when the heap is sufficiently fragmented, e.g. through > ASLR, which would make it difficult to use this dynamic > single-allocation shared_buffers in security-hardened environments. I haven't seen anywhere close to this bad fragmentation on 64bit machines so far - have you? Most implementations of ASLR randomize mmap locations across multiple runs of the same binary, not within the same binary. There are out-of-tree Linux patches that make mmap() randomize every single allocation, but I am not sure that we ought to care about such things. Even if we were to care, on 64bit platforms it doesn't seem likely that we'd run out of space that quickly. AMD64 had 48bits of virtual address space from the start, and on recent CPUs that has grown to 57bits [1], that's a lot of space. And if you do run out of VM space, wouldn't that also affect lots of other things, like mmap() for malloc? Greetings, Andres Freund [1] https://en.wikipedia.org/wiki/Intel_5-level_paging
On Sat, Feb 17, 2024 at 12:38 AM Andres Freund <andres@anarazel.de> wrote: > IMO the ability to *shrink* shared_buffers dynamically and cheaply is more > important than growing it in a way, except that they are related of > course. Idling hardware is expensive, thus overcommitting hardware is very > attractive (I count "serverless" as part of that). To be able to overcommit > effectively, unused long-lived memory has to be released. I.e. shared buffers > needs to be shrinkable. I see your point, but people want to scale up, too. Of course, those people will have to live with what we can practically implement. > Perhaps worth noting that there are two things limiting the size of shared > buffers: 1) the available buffer space 2) the available buffer *mapping* > space. I think making the buffer mapping resizable is considerably harder than > the buffers themselves. Of course pre-reserving memory for a buffer mapping > suitable for a huge shared_buffers is more feasible than pre-allocating all > that memory for the buffers themselves. But it'd still mean you'd have a maximum > set at server start. We size the fsync queue based on shared_buffers too. That's a lot less important, though, and could be worked around in other ways. > Such a scheme still leaves you with a dependent memory read for a quite > frequent operation. It could turn out to not matter hugely if the mapping > array is cache resident, but I don't know if we can realistically bank on > that. I don't know, either. I was hoping you did. :-) But we can rig up a test pretty easily, I think. We can just create a fake mapping that gives the same answers as the current calculation and then beat on it. Of course, if testing shows no difference, there is the small problem of knowing whether the test scenario was right; and it's also possible that an initial impact could be mitigated by removing some gratuitously repeated buffer # -> buffer address mappings. Still, I think it could provide us with a useful baseline. I'll throw something together when I have time, unless someone beats me to it. > I'm also somewhat concerned about the coarse granularity being problematic. It > seems like it'd lead to a desire to make the granule small, causing slowness. How many people set shared_buffers to something that's not a whole number of GB these days? I mean I bet it happens, but in practice if you rounded to the nearest GB, or even the nearest 2GB, I bet almost nobody would really care. I think it's fine to be opinionated here and hold the line at a relatively large granule, even though in theory people could want something else. Alternatively, maybe there could be a provision for the last granule to be partial, and if you extend further, you throw away the partial granule and replace it with a whole one. But I'm not even sure that's worth doing. > One big advantage of a scheme like this is that it'd be a step towards a NUMA > aware buffer mapping and replacement. Practically everything beyond the size > of a small consumer device these days has NUMA characteristics, even if not > "officially visible". We could make clock sweeps (or a better victim buffer > selection algorithm) happen within each "chunk", with some additional > infrastructure to choose which of the chunks to search a buffer in. Using a > chunk on the current NUMA node, except when there is a lot of imbalance > between buffer usage or replacement rate between chunks.
I also wondered whether this might be a useful step toward allowing different-sized buffers in the same buffer pool (ducks, runs away quickly). I don't have any particular use for that myself, but it's a thing some people probably want for some reason or other. > > 2. Make a Buffer just a disguised pointer. Imagine something like > > typedef struct { Page bp; } *buffer. With this approach, > > BufferGetBlock() becomes trivial. > > You additionally need something that allows for efficient iteration over > all shared buffers. Making buffer replacement and checkpointing more expensive > isn't great. True, but I don't really see what the problem with this would be in this approach. > > 3. Reserve lots of address space and then only use some of it. I hear > > rumors that some forks of PG have implemented something like this. The > > idea is that you convince the OS to give you a whole bunch of address > > space, but you try to avoid having all of it be backed by physical > > memory. If you later want to increase shared_buffers, you then get the > > OS to back more of it by physical memory, and if you later want to > > decrease shared_buffers, you hopefully have some way of giving the OS > > the memory back. As compared with the previous two approaches, this > > seems less likely to be noticeable to most PG code. > > Another advantage is that you can shrink shared buffers fairly granularly and > cheaply with that approach, compared to having to move buffers entirely out of > a larger mapping to be able to unmap it. Don't you have to still move buffers entirely out of the region you want to unmap? > > Problems include (1) you have to somehow figure out how much address space > > to reserve, and that forms an upper bound on how big shared_buffers can grow > > at runtime, and > > Presumably you'd normally not want to reserve more than the physical amount of > memory on the system. Sure, memory can be hot added, but IME that's quite > rare. I would think that might not be so rare in a virtualized environment, which would seem to be one of the most important use cases for this kind of thing. Plus, this would mean we'd need to auto-detect system RAM. I'd rather not go there, and just fix the upper limit via a GUC. > > (2) you have to figure out ways to reserve address space and > > back more or less of it with physical memory that will work on all of the > > platforms that we currently support or might want to support in the future. > > We also could decide to only implement 2) on platforms with suitable APIs. Yep, fair. > A third issue is that it can confuse administrators inspecting the system with > OS tools. "Postgres uses many terabytes of memory on my system!" due to VIRT > being huge etc. Mmph. That's disagreeable but probably not a reason to entirely abandon any particular approach. -- Robert Haas EDB: http://www.enterprisedb.com
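For what it's worth, the "fake mapping" test Robert describes above could start as something as small as this standalone sketch: the same pseudo-random buffer numbers are translated either with today's flat arithmetic or through a chunk lookup table that yields identical (fake) addresses. The constants and names are invented, and a tiny chunk table like this is of course cache resident, which is exactly the favourable case Andres mentions; a realistic test would live inside PostgreSQL itself.

```c
/* Build with: cc -O2 bench.c            (flat arithmetic)
 *         or: cc -O2 -DCHUNKED bench.c  (chunked lookup)             */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define BLCKSZ      8192
#define NBUFFERS    (1 << 20)      /* 8GB worth of buffers */
#define CHUNK_SIZE  (1 << 17)      /* one chunk per GB */

static uintptr_t base = (uintptr_t) 1 << 32;   /* fake shared_buffers base (64-bit) */
static uintptr_t chunk_map[NBUFFERS / CHUNK_SIZE];

int
main(void)
{
    uintptr_t    sum = 0;
    unsigned int x = 12345;
    clock_t      t0;

    for (int i = 0; i < NBUFFERS / CHUNK_SIZE; i++)
        chunk_map[i] = base + (uintptr_t) i * CHUNK_SIZE * BLCKSZ;

    t0 = clock();
    for (long i = 0; i < 100000000L; i++)
    {
        unsigned int buf;

        x = x * 1103515245u + 12345u;          /* cheap PRNG for buffer numbers */
        buf = x % NBUFFERS;
#ifdef CHUNKED
        sum += chunk_map[buf / CHUNK_SIZE] + (uintptr_t) (buf % CHUNK_SIZE) * BLCKSZ;
#else
        sum += base + (uintptr_t) buf * BLCKSZ;
#endif
    }
    printf("checksum %lu, %.2f s\n", (unsigned long) sum,
           (double) (clock() - t0) / CLOCKS_PER_SEC);
    return 0;
}
```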
On Sat, Feb 17, 2024 at 1:54 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > A variant of this approach: > > 5. Re-map the shared_buffers when needed. > > Between transactions, a backend should not hold any buffer pins. When > there are no pins, you can munmap() the shared_buffers and mmap() it at > a different address. I really like this idea, but I think Andres has latched onto the key issue, which is that it supposes that the underlying shared memory object upon which shared_buffers is based can be made bigger and smaller, and that doesn't work for anonymous mappings AFAIK. Maybe that's not really a problem any more, though. If we don't depend on the address of shared_buffers anywhere, we could move it into a DSM. Now that the stats collector uses DSM, it's surely already a requirement that DSM works on every machine that runs PostgreSQL. We'd still need to do something about the buffer mapping table, though, and I bet dshash is not a reasonable answer on performance grounds. Also, it would be nice if the granularity of resizing could be something less than a whole transaction, because transactions can run for a long time. We don't really need to wait for a transaction boundary, probably -- a time when we hold zero buffer pins will probably happen a lot sooner, and at least some of those should be safe points at which to remap. Then again, somebody can open a cursor, read from it until it holds a pin, and then either idle the connection or make it do arbitrary amounts of unrelated work, forcing the remapping to be postponed for an arbitrarily long time. But some version of this problem will exist in any approach to this problem, and long-running pins are a nuisance for other reasons, too. We probably just have to accept this sort of issue as a limitation of our implementation. -- Robert Haas EDB: http://www.enterprisedb.com
If you are interested, this is my attempt to implement resizable shared buffers based on ballooning:
https://github.com/knizhnik/postgres/pull/2
Unused memory is returned to the OS using `madvise` (so it is not a very portable solution).
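A minimal illustration of that ballooning step, not taken from the actual patch (the helper and its arguments are invented): the buffer pool stays mapped at its maximum size, and the pages backing the currently unused tail are handed back via madvise. Note that for a shared mapping MADV_DONTNEED mainly drops this process's pages; on Linux MADV_REMOVE is what actually frees the backing store, which is part of why this is not very portable.

```c
/*
 * Hypothetical helper: release buffers [available, nbuffers) of a buffer
 * pool mapped at 'blocks' back to the kernel (Linux-specific behaviour).
 */
#include <stddef.h>
#include <sys/mman.h>

#define BLCKSZ 8192

static int
release_unused_buffers(char *blocks, size_t available, size_t nbuffers)
{
    if (available >= nbuffers)
        return 0;                   /* nothing ballooned out */

    return madvise(blocks + available * BLCKSZ,
                   (nbuffers - available) * BLCKSZ,
                   MADV_DONTNEED);  /* or MADV_REMOVE to free shared backing */
}
```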
Unfortunately there are many data structures in Postgres whose size depends on the number of buffers.
In my PR I am using the `GetAvailableBuffers()` function instead of `NBuffers`. But it doesn't always help, because many of these data structures can not be reallocated.
Other important limitations of this approach are:
1. It is necessary to specify a maximal number of shared buffers.
2. Only the `BufferBlocks` space is shrunk, not the buffer descriptors and buffer hash. The estimated memory footprint per page is 132 bytes, so if we want to be able to scale shared buffers from 100MB to 100GB, about 1.6GB of memory stays in use. And that is quite large.
3. Our CLOCK algorithm becomes very inefficient for a large number of shared buffers.
Below are the first results I get (pgbench database with scale 100, pgbench -c 32 -j 4 -T 100 -P1 -M prepared -S):
| shared_buffers | available_buffers | TPS  |
| -------------- | ----------------- | ---- |
| 128MB          | -1                | 280k |
| 1GB            | -1                | 324k |
| 2GB            | -1                | 358k |
| 32GB           | -1                | 350k |
| 2GB            | 128MB             | 130k |
| 2GB            | 1GB               | 311k |
| 32GB           | 128MB             | 13k  |
| 32GB           | 1GB               | 140k |
| 32GB           | 2GB               | 348k |
`shared_buffers` specifies the maximal shared buffers size, and `available_buffers` the current limit.
So when shared_buffers >> available_buffers and the dataset doesn't fit in them, we get an awful performance degradation (> 20 times), thanks to the CLOCK algorithm.
My first thought was to replace the clock with an LRU based on a doubly linked list. As there is no lockless doubly-linked-list implementation, it needs some global lock, and this lock can become a bottleneck. The standard solution is partitioning: use N LRU lists instead of 1, just like the partitioned hash table used by the buffer manager to look up buffers. Actually, we can use the same partition locks to protect the LRU lists. But it is not clear what to do with ring buffers (strategies). So I decided not to perform such a revolution in bufmgr, but to optimize the clock to skip reserved buffers more efficiently.
I just added a skip_count field to the buffer descriptor. And it helps! Now the worst case, shared_buffers/available_buffers = 32GB/128MB, shows the same performance, 280k TPS, as shared_buffers=128MB without ballooning.
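Purely to illustrate the general idea (this is not the code from the linked patch, and the descriptor layout is invented): when the clock hand lands on a ballooned-out buffer, a per-descriptor skip_count lets it jump over the whole unavailable run in one step instead of visiting every reserved buffer.

```c
/* Simplified clock sweep that jumps over ballooned-out buffers. */
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    bool     available;     /* false once the buffer is ballooned out */
    uint32_t skip_count;    /* >= 1: buffers to jump over when unavailable */
    uint32_t usage_count;
} FakeBufferDesc;

static FakeBufferDesc descs[1024];
static uint32_t nbuffers = 1024;
static uint32_t clock_hand = 0;

/* Assumes at least one buffer is still available. */
static uint32_t
clock_sweep_next_victim(void)
{
    for (;;)
    {
        FakeBufferDesc *d = &descs[clock_hand];

        if (!d->available)
        {
            /* Skip the entire run of reserved buffers in one step. */
            clock_hand = (clock_hand + d->skip_count) % nbuffers;
        }
        else if (d->usage_count > 0)
        {
            d->usage_count--;
            clock_hand = (clock_hand + 1) % nbuffers;
        }
        else
            return clock_hand;      /* found a victim */
    }
}
```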
On Sun, 18 Feb 2024 at 02:03, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2024-02-17 23:40:51 +0100, Matthias van de Meent wrote: > > > 5. Re-map the shared_buffers when needed. > > > > > > Between transactions, a backend should not hold any buffer pins. When > > > there are no pins, you can munmap() the shared_buffers and mmap() it at > > > a different address. > > I hadn't quite realized that we don't seem to rely on shared_buffers having a > specific address across processes. That does seem to make it more viable to > remap mappings in backends. > > > However, I don't think this works with mmap(MAP_ANONYMOUS) - as long as we are > using the process model. To my knowledge there is no way to get the same > mapping in multiple already existing processes. Even mmap()ing /dev/zero after > sharing file descriptors across processes doesn't work, if I recall correctly. > > We would have to use sysv/posix shared memory or such (or mmap() of files in > tmpfs) for the shared buffers allocation. > > > > > This can quite realistically fail to find an unused memory region of > > sufficient size when the heap is sufficiently fragmented, e.g. through > > ASLR, which would make it difficult to use this dynamic > > single-allocation shared_buffers in security-hardened environments. > > I haven't seen anywhere close to this bad fragmentation on 64bit machines so > far - have you? No. > Most implementations of ASLR randomize mmap locations across multiple runs of > the same binary, not within the same binary. There are out-of-tree Linux > patches that make mmap() randomize every single allocation, but I am not sure > that we ought to care about such things. After looking into ASLR a bit more, I realise I was under the mistaken impression that ASLR would imply randomized mmap()s, too. Apparently, that's wrong; ASLR only does some randomization for the initialization of the process memory layout, and not the process' allocations. > Even if we were to care, on 64bit platforms it doesn't seem likely that we'd > run out of space that quickly. AMD64 had 48bits of virtual address space from > the start, and on recent CPUs that has grown to 57bits [1], that's a lot of > space. Yeah, that's a lot of space, but it seems to me it's also easily consumed; one only needs to place one allocation in every 4GB of address space to make allocations of 8GB impossible; a utilization of ~1 byte/MiB. Applying this to 48 bits of virtual address space, a process only needs to use ~256MB of memory across the address space to block out any 8GB allocations; for 57 bits that's still "only" 128GB. But after looking at ASLR a bit more, it is unrealistic that a normal OS and process stack would get to allocating memory in such a pattern. > And if you do run out of VM space, wouldn't that also affect lots of other > things, like mmap() for malloc? Yes. But I would usually expect that the main shared memory allocation would be the single largest uninterrupted allocation, so I'd also expect it to see more such issues than any current user of memory if we were to start moving (reallocating) that allocation. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Hi, On 2024-02-18 17:06:09 +0530, Robert Haas wrote: > On Sat, Feb 17, 2024 at 12:38 AM Andres Freund <andres@anarazel.de> wrote: > > IMO the ability to *shrink* shared_buffers dynamically and cheaply is more > > important than growing it in a way, except that they are related of > > course. Idling hardware is expensive, thus overcommitting hardware is very > > attractive (I count "serverless" as part of that). To be able to overcommit > > effectively, unused long-lived memory has to be released. I.e. shared buffers > > needs to be shrinkable. > > I see your point, but people want to scale up, too. Of course, those > people will have to live with what we can practically implement. Sure, I didn't intend to say that scaling up isn't useful. > > Perhaps worth noting that there are two things limiting the size of shared > > buffers: 1) the available buffer space 2) the available buffer *mapping* > > space. I think making the buffer mapping resizable is considerably harder than > > the buffers themselves. Of course pre-reserving memory for a buffer mapping > > suitable for a huge shared_buffers is more feasible than pre-allocating all > > that memory for the buffers themselves. But it'd still mean you'd have a maximum > > set at server start. > > We size the fsync queue based on shared_buffers too. That's a lot less > important, though, and could be worked around in other ways. We probably should address that independently of making shared_buffers PGC_SIGHUP. The queue gets absurdly large once s_b hits a few GB. It's not that much memory compared to the buffer blocks themselves, but a sync queue of many millions of entries just doesn't make sense. And a few hundred MB for that isn't nothing either, even if it's just a fraction of the space for the buffers. It makes checkpointer more susceptible to OOM as well, because AbsorbSyncRequests() allocates an array to copy all requests into local memory. > > Such a scheme still leaves you with a dependent memory read for a quite > > frequent operation. It could turn out to not matter hugely if the mapping > > array is cache resident, but I don't know if we can realistically bank on > > that. > > I don't know, either. I was hoping you did. :-) > > But we can rig up a test pretty easily, I think. We can just create a > fake mapping that gives the same answers as the current calculation > and then beat on it. Of course, if testing shows no difference, there > is the small problem of knowing whether the test scenario was right; > and it's also possible that an initial impact could be mitigated by > removing some gratuitously repeated buffer # -> buffer address > mappings. Still, I think it could provide us with a useful baseline. > I'll throw something together when I have time, unless someone beats > me to it. I think such a test would be useful, although I also don't know how confident we would be if we saw positive results. Probably depends a bit on the generated code and how plausible it is to not see regressions. > > I'm also somewhat concerned about the coarse granularity being problematic. It > > seems like it'd lead to a desire to make the granule small, causing slowness. > > How many people set shared_buffers to something that's not a whole > number of GB these days? I'd say the vast majority of postgres instances in production run with less than 1GB of s_b. Just because, numbers-wise, the majority of instances are running on small VMs and/or many PG instances are running on one larger machine.
There are a lot of instances where the total available memory is less than 2GB. > I mean I bet it happens, but in practice if you rounded to the nearest GB, > or even the nearest 2GB, I bet almost nobody would really care. I think it's > fine to be opinionated here and hold the line at a relatively large granule, > even though in theory people could want something else. I don't believe that at all unfortunately. > > One big advantage of a scheme like this is that it'd be a step towards a NUMA > > aware buffer mapping and replacement. Practically everything beyond the size > > of a small consumer device these days has NUMA characteristics, even if not > > "officially visible". We could make clock sweeps (or a better victim buffer > > selection algorithm) happen within each "chunk", with some additional > > infrastructure to choose which of the chunks to search a buffer in. Using a > > chunk on the current NUMA node, except when there is a lot of imbalance > > between buffer usage or replacement rate between chunks. > > I also wondered whether this might be a useful step toward allowing > different-sized buffers in the same buffer pool (ducks, runs away > quickly). I don't have any particular use for that myself, but it's a > thing some people probably want for some reason or other. I still think that that's something that will just cause a significant cost in complexity, and secondarily also runtime overhead, at a comparatively marginal gain. > > > 2. Make a Buffer just a disguised pointer. Imagine something like > > > typedef struct { Page bp; } *buffer. With this approach, > > > BufferGetBlock() becomes trivial. > > > > You additionally need something that allows for efficient iteration over > > all shared buffers. Making buffer replacement and checkpointing more expensive > > isn't great. > > True, but I don't really see what the problem with this would be in > this approach. It's a bit hard to tell at this level of detail :). At the extreme end, if you end up with a large number of separate allocations for s_b, it surely would. > > > 3. Reserve lots of address space and then only use some of it. I hear > > > rumors that some forks of PG have implemented something like this. The > > > idea is that you convince the OS to give you a whole bunch of address > > > space, but you try to avoid having all of it be backed by physical > > > memory. If you later want to increase shared_buffers, you then get the > > > OS to back more of it by physical memory, and if you later want to > > > decrease shared_buffers, you hopefully have some way of giving the OS > > > the memory back. As compared with the previous two approaches, this > > > seems less likely to be noticeable to most PG code. > > > > Another advantage is that you can shrink shared buffers fairly granularly and > > cheaply with that approach, compared to having to move buffers entirely out of > > a larger mapping to be able to unmap it. > > Don't you have to still move buffers entirely out of the region you > want to unmap? Sure. But you can unmap at the granularity of a hardware page (there is some fragmentation cost on the OS / hardware page table level though). Theoretically you could unmap individual 8kB pages. > > > Problems include (1) you have to somehow figure out how much address space > > > to reserve, and that forms an upper bound on how big shared_buffers can grow > > > at runtime, and > > > > Presumably you'd normally not want to reserve more than the physical amount of > > memory on the system.
Sure, memory can be hot added, but IME that's quite > > rare. > > I would think that might not be so rare in a virtualized environment, > which would seem to be one of the most important use cases for this > kind of thing. I've not seen it in production in a long time - but that might be because I've been out of the consulting game for too long. To my knowledge none of the common cloud providers support it, which of course restricts where it could be used significantly. I have far more commonly seen use of "ballooning" to remove unused/rarely used memory from running instances though. > Plus, this would mean we'd need to auto-detect system RAM. I'd rather > not go there, and just fix the upper limit via a GUC. I'd have assumed we'd want a GUC that auto-determines the amount of RAM if set to -1. I don't think it's that hard to detect the available memory. > > A third issue is that it can confuse administrators inspecting the system with > > OS tools. "Postgres uses many terabytes of memory on my system!" due to VIRT > > being huge etc. > > Mmph. That's disagreeable but probably not a reason to entirely > abandon any particular approach. Agreed. Greetings, Andres Freund
On Mon, Feb 19, 2024 at 2:05 AM Andres Freund <andres@anarazel.de> wrote: > We probably should address that independently of making shared_buffers > PGC_SIGHUP. The queue gets absurdly large once s_b hits a few GB. It's not > that much memory compared to the buffer blocks themselves, but a sync queue of > many millions of entries just doesn't make sense. And a few hundred MB for > that isn't nothing either, even if it's just a fraction of the space for the > buffers. It makes checkpointer more susceptible to OOM as well, because > AbsorbSyncRequests() allocates an array to copy all requests into local > memory. Sure, that could just be capped, if it makes sense. Although given the thrust of this discussion, it might be even better to couple it to something other than the size of shared_buffers. > I'd say the vast majority of postgres instances in production run with less > than 1GB of s_b. Just because numbers wise the majority of instances are > running on small VMs and/or many PG instances are running on one larger > machine. There are a lot of instances where the total available memory is > less than 2GB. Whoa. That is not my experience at all. If I've ever seen such a small system since working at EDB (since 2010!) it was just one where the initdb-time default was never changed. I can't help wondering if we should have some kind of memory_model GUC, measured in T-shirt sizes or something. We've coupled a bunch of things to shared_buffers mostly as a way of distinguishing small systems from large ones. But if we want to make shared_buffers dynamically changeable and we don't want to make all that other stuff dynamically changeable, decoupling those calculations might be an important thing to do. On a really small system, do we even need the ability to dynamically change shared_buffers at all? If we do, then I suspect the granule needs to be small. But does someone want to take a system with <1GB of shared_buffers and then scale it way, way up? I suppose it would be nice to have the option. But you might have to make some choices, like pick either a 16MB granule or a 128MB granule or a 1GB granule at startup time and then stick with it? I don't know, I'm just spitballing here, because I don't know what the real design is going to look like yet. > > Don't you have to still move buffers entirely out of the region you > > want to unmap? > > Sure. But you can unmap at the granularity of a hardware page (there is some > fragmentation cost on the OS / hardware page table level > though). Theoretically you could unmap individual 8kB pages. I thought there were problems, at least on some operating systems, if the address space mappings became too fragmented. At least, I wouldn't expect that you could use huge pages for shared_buffers and still unmap little tiny bits. How would that even work? -- Robert Haas EDB: http://www.enterprisedb.com
On 2/18/24 15:35, Andres Freund wrote: > On 2024-02-18 17:06:09 +0530, Robert Haas wrote: >> How many people set shared_buffers to something that's not a whole >> number of GB these days? > > I'd say the vast majority of postgres instances in production run with less > than 1GB of s_b. Just because numbers wise the majority of instances are > running on small VMs and/or many PG instances are running on one larger > machine. There are a lot of instances where the total available memory is > less than 2GB. > >> I mean I bet it happens, but in practice if you rounded to the nearest GB, >> or even the nearest 2GB, I bet almost nobody would really care. I think it's >> fine to be opinionated here and hold the line at a relatively large granule, >> even though in theory people could want something else. > > I don't believe that at all unfortunately. Couldn't we scale the rounding, e.g. allow small allocations as we do now, but above some number always round? E.g. maybe >= 2GB round to the nearest 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, etc? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 2024-02-19 09:19:16 -0500, Joe Conway wrote: > On 2/18/24 15:35, Andres Freund wrote: > > On 2024-02-18 17:06:09 +0530, Robert Haas wrote: > > > How many people set shared_buffers to something that's not a whole > > > number of GB these days? > > > > I'd say the vast majority of postgres instances in production run with less > > than 1GB of s_b. Just because numbers wise the majority of instances are > > running on small VMs and/or many PG instances are running on one larger > > machine. There are a lot of instances where the total available memory is > > less than 2GB. > > > > > I mean I bet it happens, but in practice if you rounded to the nearest GB, > > > or even the nearest 2GB, I bet almost nobody would really care. I think it's > > > fine to be opinionated here and hold the line at a relatively large granule, > > > even though in theory people could want something else. > > > > I don't believe that at all unfortunately. > > Couldn't we scale the rounding, e.g. allow small allocations as we do now, > but above some number always round? E.g. maybe >= 2GB round to the nearest > 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, > etc? That'd make the translation considerably more expensive. Which is important, given how common an operation this is. Greetings, Andres Freund
On 2/19/24 13:13, Andres Freund wrote: > On 2024-02-19 09:19:16 -0500, Joe Conway wrote: >> Couldn't we scale the rounding, e.g. allow small allocations as we do now, >> but above some number always round? E.g. maybe >= 2GB round to the nearest >> 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, >> etc? > > That'd make the translation considerably more expensive. Which is important, > given how common an operation this is. Perhaps it is not practical, doesn't help, or maybe I misunderstand, but my intent was that the rounding be done/enforced when setting the GUC value which surely cannot be that often. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
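Just to spell out what Joe seems to be describing, a sketch of the rounding as it might run once when the GUC value is set (the function name, units, and thresholds are invented, following his example):

```c
/*
 * Hypothetical scaled rounding for shared_buffers, applied when the value is
 * set rather than in the buffer-lookup path.  Values are in 8kB buffers.
 */
static int
round_shared_buffers(int nbuffers)
{
    const int gb = (1024 * 1024 * 1024) / 8192;  /* buffers per GB */
    int       granule;

    if (nbuffers >= 8 * gb)
        granule = gb;                            /* >= 8GB: nearest 1GB */
    else if (nbuffers >= 4 * gb)
        granule = gb / 2;                        /* >= 4GB: nearest 512MB */
    else if (nbuffers >= 2 * gb)
        granule = gb / 4;                        /* >= 2GB: nearest 256MB */
    else
        return nbuffers;                         /* small settings stay as-is */

    return ((nbuffers + granule / 2) / granule) * granule;
}
```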
Hi, On 2024-02-19 13:54:01 -0500, Joe Conway wrote: > On 2/19/24 13:13, Andres Freund wrote: > > On 2024-02-19 09:19:16 -0500, Joe Conway wrote: > > > Couldn't we scale the rounding, e.g. allow small allocations as we do now, > > > but above some number always round? E.g. maybe >= 2GB round to the nearest > > > 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, > > > etc? > > > > That'd make the translation considerably more expensive. Which is important, > > given how common an operation this is. > > > Perhaps it is not practical, doesn't help, or maybe I misunderstand, but my > intent was that the rounding be done/enforced when setting the GUC value > which surely cannot be that often. It'd be used for something like WhereIsTheChunkOfBuffers[buffer/CHUNK_SIZE]+(buffer%CHUNK_SIZE)*BLCKSZ; If CHUNK_SIZE isn't a compile-time constant this gets a good bit more expensive. A lot more, if implemented naively (i.e. as actual modulo/division operations, instead of translating to shifts and masks). Greetings, Andres Freund
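To illustrate the point: with a power-of-two CHUNK_SIZE known at compile time, the compiler lowers the division and modulo to a shift and a mask, whereas a runtime granule needs actual divide instructions (or separately maintained shift/mask values). The names below are made up.

```c
#define BLCKSZ     8192
#define CHUNK_SIZE 131072                  /* buffers per 1GB chunk, power of two */

extern char *chunk_map[];
extern int   chunk_size_runtime;           /* what a settable granule would be */

/* Compiles to shift + mask + constant multiply + add. */
char *
translate_constant(int buf)
{
    return chunk_map[buf / CHUNK_SIZE] + (long) (buf % CHUNK_SIZE) * BLCKSZ;
}

/* Needs integer division (and a derived remainder) at runtime. */
char *
translate_runtime(int buf)
{
    return chunk_map[buf / chunk_size_runtime] +
        (long) (buf % chunk_size_runtime) * BLCKSZ;
}
```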