Thread: Re: Changing shared_buffers without restart

Re: Changing shared_buffers without restart

From
Robert Haas
Date:
On Fri, Oct 18, 2024 at 3:21 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> changing shared memory mapping layout. Any feedback is appreciated.

A lot of people would like to have this feature, so I hope this
proposal works out. Thanks for working on it.

I think the idea of having multiple shared memory segments is
interesting and makes sense, but I would prefer to see them called
"segments" rather than "slots" just as we do for DSMs. The name
"slot" is somewhat overused, and invites confusion with replication
slots, inter alia. I think it's possible that having multiple fixed
shared memory segments will spell trouble on Windows, where we already
need to use a retry loop to try to get the main shared memory segment
mapped at the correct address. If there are multiple segments and we
need whatever ASLR stuff happens on Windows to not place anything else
overlapping with any of them, that means there's more chances for
stuff to fail than if we just need one address range to be free.
Granted, the individual ranges are smaller, so maybe it's fine? But I
don't know.

The big thing that worries me is synchronization, and while I've only
looked at the patch set briefly, it doesn't look to me as though
there's enough machinery here to make that work correctly. Suppose
that shared_buffers=8GB (a million buffers) and I change it to
shared_buffers=16GB (2 million buffers). As soon as any one backend
has seen that changed and expanded shared_buffers, there's a
possibility that some other backend which has not yet seen the change
might see a buffer number greater than a million. If it tries to use
that buffer number before it absorbs the change, something bad will
happen. The most obvious way for it to see such a buffer number - and
possibly the only one - is to do a lookup in the buffer mapping table
and find a buffer ID there that was inserted by some other backend
that has already seen the change.

Fixing this seems tricky. My understanding is that BufferGetBlock() is
extremely performance-critical, so having to do a bounds check there
to make sure that a given buffer number is in range would probably be
bad for performance. Also, even if the overhead weren't prohibitive, I
don't think we can safely stick code that unmaps and remaps shared
memory segments into a function that currently just does math, because
we've probably got places where we assume this operation can't fail --
as well as places where we assume that if we call BufferGetBlock(i)
and then BufferGetBlock(j), the second call won't change the answer to
the first.
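
To make the arithmetic concrete, here is a minimal sketch (not PostgreSQL
source) of why the lookup is "just math" and what a backend with a stale
notion of the pool size would get wrong. SketchPool, sketch_get_block,
sketch_buffer_in_range and sketch_buffers_for_bytes are hypothetical names;
the 8kB block size and 1-based buffer numbers are assumptions matching the
default build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192          /* assumed default block size */

/* One backend's private view of the buffer pool (hypothetical type). */
typedef struct SketchPool
{
    char   *base;                   /* where this backend mapped the blocks */
    int     nbuffers;               /* how many buffers this backend knows of */
} SketchPool;

/* Pure address arithmetic, no bounds check; buffer numbers start at 1. */
char *
sketch_get_block(const SketchPool *pool, int buffer)
{
    return pool->base + (size_t) (buffer - 1) * SKETCH_BLCKSZ;
}

/* The bounds check discussed above as probably too costly for the hot path. */
int
sketch_buffer_in_range(const SketchPool *pool, int buffer)
{
    return buffer >= 1 && buffer <= pool->nbuffers;
}

/* Number of buffers that fit in a given shared_buffers size in bytes. */
int64_t
sketch_buffers_for_bytes(int64_t bytes)
{
    return bytes / SKETCH_BLCKSZ;
}
```

With 8GB the pool holds 1048576 buffers. A backend that still believes that
number would reject any higher buffer number in sketch_buffer_in_range, while
sketch_get_block would silently compute an address past its own mapping.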

It seems to me that it's probably only safe to swap out a backend's
notion of where shared_buffers is located when the backend holds no
buffer pins, and maybe not even all such places, because it would be a
problem if a backend looks up the address of a buffer before actually
pinning it, on the assumption that the answer can't change. I don't
know if that ever happens, but it would be a legal coding pattern
today. Doing it between statements seems safe as long as there are no
cursors holding pins. Doing it in the middle of a statement is
probably possible if we can verify that we're at a "safe" point in the
code, but I'm not sure exactly which points are safe. If we have no
code anywhere that assumes the address of an unpinned buffer can't
change before we pin it, then I guess the check for pins is the only
thing we need, but I don't know that to be the case.

I guess I would have imagined that a change like this would have to be
done in phases. In phase 1, we'd tell all of the backends that
shared_buffers had expanded to some new, larger value; but the new
buffers wouldn't be usable for anything yet. Then, once we confirmed
that everyone had the memo, we'd tell all the backends that those
buffers are now available for use. If shared_buffers were contracted,
phase 1 would tell all of the backends that shared_buffers had
contracted to some new, smaller value. Once a particular backend
learns about that, they will refuse to put any new pages into those
high-numbered buffers, but the existing contents would still be valid.
Once everyone has been told about this, we can go through and evict
all of those buffers, and then let everyone know that's done. Then
they shrink their mappings.
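
The phases above can be sketched as a small state machine. This is only an
illustration under simplifying assumptions: ResizeState, resize_begin and
resize_ack are hypothetical names, acknowledgements are counted in one
process rather than signalled between backends, and the shrink case folds
"evict, announce, then unmap" into a single final step.

```c
#include <assert.h>

/* Hypothetical coordinator state; none of these names come from the patch. */
typedef struct ResizeState
{
    int     mapped_nbuffers;    /* buffers every backend must keep mapped */
    int     usable_nbuffers;    /* buffers that may receive new pages */
    int     target_nbuffers;    /* size requested by the pending resize */
    int     nbackends;          /* backends that must acknowledge */
    int     acks;               /* acknowledgements received so far */
} ResizeState;

/* Phase 1: announce the new size without making it fully effective. */
void
resize_begin(ResizeState *s, int target)
{
    s->target_nbuffers = target;
    s->acks = 0;
    if (target > s->mapped_nbuffers)
        s->mapped_nbuffers = target;    /* expand: map first, use later */
    else
        s->usable_nbuffers = target;    /* shrink: stop filling the tail */
}

/*
 * A backend reports that it has absorbed phase 1.  Returns 1 once everyone
 * has, at which point phase 2 takes effect; returns 0 otherwise.
 */
int
resize_ack(ResizeState *s)
{
    if (++s->acks < s->nbackends)
        return 0;
    /* On shrink, eviction of the buffers beyond target would happen here. */
    s->usable_nbuffers = s->target_nbuffers;
    s->mapped_nbuffers = s->target_nbuffers;
    assert(s->usable_nbuffers <= s->mapped_nbuffers);
    return 1;
}
```

The property the sketch tries to preserve is that usable_nbuffers never
exceeds mapped_nbuffers, so no backend can be handed a buffer number it has
not mapped yet.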

It looks to me like the patch doesn't expand the buffer mapping table,
which seems essential. But maybe I missed that.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Mon, Nov 25, 2024 at 02:33:48PM GMT, Robert Haas wrote:
>
> I think the idea of having multiple shared memory segments is
> interesting and makes sense, but I would prefer to see them called
> "segments" rather than "slots" just as do we do for DSMs. The name
> "slot" is somewhat overused, and invites confusion with replication
> slots, inter alia. I think it's possible that having multiple fixed
> shared memory segments will spell trouble on Windows, where we already
> need to use a retry loop to try to get the main shared memory segment
> mapped at the correct address. If there are multiple segments and we
> need whatever ASLR stuff happens on Windows to not place anything else
> overlapping with any of them, that means there's more chances for
> stuff to fail than if we just need one address range to be free.
> Granted, the individual ranges are smaller, so maybe it's fine? But I
> don't know.

I haven't had a chance to experiment with that on Windows, but I'm
hoping that, in the worst case, falling back to a single mapping via the
proposed infrastructure (with the consequent limitations) would be acceptable.

> The big thing that worries me is synchronization, and while I've only
> looked at the patch set briefly, it doesn't look to me as though
> there's enough machinery here to make that work correctly. Suppose
> that shared_buffers=8GB (a million buffers) and I change it to
> shared_buffers=16GB (2 million buffers). As soon as any one backend
> has seen that changed and expanded shared_buffers, there's a
> possibility that some other backend which has not yet seen the change
> might see a buffer number greater than a million. If it tries to use
> that buffer number before it absorbs the change, something bad will
> happen. The most obvious way for it to see such a buffer number - and
> possibly the only one - is to do a lookup in the buffer mapping table
> and find a buffer ID there that was inserted by some other backend
> that has already seen the change.

Right, I haven't put much effort into synchronization yet. It's on my
to-do list for the next iteration of the patch.

> code, but I'm not sure exactly which points are safe. If we have no
> code anywhere that assumes the address of an unpinned buffer can't
> change before we pin it, then I guess the check for pins is the only
> thing we need, but I don't know that to be the case.

Probably I'm missing something here. What scenario do you have in mind,
when the address of a buffer is changing?

> I guess I would have imagined that a change like this would have to be
> done in phases. In phase 1, we'd tell all of the backends that
> shared_buffers had expanded to some new, larger value; but the new
> buffers wouldn't be usable for anything yet. Then, once we confirmed
> that everyone had the memo, we'd tell all the backends that those
> buffers are now available for use. If shared_buffers were contracted,
> phase 1 would tell all of the backends that shared_buffers had
> contracted to some new, smaller value. Once a particular backend
> learns about that, they will refuse to put any new pages into those
> high-numbered buffers, but the existing contents would still be valid.
> Once everyone has been told about this, we can go through and evict
> all of those buffers, and then let everyone know that's done. Then
> they shrink their mappings.

Yep, sounds good. I was pondering a cruder approach, but doing
this in phases seems to be the way to go.

> It looks to me like the patch doesn't expand the buffer mapping table,
> which seems essential. But maybe I missed that.

Do you mean the "Shared Buffer Lookup Table"? It does expand it, but
under the somewhat unfitting name STRATEGY_SHMEM_SLOT. But now that I look
at the code, I see a few issues around that -- so I would have to
improve it anyway, thanks for pointing that out.



Re: Changing shared_buffers without restart

From
Robert Haas
Date:
On Tue, Nov 26, 2024 at 2:18 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I haven't had a chance to experiment with that on Windows, but I'm
> hoping that in the worst case fallback to a single mapping via proposed
> infrastructure (and the consequent limitations) would be acceptable.

Yeah, if you can still fall back to a single mapping, I think that's
OK. It would be nicer if it could work on every platform in the same
way, but half a loaf is better than none.

> > code, but I'm not sure exactly which points are safe. If we have no
> > code anywhere that assumes the address of an unpinned buffer can't
> > change before we pin it, then I guess the check for pins is the only
> > thing we need, but I don't know that to be the case.
>
> Probably I'm missing something here. What scenario do you have in mind,
> when the address of a buffer is changing?

I was assuming that if you expand the mapping for shared_buffers, you
can't count on the new mapping being at the same address as the old
mapping. If you can, that makes things simpler, but what if the OS has
mapped something else just afterward, in the address space that you're
hoping to use when you expand the mapping?

> > It looks to me like the patch doesn't expand the buffer mapping table,
> > which seems essential. But maybe I missed that.
>
> Do you mean the "Shared Buffer Lookup Table"? It does expand it, but
> under somewhat unfitting name STRATEGY_SHMEM_SLOT. But now that I look
> at the code, I see a few issues around that -- so I would have to
> improve it anyway, thanks for pointing that out.

Yeah, we -- or at least I -- usually call that the buffer mapping
table. There are identifiers like BufMappingPartitionLock, for
example. I'm slightly surprised that the ShmemInitHash() call uses
something else as the identifier, but I guess that's how it is.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Wed, Nov 27, 2024 at 10:20:27AM GMT, Robert Haas wrote:
> > >
> > > code, but I'm not sure exactly which points are safe. If we have no
> > > code anywhere that assumes the address of an unpinned buffer can't
> > > change before we pin it, then I guess the check for pins is the only
> > > thing we need, but I don't know that to be the case.
> >
> > Probably I'm missing something here. What scenario do you have in mind,
> > when the address of a buffer is changing?
>
> I was assuming that if you expand the mapping for shared_buffers, you
> can't count on the new mapping being at the same address as the old
> mapping. If you can, that makes things simpler, but what if the OS has
> mapped something else just afterward, in the address space that you're
> hoping to use when you expand the mapping?

Yes, that's the whole point of the exercise with remap -- to keep
addresses unchanged, making buffer management simpler and allowing
mappings to be resized more quickly. The trade-off is that we would need
to take care of where the shared mappings are placed.

My understanding is that clashing of mappings (either at creation time
or when resizing) could happen only within the process address space,
and the assumption is that by the time we prepare the mapping layout all
the rest of mappings for the process are already done. But I agree, it's
an interesting question -- I'm going to investigate if those assumptions
could be wrong under certain conditions. Currently if something else is
mapped at the same address where we want to expand the mapping, we will
get an error and can decide how to proceed (e.g. if it happens at
creation time, proceed with a single mapping, otherwise ignore mapping
resize).
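
As a rough illustration of "get an error and decide how to proceed", the
sketch below assumes the resize is attempted with Linux mremap(2) without
MREMAP_MAYMOVE. That is an assumption about the mechanism, not a quote from
the patch, and try_resize_in_place is a hypothetical helper name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for mremap() and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Try to resize a mapping without letting the kernel move it.  Returns 0 if
 * the mapping now spans new_size bytes at the same address, or -1 with errno
 * set (typically ENOMEM when the adjacent range is occupied), in which case
 * the old mapping is left untouched.
 */
int
try_resize_in_place(void *addr, size_t old_size, size_t new_size)
{
    /* flags = 0: no MREMAP_MAYMOVE, so the address can never change */
    void   *p = mremap(addr, old_size, new_size, 0);

    if (p == MAP_FAILED)
        return -1;
    assert(p == addr);
    return 0;
}
```

A caller that gets -1 back can keep running with the old size, which
corresponds to the "ignore mapping resize" option mentioned above.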



Re: Changing shared_buffers without restart

From
Robert Haas
Date:
On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> My understanding is that clashing of mappings (either at creation time
> or when resizing) could happen only withing the process address space,
> and the assumption is that by the time we prepare the mapping layout all
> the rest of mappings for the process are already done.

I don't think that's correct at all. First, the user could type LOAD
'whatever' at any time. But second, even if they don't or you prohibit
them from doing so, the process could allocate memory for any of a
million different things, and that could require mapping a new region
of memory, and the OS could choose to place that just after an
existing mapping, or at least close enough that we can't expand the
object size as much as desired.

If we had an upper bound on the size of shared_buffers and could
reserve that amount of address space at startup time but only actually
map a portion of it, then we could later remap and expand into the
reserved space. Without that, I think there's absolutely no guarantee
that the amount of address space that we need is available when we
want to extend a mapping.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Changing shared_buffers without restart

From
Jelte Fennema-Nio
Date:
On Wed, 27 Nov 2024 at 22:06, Robert Haas <robertmhaas@gmail.com> wrote:
> If we had an upper bound on the size of shared_buffers

I think a fairly reliable upper bound is the amount of physical memory
on the system at time of postmaster start. We could make it a GUC to
set the upper bound for the rare cases where people do stuff like
adding swap space later or doing online VM growth. We could even have
the default be something like 4x the physical memory to accommodate
those people by default.
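
A minimal sketch of how such a default could be computed, assuming a
platform where sysconf() reports _SC_PHYS_PAGES (a glibc/Linux extension,
not POSIX). The function name default_shared_buffers_ceiling and the idea of
passing the multiplier as a parameter are illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical default for an upper-bound setting: physical memory at
 * startup times a multiplier.  Returns 0 if the platform does not report it.
 */
uint64_t
default_shared_buffers_ceiling(unsigned multiplier)
{
    long    pages = sysconf(_SC_PHYS_PAGES);    /* glibc/Linux extension */
    long    page_size = sysconf(_SC_PAGESIZE);

    if (pages <= 0 || page_size <= 0)
        return 0;
    return (uint64_t) pages * (uint64_t) page_size * multiplier;
}
```

Calling it with a multiplier of 4 would give the "4x the physical memory"
default suggested above.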

> reserve that amount of address space at startup time but only actually
> map a portion of it

Or is this the difficult part?



Re: Changing shared_buffers without restart

From
Andres Freund
Date:
Hi,

On 2024-11-27 16:05:47 -0500, Robert Haas wrote:
> On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > My understanding is that clashing of mappings (either at creation time
> > or when resizing) could happen only withing the process address space,
> > and the assumption is that by the time we prepare the mapping layout all
> > the rest of mappings for the process are already done.
> 
> I don't think that's correct at all. First, the user could type LOAD
> 'whatever' at any time. But second, even if they don't or you prohibit
> them from doing so, the process could allocate memory for any of a
> million different things, and that could require mapping a new region
> of memory, and the OS could choose to place that just after an
> existing mapping, or at least close enough that we can't expand the
> object size as much as desired.
> 
> If we had an upper bound on the size of shared_buffers and could
> reserve that amount of address space at startup time but only actually
> map a portion of it, then we could later remap and expand into the
> reserved space. Without that, I think there's absolutely no guarantee
> that the amount of address space that we need is available when we
> want to extend a mapping.

Strictly speaking we don't actually need to map shared buffers to the same
location in each process... We do need that for most other uses of shared
memory, including the buffer mapping table, but not for the buffer data
itself.

Whether it's worth the complexity of dealing with differing locations is
another matter.

Greetings,

Andres Freund



Re: Changing shared_buffers without restart

From
Robert Haas
Date:
On Wed, Nov 27, 2024 at 4:28 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> On Wed, 27 Nov 2024 at 22:06, Robert Haas <robertmhaas@gmail.com> wrote:
> > If we had an upper bound on the size of shared_buffers
>
> I think a fairly reliable upper bound is the amount of physical memory
> on the system at time of postmaster start. We could make it a GUC to
> set the upper bound for the rare cases where people do stuff like
> adding swap space later or doing online VM growth. We could even have
> the default be something like 4x the physical memory to accommodate
> those people by default.

Yes, Peter mentioned similar ideas on this thread last week.

> > reserve that amount of address space at startup time but only actually
> > map a portion of it
>
> Or is this the difficult part?

I'm not sure how difficult this is, although I'm pretty sure that it's
more difficult than adding a GUC. My point wasn't so much whether this
is easy or hard but rather that it's essential if you want to avoid
having addresses change when the resizing happens.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Changing shared_buffers without restart

From
Robert Haas
Date:
On Wed, Nov 27, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote:
> Strictly speaking we don't actually need to map shared buffers to the same
> location in each process... We do need that for most other uses of shared
> memory, including the buffer mapping table, but not for the buffer data
> itself.

Well, if it can move, then you have to make sure it doesn't move while
someone's holding onto a pointer into it. I'm not exactly sure how
hard it is to guarantee that, but we certainly do construct pointers
into shared_buffers and use them at least for short periods of time,
so it's not a purely academic concern.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Wed, Nov 27, 2024 at 04:05:47PM GMT, Robert Haas wrote:
> On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > My understanding is that clashing of mappings (either at creation time
> > or when resizing) could happen only withing the process address space,
> > and the assumption is that by the time we prepare the mapping layout all
> > the rest of mappings for the process are already done.
>
> I don't think that's correct at all. First, the user could type LOAD
> 'whatever' at any time. But second, even if they don't or you prohibit
> them from doing so, the process could allocate memory for any of a
> million different things, and that could require mapping a new region
> of memory, and the OS could choose to place that just after an
> existing mapping, or at least close enough that we can't expand the
> object size as much as desired.
>
> If we had an upper bound on the size of shared_buffers and could
> reserve that amount of address space at startup time but only actually
> map a portion of it, then we could later remap and expand into the
> reserved space. Without that, I think there's absolutely no guarantee
> that the amount of address space that we need is available when we
> want to extend a mapping.

I've just done a couple of experiments, and I think this could be addressed by
careful placement of mappings as well, based on two assumptions: for a new
mapping the kernel always picks the lowest address that allows enough space,
and the maximum amount of allocatable memory for other mappings can be derived
from total available memory. With that in mind, the shared mapping layout will
have to have a large gap at the start, between the lowest address and the
shared mappings used for buffers and the rest -- the gap where all the other
mappings (allocations, libraries, madvise, etc.) will land. It's similar to the
address space reservation you mentioned above, will reduce the possibility of
clashing significantly, and looks something like this:

    01339000-0139e000 [heap]
    0139e000-014aa000 [heap]
    7f2dd72f6000-7f2dfbc9c000 /memfd:strategy (deleted)
    7f2e0209c000-7f2e269b0000 /memfd:checkpoint (deleted)
    7f2e2cdb0000-7f2e516b4000 /memfd:iocv (deleted)
    7f2e57ab4000-7f2e7c478000 /memfd:descriptors (deleted)
    7f2ebc478000-7f2ee8d3c000 /memfd:buffers (deleted)
    ^ note the distance between two mappings,
      which is intended for resize
    7f3168d3c000-7f318d600000 /memfd:main (deleted)
    ^ here is where the gap starts
    7f4194c00000-7f4194e7d000
    ^ this one is an anonymous mapping created due to large
      memory allocation after shared mappings were created
    7f4195000000-7f419527d000
    7f41952dc000-7f4195416000
    7f4195416000-7f4195600000 /dev/shm/PostgreSQL.2529797530
    7f4195600000-7f41a311d000 /usr/lib/locale/locale-archive
    7f41a317f000-7f41a3200000
    7f41a3200000-7f41a3201000 /usr/lib64/libicudata.so.74.2

The assumption about picking the lowest address is just how it works right now
on Linux, and this fact is already used in the patch. The idea that we could put
an upper bound on the size of other mappings based on total available memory
comes from the fact that anonymous mappings that are much larger than memory
will fail without overcommit. With overcommit it becomes different, but if
allocations are hitting that limit I can imagine there are bigger problems than
shared buffer resize.

This approach follows the same ideas already used in the patch, and has the
same trade-offs: no address changes, but questions about portability.
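
The room left for resizing can be read straight off the listing above. In
that listing the buffers mapping ends at 7f2ee8d3c000 and the next mapping
(main) starts at 7f3168d3c000, which is 10 GiB of free address space. The
sketch below does that subtraction for two /proc/<pid>/maps lines;
parse_maps_range and maps_gap are hypothetical helper names.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Parse the "start-end" prefix of one /proc/<pid>/maps line. */
int
parse_maps_range(const char *line, uint64_t *start, uint64_t *end)
{
    return sscanf(line, "%" SCNx64 "-%" SCNx64, start, end) == 2 ? 0 : -1;
}

/*
 * Free address space between the end of the lower mapping and the start of
 * the next one above it.  Returns -1 on parse error or if they overlap.
 */
int
maps_gap(const char *lower, const char *upper, uint64_t *gap)
{
    uint64_t    lo_start, lo_end, up_start, up_end;

    if (parse_maps_range(lower, &lo_start, &lo_end) != 0 ||
        parse_maps_range(upper, &up_start, &up_end) != 0 ||
        up_start < lo_end)
        return -1;
    *gap = up_start - lo_end;
    return 0;
}
```

Applied to the strategy and checkpoint lines of the same listing, the helper
reports a 100 MiB gap.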



Re: Changing shared_buffers without restart

From
Robert Haas
Date:
On Thu, Nov 28, 2024 at 11:30 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> on Linux, this fact is already used in the patch. The idea that we could put
> upper boundary on the size of other mappings based on total available memory
> comes from the fact that anonymous mappings, that are much larger than memory,
> will fail without overcommit. With overcommit it becomes different, but if
> allocations are hitting that limit I can imagine there are bigger problems than
> shared buffer resize.
>
> This approach follows the same ideas already used in the patch, and have the
> same trade offs: no address changes, but questions about portability.

I definitely welcome the fact that you have some platform-specific
knowledge of the Linux behavior, because that's expertise that is
obviously quite useful here and which I lack. I'm personally not
overly concerned about whether it works on every other platform -- I
would prefer an implementation that works everywhere, but I'd rather
have one that works on Linux than have nothing. It's unclear to me why
operating systems don't offer better primitives for this sort of thing
-- in theory there could be a system call that sets aside a pool of
address space and then other system calls that let you allocate
shared/unshared memory within that space or even at specific
addresses, but actually such things don't exist.

All that having been said, what does concern me a bit is our ability
to predict what Linux will do well enough to keep what we're doing
safe; and also whether the Linux behavior might abruptly change in the
future. Users would be sad if we released this feature and then a
future kernel upgrade causes PostgreSQL to completely stop working. I
don't know how the Linux kernel developers actually feel about this
sort of thing, but if I imagine myself as a kernel developer, I can
totally see myself saying "well, we never promised that this would
work in any particular way, so we're free to change it whenever we
like." We've certainly used that argument here countless times.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Changing shared_buffers without restart

From
Matthias van de Meent
Date:
On Thu, 28 Nov 2024 at 18:19, Robert Haas <robertmhaas@gmail.com> wrote:
>
> [...] It's unclear to me why
> operating systems don't offer better primitives for this sort of thing
> -- in theory there could be a system call that sets aside a pool of
> address space and then other system calls that let you allocate
> shared/unshared memory within that space or even at specific
> addresses, but actually such things don't exist.

Isn't that more a stdlib/malloc issue? AFAIK, Linux's mmap(2) syscall
allows you to request memory from the OS at arbitrary addresses - it's
just that stdlib's malloc doesn't expose the 'alloc at this address'
part of that API.

Windows seems to have an equivalent API in VirtualAlloc*. Both the
Windows API and Linux's mmap have an optional address argument, which
(when not NULL) is where the allocation will be placed (some
conditions apply, based on flags and specific API used), so, assuming
we have some control on where to allocate memory, we should be able to
reserve enough memory by using these APIs.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Thu, Nov 28, 2024 at 12:18:54PM GMT, Robert Haas wrote:
>
> All that having been said, what does concern me a bit is our ability
> to predict what Linux will do well enough to keep what we're doing
> safe; and also whether the Linux behavior might abruptly change in the
> future. Users would be sad if we released this feature and then a
> future kernel upgrade causes PostgreSQL to completely stop working. I
> don't know how the Linux kernel developers actually feel about this
> sort of thing, but if I imagine myself as a kernel developer, I can
> totally see myself saying "well, we never promised that this would
> work in any particular way, so we're free to change it whenever we
> like." We've certainly used that argument here countless times.

Agreed, at the moment I can't say for sure how reliable this behavior is
in the long term. I'll try to see if there are ways to get more confidence
about that.



Re: Changing shared_buffers without restart

From
Tom Lane
Date:
Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> On Thu, 28 Nov 2024 at 18:19, Robert Haas <robertmhaas@gmail.com> wrote:
>> [...] It's unclear to me why
>> operating systems don't offer better primitives for this sort of thing
>> -- in theory there could be a system call that sets aside a pool of
>> address space and then other system calls that let you allocate
>> shared/unshared memory within that space or even at specific
>> addresses, but actually such things don't exist.

> Isn't that more a stdlib/malloc issue? AFAIK, Linux's mmap(2) syscall
> allows you to request memory from the OS at arbitrary addresses - it's
> just that stdlib's malloc doens't expose the 'alloc at this address'
> part of that API.

I think what Robert is concerned about is that there is exactly 0
guarantee that that will succeed, because you have no control over
system-driven allocations of address space (for example, loading
of extensions or JIT code).  In fact, given things like ASLR, there
is pressure on the kernel crew to make that *less* predictable not
more so.  So even if we devise a method that seems to work reliably
today, we could have little faith that it would work with next year's
kernels.

            regards, tom lane



Re: Changing shared_buffers without restart

From
Matthias van de Meent
Date:
On Thu, 28 Nov 2024 at 19:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> > On Thu, 28 Nov 2024 at 18:19, Robert Haas <robertmhaas@gmail.com> wrote:
> >> [...] It's unclear to me why
> >> operating systems don't offer better primitives for this sort of thing
> >> -- in theory there could be a system call that sets aside a pool of
> >> address space and then other system calls that let you allocate
> >> shared/unshared memory within that space or even at specific
> >> addresses, but actually such things don't exist.
>
> > Isn't that more a stdlib/malloc issue? AFAIK, Linux's mmap(2) syscall
> > allows you to request memory from the OS at arbitrary addresses - it's
> > just that stdlib's malloc doens't expose the 'alloc at this address'
> > part of that API.
>
> I think what Robert is concerned about is that there is exactly 0
> guarantee that that will succeed, because you have no control over
> system-driven allocations of address space (for example, loading
> of extensions or JIT code).  In fact, given things like ASLR, there
> is pressure on the kernel crew to make that *less* predictable not
> more so.

I see what you mean, but I think that shouldn't be much of an issue.
I'm not a kernel hacker, but I've never heard about anyone arguing to
remove mmap's mapping-overwriting behavior for user-controlled
mappings - it seems too useful as a way to guarantee relative memory
addresses (agreed, there is now mseal(2), but that is the user asking
for security on their own mapping, this isn't applied to arbitrary
mappings).

I mean, we can do the following to get a nice contiguous empty address
space no other mmap(NULL)s will get put into:

    /* reserve size bytes of memory */
    base = mmap(NULL, size, PROT_NONE, ...flags, ...);
    /* use the first small_size bytes of that reservation */
    allocated_in_reserved = mmap(base, small_size,
                                 PROT_READ | PROT_WRITE, MAP_FIXED, ...);

With the PROT_NONE protection option the OS doesn't actually allocate
any backing memory, but guarantees no other mmap(NULL, ...) will get
placed in that area such that it overlaps with that allocation until
the area is munmap-ed, thus allowing us to reserve a chunk of address
space without actually using (much) memory. Deallocations have to go
through mmap(... PROT_NONE, ...) instead of munmap if we'd want to
keep the full area reserved, but I think that's not that much of an
issue.
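
For readers who want to try this, here is a fuller sketch of the same idea
with the elided flags filled in under explicit assumptions: Linux, and
private anonymous memory so the example stays self-contained. A real shared
segment would instead map a shared object, such as the memfd-backed mappings
shown earlier in the thread. reserve_address_space, commit_range and
decommit_range are hypothetical names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_ANONYMOUS / MAP_NORESERVE on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve address space only: PROT_NONE, so nothing is accessible yet. */
void *
reserve_address_space(size_t reserved_bytes)
{
    void   *base = mmap(NULL, reserved_bytes, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return base == MAP_FAILED ? NULL : base;
}

/* Make [base+offset, base+offset+len) usable, replacing the reservation. */
int
commit_range(char *base, size_t offset, size_t len)
{
    void   *p = mmap(base + offset, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    return p == MAP_FAILED ? -1 : 0;
}

/* Give the range back but keep it reserved: map PROT_NONE over it. */
int
decommit_range(char *base, size_t offset, size_t len)
{
    void   *p = mmap(base + offset, len, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED,
                     -1, 0);

    return p == MAP_FAILED ? -1 : 0;
}
```

Growing means committing only the additional tail, so pages that are already
in use keep their contents and their addresses. Shrinking goes through
decommit_range rather than munmap, so the reservation stays intact.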

I also highly doubt Linux will remove or otherwise limit the PROT_NONE
option to such a degree that we won't be able to "balloon" the memory
address space for (e.g.) dynamic shared buffer resizing.

See also: FreeBSD's MAP_GUARD mmap flag, Windows' MEM_RESERVE and
MEM_RESERVE_PLACEHOLDER flags for VirtualAlloc[2][Ex].
See also [0] where PROT_NONE is explicitly called out as a tool for
reserving memory address space.

> So even if we devise a method that seems to work reliably
> today, we could have little faith that it would work with next year's
> kernels.

I really don't think that userspace memory address space reservations
through e.g. PROT_NONE or MEM_RESERVE[_PLACEHOLDER] will be retired
anytime soon, at least not without the relevant kernels also providing
effective alternatives.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://www.gnu.org/software/libc/manual/html_node/Memory-Protection.html



Re: Changing shared_buffers without restart

From
Tom Lane
Date:
Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> I mean, we can do the following to get a nice contiguous empty address
> space no other mmap(NULL)s will get put into:

>     /* reserve size bytes of memory */
>     base = mmap(NULL, size, PROT_NONE, ...flags, ...);
>     /* use the first small_size bytes of that reservation */
>     allocated_in_reserved = mmap(base, small_size, PROT_READ |
> PROT_WRITE, MAP_FIXED, ...);

> With the PROT_NONE protection option the OS doesn't actually allocate
> any backing memory, but guarantees no other mmap(NULL, ...) will get
> placed in that area such that it overlaps with that allocation until
> the area is munmap-ed, thus allowing us to reserve a chunk of address
> space without actually using (much) memory.

Well, that's all great if it works portably.  But I don't see one word
in either POSIX or the Linux mmap(2) man page that promises those
semantics for PROT_NONE.  I also wonder how well a giant chunk of
"unbacked" address space will interoperate with the OOM killer,
top(1)'s display of used memory, and other things that have caused us
headaches with large shared-memory arenas.

Maybe those issues are all in the past and this'll work great.
I'm not holding my breath though.

            regards, tom lane



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Fri, Nov 29, 2024 at 01:56:30AM GMT, Matthias van de Meent wrote:
>
> I mean, we can do the following to get a nice contiguous empty address
> space no other mmap(NULL)s will get put into:
>
>     /* reserve size bytes of memory */
>     base = mmap(NULL, size, PROT_NONE, ...flags, ...);
>     /* use the first small_size bytes of that reservation */
>     allocated_in_reserved = mmap(base, small_size, PROT_READ |
> PROT_WRITE, MAP_FIXED, ...);
>
> With the PROT_NONE protection option the OS doesn't actually allocate
> any backing memory, but guarantees no other mmap(NULL, ...) will get
> placed in that area such that it overlaps with that allocation until
> the area is munmap-ed, thus allowing us to reserve a chunk of address
> space without actually using (much) memory.

From what I understand it's not much different from the scenario where we
just map as much as we want in advance. The actual memory will not be
allocated in either case due to CoW, and oom_score seems to be the same. I
agree it sounds attractive, but after some experimenting it looks like
it won't work with huge pages inside a cgroup v2 (=container).

The reason is Linux has recently learned to apply memory reservation
limits on hugetlb inside a cgroup, which are applied to mmap. Nowadays
this feature is often configured out of the box in various container
orchestrators, meaning that a scenario "set hugetlb=1GB on a container,
reserve 32GB with PROT_NONE" will fail. I've also tried to mix and
match, reserve some address space via non-hugetlb mapping, and allocate
a hugetlb out of it, but it doesn't work either (the smaller mmap
complains about MAP_HUGETLB with EINVAL).



Re: Changing shared_buffers without restart

From
Andres Freund
Date:
Hi,

On 2024-11-28 17:30:32 +0100, Dmitry Dolgov wrote:
> The assumption about picking up a lowest address is just how it works right now
> on Linux, this fact is already used in the patch. The idea that we could put
> upper boundary on the size of other mappings based on total available memory
> comes from the fact that anonymous mappings, that are much larger than memory,
> will fail without overcommit.

The overcommit issue shouldn't be a big hurdle - by mmap()ing with
MAP_NORESERVE the space isn't reserved. Then madvise with MADV_POPULATE_WRITE
can be used to actually populate the used range of the mapping and MADV_REMOVE
can be used to shrink the mapping again.
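
A minimal sketch of that sequence, assuming Linux and a shared anonymous
mapping. The helper names (map_unreserved, populate_range, release_range)
are made up for illustration and are not from the patch. The fallback loop
for systems without MADV_POPULATE_WRITE is also an assumption of this
sketch. Note that MADV_REMOVE releases the backing pages while the address
range itself stays mapped.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map max_size bytes of shared anonymous memory. MAP_NORESERVE asks the
 * kernel not to reserve commit charge for the whole range up front. */
void *
map_unreserved(size_t max_size)
{
    void *p = mmap(NULL, max_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Fault in [addr, addr + len) so the pages are really backed by memory. */
int
populate_range(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile char *p = addr;

    assert(addr != NULL && len > 0);
#ifdef MADV_POPULATE_WRITE
    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
        return 0;
    if (errno != EINVAL)
        return -1;      /* e.g. ENOMEM: the memory is not available */
#endif
    /* Fallback (assumption of this sketch): write each page's first byte
     * back to itself, which faults the page in. Not safe if another
     * process writes to the range at the same time. */
    for (size_t off = 0; off < len; off += (size_t) page)
        p[off] = p[off];
    return 0;
}

/* Give the pages backing [addr, addr + len) back to the kernel. The
 * address range stays mapped and reads as zeroes afterwards. */
int
release_range(void *addr, size_t len)
{
    return madvise(addr, len, MADV_REMOVE);
}
```

Usage would be along the lines of: map_unreserved(max) once at startup,
populate_range(base, used) when growing, and release_range(base + new_used,
used - new_used) when shrinking.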


> With overcommit it becomes different, but if allocations are hitting that
> limit I can imagine there are bigger problems than shared buffer resize.

I'm fairly sure it'll not work to just disregard issues around overcommit. An
overly large memory allocation, without MAP_NORESERVE, will actually reduce
the amount of memory that can be used for other allocations. That's obviously
problematic, because you'll now have a smaller shared buffers, but can't use
the memory for work_mem type allocations...

Greetings,

Andres Freund



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Fri, Nov 29, 2024 at 05:47:27PM GMT, Dmitry Dolgov wrote:
> > On Fri, Nov 29, 2024 at 01:56:30AM GMT, Matthias van de Meent wrote:
> >
> > I mean, we can do the following to get a nice contiguous empty address
> > space no other mmap(NULL)s will get put into:
> >
> >     /* reserve size bytes of memory */
> >     base = mmap(NULL, size, PROT_NONE, ...flags, ...);
> >     /* use the first small_size bytes of that reservation */
> >     allocated_in_reserved = mmap(base, small_size, PROT_READ |
> > PROT_WRITE, MAP_FIXED, ...);
> >
> > With the PROT_NONE protection option the OS doesn't actually allocate
> > any backing memory, but guarantees no other mmap(NULL, ...) will get
> > placed in that area such that it overlaps with that allocation until
> > the area is munmap-ed, thus allowing us to reserve a chunk of address
> > space without actually using (much) memory.
>
> From what I understand it's not much different from the scenario where we
> just map as much as we want in advance. The actual memory will not be
> allocated in either case due to CoW, and oom_score seems to be the same. I
> agree it sounds attractive, but after some experimenting it looks like
> it won't work with huge pages inside a cgroup v2 (=container).
>
> The reason is Linux has recently learned to apply memory reservation
> limits on hugetlb inside a cgroup, which are applied to mmap. Nowadays
> this feature is often configured out of the box in various container
> orchestrators, meaning that a scenario "set hugetlb=1GB on a container,
> reserve 32GB with PROT_NONE" will fail. I've also tried to mix and
> match, reserve some address space via non-hugetlb mapping, and allocate
> a hugetlb out of it, but it doesn't work either (the smaller mmap
> complains about MAP_HUGETLB with EINVAL).

I've asked about that in linux-mm [1]. To my surprise, the
recommendations were to stick to creating a large mapping in advance,
and slice smaller mappings out of that, which could be resized later.
The OOM score should not be affected, and the hugetlb reservation limit
could be avoided by using the MAP_NORESERVE flag for the initial mapping
(I've experimented with that, and it seems to work just fine, even if the
slices are not using MAP_NORESERVE).
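
A rough sketch of the "large mapping in advance, slices carved out of it"
idea, assuming Linux and ordinary (non-hugetlb) anonymous memory. The
function names are hypothetical and this is not the code from the patch.
It only shows a slice being carved and part of it being handed back to the
reservation. Growing a shared slice in place is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* True if value is a multiple of the system page size. */
static int
page_aligned(uintptr_t value)
{
    return value % (uintptr_t) sysconf(_SC_PAGESIZE) == 0;
}

/* Reserve total bytes of address space only: PROT_NONE makes the range
 * inaccessible and MAP_NORESERVE keeps it out of commit accounting. */
void *
reserve_address_space(size_t total)
{
    void *base = mmap(NULL, total, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return base == MAP_FAILED ? NULL : base;
}

/* Carve a usable shared slice out of the reservation. MAP_FIXED replaces
 * the PROT_NONE pages in that range. The slice itself does not pass
 * MAP_NORESERVE. */
void *
carve_slice(void *base, size_t offset, size_t size)
{
    void *p;

    assert(page_aligned(offset) && page_aligned(size));
    p = mmap((char *) base + offset, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Turn [addr, addr + size) back into plain reservation. This is a single
 * MAP_FIXED call, so the range is never left unmapped for an unrelated
 * mmap(NULL, ...) to claim. */
int
release_to_reservation(void *addr, size_t size)
{
    void *p;

    assert(page_aligned((uintptr_t) addr) && page_aligned(size));
    p = mmap(addr, size, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? -1 : 0;
}
```

Shrinking a slice then amounts to calling release_to_reservation() on its
tail. The head of the slice keeps its address and its contents.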

I guess that would mean I'll try to experiment with this approach as
well. But what do others think? How much research do we need to do to gain
some confidence about large shared mappings and make it realistically
acceptable?

[1]: https://lore.kernel.org/linux-mm/pr7zggtdgjqjwyrfqzusih2suofszxvlfxdptbo2smneixkp7i@nrmtbhemy3is/t/



Re: Changing shared_buffers without restart

From
Robert Haas
Date:
On Mon, Dec 2, 2024 at 2:18 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I've asked about that in linux-mm [1]. To my surprise, the
> recommendations were to stick to creating a large mapping in advance,
> and slice smaller mappings out of that, which could be resized later.
> The OOM score should not be affected, and hugetlb could be avoided using
> MAP_NORESERVE flag for the initial mapping (I've experimented with that,
> seems to be working just fine, even if the slices are not using
> MAP_NORESERVE).
>
> I guess that would mean I'll try to experiment with this approach as
> well. But what do others think? How much research do we need to do to gain
> some confidence about large shared mappings and make it realistically
> acceptable?

Personally, I like this approach. It seems to me that this opens up
the possibility of a system where the virtual addresses of data
structures in shared memory never change, which I think will avoid an
absolutely massive amount of implementation complexity. It's obviously
not ideal that we have to specify in advance an upper limit on the
potential size of shared_buffers, but we can live with it. It's better
than what we have today; and certainly cloud providers will have no
issue with pre-setting that to a reasonable value. I don't know if we
can port it to other operating systems, but it seems at least possible
that they offer similar primitives, or will in the future; if not, we
can disable the feature on those platforms.

I still think the synchronization is going to be tricky. For example
when you go to shrink a mapping, you need to make sure that it's free
of buffers that anyone might touch; and when you grow a mapping, you
need to make sure that nobody tries to touch that address space before
they grow the mapping, which goes back to my earlier point about
someone doing a lookup into the buffer mapping table and finding a
buffer number that is beyond the end of what they've already mapped.
But I think it may be doable with sufficient cleverness.
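
To make the growing hazard concrete, here is a toy sketch of the check a
backend would need before touching a buffer number it found in the mapping
table. Everything in it (the struct names, absorb_resize, and
checked_buffer_id) is hypothetical and not taken from the patch. It
ignores shrinking entirely, and absorb_resize only updates a counter where
real code would have to extend the backend's mapping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state: the current number of buffers. */
typedef struct
{
    atomic_int  nbuffers;
} SharedResizeState;

/* Hypothetical per-backend state: how many buffers this backend has
 * actually mapped so far. */
typedef struct
{
    int         mapped_nbuffers;
} BackendLocalState;

/* True if buf_id lies inside what this backend has already mapped. */
bool
buffer_id_is_mapped(const BackendLocalState *local, int buf_id)
{
    return buf_id >= 0 && buf_id < local->mapped_nbuffers;
}

/* Absorb a pending resize. Real code would grow the mapping here; the
 * sketch only records the new size. */
void
absorb_resize(BackendLocalState *local, SharedResizeState *shared)
{
    assert(local != NULL && shared != NULL);
    local->mapped_nbuffers = atomic_load(&shared->nbuffers);
}

/* Return buf_id if it is safe to use, absorbing a resize first when the
 * id is beyond the local mapping. Returns -1 if the id is still out of
 * range after absorbing. */
int
checked_buffer_id(BackendLocalState *local, SharedResizeState *shared,
                  int buf_id)
{
    if (!buffer_id_is_mapped(local, buf_id))
        absorb_resize(local, shared);
    return buffer_id_is_mapped(local, buf_id) ? buf_id : -1;
}
```

With the earlier example of going from a million to 2 million buffers, a
backend that still has a million mapped and looks up buffer 1500000 would
absorb the change first and only then use the buffer.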

--
Robert Haas
EDB: http://www.enterprisedb.com