Thread: Re: Changing shared_buffers without restart

Re: Changing shared_buffers without restart

From

Dmitry Dolgov

Date:

19 November 2024, 16:29:12

> On Tue, Nov 19, 2024 at 01:57:00PM GMT, Peter Eisentraut wrote:
> On 18.10.24 21:21, Dmitry Dolgov wrote:
> > v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch
> >
> > Preparation, introduces the possibility to work with many shmem mappings. To
> > make it less invasive, I've duplicated the shmem API to extend it with the
> > shmem_slot argument, while redirecting the original API to it. There are
> > probably better ways of doing that, I'm open for suggestions.
>
> After studying this a bit, I tend to think you should just change the
> existing APIs in place.  So for example,
>
> void *ShmemAlloc(Size size);
>
> becomes
>
> void *ShmemAlloc(int shmem_slot, Size size);
>
> There aren't that many callers, and all these duplicated interfaces almost
> add more new code than they save.
>
> It might be worth making exceptions for interfaces that are likely to be
> used by extensions.  For example, I see pg_stat_statements using
> ShmemInitStruct() and ShmemInitHash().  But that seems to be it.  Are there
> any other examples out there?  Maybe there are many more that I don't see
> right now.  But at least for the initialization functions, it doesn't seem
> worth it to preserve the existing interfaces exactly.
>
> In any case, I think the slot number should be the first argument.  This
> matches how MemoryContextAlloc() or also talloc() work.

Yeah, agree. I'll reshape this part, thanks.

> (Now here is an idea:  Could these just be memory contexts?  Instead of
> making six shared memory slots, could you make six memory contexts with a
> special shared memory type.  And ShmemAlloc becomes the allocation function,
> etc.?)

Sound interesting. I don't know how good the memory context interface
would fit here, but I'll do some investigation.

> I noticed the existing code made inconsistent use of PGShmemHeader * vs.
> void *, which also bled into your patch.  I made the attached little patch
> to clean that up a bit.

Right, it was bothering me the whole time, but not strong enough to make
me fix this in the PoC just yet.

> I suggest splitting the struct ShmemSegment into one struct for the three
> memory addresses and a separate array just for the slock_t's.  The former
> struct can then stay private in storage/ipc/shmem.c, only the locks need to
> be exported.
>
> Maybe rename ANON_MAPPINGS to something like NUM_ANON_MAPPINGS.
>
> Also, maybe some of this should be declared in storage/shmem.h rather than
> in storage/pg_shmem.h.  We have the existing ShmemLock in there, so it would
> be a bit confusing to have the per-segment locks elsewhere.
>
> [...]
>
> I'm wondering about this change:
>
> -#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
> +#define PG_MMAP_FLAGS                  (MAP_SHARED|MAP_HASSEMAPHORE)
>
> It looks like this would affect all mmap() calls, not only the one you're
> changing.  But that's the only one that uses this macro!  I don't understand
> why we need this; I don't see anything in the commit log about this ever
> being used for any portability.  I think we should just get rid of it and
> have mmap() use the right flags directly.
>
> I see that FreeBSD has a memfd_create() function.  Might be worth a try.
> Obviously, this whole thing needs a configure test for memfd_create()
> anyway.

Yep, those points make sense to me.

> > v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch
> >
> > Allows an anonyous file to back a shared mapping. This makes certain things
> > easier, e.g. mappings visual representation, and gives an fd for possible
> > future customizations.
>
> I think this could be a useful patch just by itself, without the rest of the
> series, because of
>
> > * By default, Linux will not add file-backed shared mappings into a
> > core dump, making it more convenient to work with them in PostgreSQL:
> > no more huge dumps to process.
>
> This could be significant operational benefit.
>
> When you say "by default", is this adjustable?  Does someone actually want
> the whole shared memory in their core file?  (If it's adjustable, is it also
> adjustable for anonymous mappings?)

Yes, there is /proc/<pid>/coredump_filter [1], that allows to specify
what to include. One can ask to exclude anon, file-backed and hugetlb
shared memory, with the only caveat that it's per process. I guess
normally no one wants to have a full shared memory in the coredump, but
there could be exceptions.

> I see that memfd_create() has a MFD_HUGETLB flag.  It's not very clear how
> that interacts with the MAP_HUGETLB flag for mmap().  Do you need to specify
> both of them if you want huge pages?

Correct, both (one flag in memfd_create and one for mmap) are needed to
use huge pages.

[1]:
https://www.kernel.org/doc/html/latest/filesystems/proc.html#proc-pid-coredump-filter-core-dump-filtering-settings

Re: Changing shared_buffers without restart

From

Peter Eisentraut

Date:

21 November 2024, 10:55:40

On 19.11.24 14:29, Dmitry Dolgov wrote:
>> I see that memfd_create() has a MFD_HUGETLB flag.  It's not very clear how
>> that interacts with the MAP_HUGETLB flag for mmap().  Do you need to specify
>> both of them if you want huge pages?
> Correct, both (one flag in memfd_create and one for mmap) are needed to
> use huge pages.

I was worried because the FreeBSD man page says

MFD_HUGETLB      This flag is currently unsupported.

It looks like FreeBSD doesn't have MAP_HUGETLB, so maybe this is irrelevant.

But you should make sure in your patch that the right set of flags for 
huge pages is passed.

Re: Changing shared_buffers without restart

From

Peter Eisentraut

Date:

26 November 2024, 10:53:39

On 19.11.24 14:29, Dmitry Dolgov wrote:
>> I noticed the existing code made inconsistent use of PGShmemHeader * vs.
>> void *, which also bled into your patch.  I made the attached little patch
>> to clean that up a bit.
> Right, it was bothering me the whole time, but not strong enough to make
> me fix this in the PoC just yet.

I committed a bit of this, so check that when you're rebasing your patch 
set.

Re: Changing shared_buffers without restart

From

Thomas Munro

Date:

17 April, 18:54:31

On Thu, Nov 21, 2024 at 8:55 PM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 19.11.24 14:29, Dmitry Dolgov wrote:
> >> I see that memfd_create() has a MFD_HUGETLB flag.  It's not very clear how
> >> that interacts with the MAP_HUGETLB flag for mmap().  Do you need to specify
> >> both of them if you want huge pages?
> > Correct, both (one flag in memfd_create and one for mmap) are needed to
> > use huge pages.
>
> I was worried because the FreeBSD man page says
>
> MFD_HUGETLB       This flag is currently unsupported.
>
> It looks like FreeBSD doesn't have MAP_HUGETLB, so maybe this is irrelevant.
>
> But you should make sure in your patch that the right set of flags for
> huge pages is passed.

MFD_HUGETLB does actually work on FreeBSD, but the man page doesn't
admit it (guessing an oversight, not sure, will see).  And you don't
need the corresponding (non-existent) mmap flag.  You also have to
specify a size eg MFD_HUGETLB | MFD_HUGE_2MB or you get ENOTSUPP, but
other than that quirk I see it definitely working with eg procstat -v.
That might be because FreeBSD doesn't have a default huge page size
concept?  On Linux that's a boot time setting, I guess rarely changed.
I contemplated that once before, when I wrote a quick demo patch[1] to
implement huge_pages=on for FreeBSD (ie explicit rather than
transparent).  I used a different function, not the Linuxoid one but
it's the same under the covers, and I wrote:

+ /*
+ * Find the matching page size index, or if huge_page_size wasn't set,
+ * then skip the smallest size and take the next one after that.
+ */

Swapping that topic back in, I was left wondering: (1) how to choose
between SHM_LARGEPAGE_ALLOC_DEFAULT, a policy that will cause
ftruncate() to try to defragment physical memory to fulfil your
request and can eat some serious CPU, and SHM_LARGEPAGE_ALLOC_NOWAIT,
and (2) if it's the second thing, well Linux is like that in respect
of failing fast, but for it to succeed you have to configure
nr_hugepages in the OS as a separate administrative step and *that's*
when it does any defragmentation required, and that's another concept
FreeBSD doesn't have.  It's a bit of a weird concept too, I mean those
pages are not reserved for you in any way and anyone could nab them,
which is undeniably practical but it lacks a few qualities one might
hope for in a kernel facility...  IDK.  Anyway, the Linux-like
memfd_create() always does it the _DEFAULT way.  EIther way, we can't
have identical "try" semantics: it'll actually put some effort into
trying, perhaps burning many seconds of CPU.

I took a peek at what we're doing for Windows and the man pages tell
me that it's like that too.  I don't recall hearing any complaints
about that, but it's gated on a Windows permission that I assume very
few enabled, so "try" probably isn't trying for most systems.
Quoting:

"Large-page memory regions may be difficult to obtain after the system
has been running for a long time because the physical space for each
large page must be contiguous, but the memory may have become
fragmented. Allocating large pages under these conditions can
significantly affect system performance. Therefore, applications
should avoid making repeated large-page allocations and instead
allocate all large pages one time, at startup."

For Windows we also interpret "on" with GetLargePageMinimum(), which
sounds like my "second known page size" idea.

To make Windows do the thing that this thread wants, I found a thread
saying that calling VirtualAlloc(..., MEM_RESET) and then convincing
every process to call VirtualUnlock(...) might work:

https://groups.google.com/g/microsoft.public.win32.programmer.kernel/c/3SvznY38SSc/m/4Sx_xwon1vsJ

I'm not sure what to do about the other Unixen.  One option is
nothing, no feature, patches welcome.  Another is to use
shm_open(<made up name>), like DSM segments, except we never need to
reopen these ones so we could immediately call shm_unlink() to leave
only a very short window to crash and leak a name.  It'd be low risk
name pollution in a name space that POSIX forgot to provide any way to
list.  The other idea is  non-standard madvise tricks but they seem
far too squishy to be part of a "portable" fallback if they even work
at all, so it might be better not to have the feature than that I
think.

Re: Changing shared_buffers without restart

From

Thomas Munro

Date:

18 April, 04:27:35

On Fri, Apr 18, 2025 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I contemplated that once before, when I wrote a quick demo patch[1] to
> implement huge_pages=on for FreeBSD (ie explicit rather than
> transparent).  I used a different function, not the Linuxoid one but

Oops, I forgot to supply that link[1].  And by the way all that
technical mumbo jumbo about FreeBSD was just me writing up why I
didn't pull the trigger and add explicit huge_pages support for it.
The short version is: you shouldn't try to use that flag at all on
FreeBSD yet, as it's a separate research project to add that feature.
I care about PostgreSQL/FreeBSD personally and may consider that again
as I learn more about virtual memory topics, but actually its
transparent super pages seem to do a pretty decent job already and
people don't seem to want to turn them off.

For an actionable plan that should be portable everywhere, how about
this: use shm_open(<tempname>, O_CREAT | O_EXCL, S_IRUSR | S_IWUSR)
followed by shm_unlink(<tempname>) to make this work on every Unix
(FreeBSD could use its slightly better SHM_ANON as the name and skip
the unlink), and redirect to memfd inside #ifdef __linux__.  One thing
to consider is that shm_open() descriptors are implicitly set to
FD_CLOEXEC per POSIX, so I think you need to clear that flag with
fcntl() in EXEC_BACKEND builds, and then also set it again in children
so that they don't pass the descriptor to subprograms they run with
system() etc.  memfd_create() needs the same consideration, except its
default is the other way: I think you need to supply the MFD_CLOEXEC
flag explicitly, unless it's an EXEC_BACKEND build, and use the same
fnctl() to clear it in children if it is.  To restate that the other
way around, in non-EXEC_BACKEND builds shm_open() already does the
right thing and memfd_create() needs MFD_CLOEXEC, with no extra steps
after that.

The only systems I'm aware of that *don't* have shm_open() are (1)
Android, but it's Linux so I assume it has memfd_create() (just for
fun: you can run PostgreSQL on a phone with termux[2], and you can see
that their package supplies a fake shm_open() that redirects to plain
open(); I guess didn't realise they could have supplied an ENOSYS
dummy and just set dynamic_shared_memory_type=mmap instead, and we'd
have done that for them!), and (2) the capability-based research OS
projects like Capsicum (and probably the others like it) that rip out
all the global namespace Unix APIs for approximately the same reason
as Android (PostgreSQL can't run under those yet, but just for fun: I
had PostgreSQL mostly working under Capsicum once, and noticed that
the problems to be solved had significant overlap with the
multithreading project: the global namespace stuff like signals/PIDs
and onymous IPC go away, and the only other major thing is absolute
paths, many of which are easily made relative to a pgdata fd and
handled with openat() in fd.c, but I digress...).

[1] https://www.postgresql.org/message-id/CA%2BhUKGLmBWHF6gusP55R7jVS1%3D6T%3DGphbZpUXiOgMMHDUkVCgw%40mail.gmail.com
[2] https://github.com/termux/termux-packages/tree/master/packages/postgresql