Thread: Re: Changing shared_buffers without restart
> On Tue, Nov 19, 2024 at 01:57:00PM GMT, Peter Eisentraut wrote: > On 18.10.24 21:21, Dmitry Dolgov wrote: > > v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch > > > > Preparation, introduces the possibility to work with many shmem mappings. To > > make it less invasive, I've duplicated the shmem API to extend it with the > > shmem_slot argument, while redirecting the original API to it. There are > > probably better ways of doing that, I'm open for suggestions. > > After studying this a bit, I tend to think you should just change the > existing APIs in place. So for example, > > void *ShmemAlloc(Size size); > > becomes > > void *ShmemAlloc(int shmem_slot, Size size); > > There aren't that many callers, and all these duplicated interfaces almost > add more new code than they save. > > It might be worth making exceptions for interfaces that are likely to be > used by extensions. For example, I see pg_stat_statements using > ShmemInitStruct() and ShmemInitHash(). But that seems to be it. Are there > any other examples out there? Maybe there are many more that I don't see > right now. But at least for the initialization functions, it doesn't seem > worth it to preserve the existing interfaces exactly. > > In any case, I think the slot number should be the first argument. This > matches how MemoryContextAlloc() or also talloc() work. Yeah, agree. I'll reshape this part, thanks. > (Now here is an idea: Could these just be memory contexts? Instead of > making six shared memory slots, could you make six memory contexts with a > special shared memory type. And ShmemAlloc becomes the allocation function, > etc.?) Sound interesting. I don't know how good the memory context interface would fit here, but I'll do some investigation. > I noticed the existing code made inconsistent use of PGShmemHeader * vs. > void *, which also bled into your patch. I made the attached little patch > to clean that up a bit. Right, it was bothering me the whole time, but not strong enough to make me fix this in the PoC just yet. > I suggest splitting the struct ShmemSegment into one struct for the three > memory addresses and a separate array just for the slock_t's. The former > struct can then stay private in storage/ipc/shmem.c, only the locks need to > be exported. > > Maybe rename ANON_MAPPINGS to something like NUM_ANON_MAPPINGS. > > Also, maybe some of this should be declared in storage/shmem.h rather than > in storage/pg_shmem.h. We have the existing ShmemLock in there, so it would > be a bit confusing to have the per-segment locks elsewhere. > > [...] > > I'm wondering about this change: > > -#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE) > +#define PG_MMAP_FLAGS (MAP_SHARED|MAP_HASSEMAPHORE) > > It looks like this would affect all mmap() calls, not only the one you're > changing. But that's the only one that uses this macro! I don't understand > why we need this; I don't see anything in the commit log about this ever > being used for any portability. I think we should just get rid of it and > have mmap() use the right flags directly. > > I see that FreeBSD has a memfd_create() function. Might be worth a try. > Obviously, this whole thing needs a configure test for memfd_create() > anyway. Yep, those points make sense to me. > > v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch > > > > Allows an anonyous file to back a shared mapping. This makes certain things > > easier, e.g. mappings visual representation, and gives an fd for possible > > future customizations. > > I think this could be a useful patch just by itself, without the rest of the > series, because of > > > * By default, Linux will not add file-backed shared mappings into a > > core dump, making it more convenient to work with them in PostgreSQL: > > no more huge dumps to process. > > This could be significant operational benefit. > > When you say "by default", is this adjustable? Does someone actually want > the whole shared memory in their core file? (If it's adjustable, is it also > adjustable for anonymous mappings?) Yes, there is /proc/<pid>/coredump_filter [1], that allows to specify what to include. One can ask to exclude anon, file-backed and hugetlb shared memory, with the only caveat that it's per process. I guess normally no one wants to have a full shared memory in the coredump, but there could be exceptions. > I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear how > that interacts with the MAP_HUGETLB flag for mmap(). Do you need to specify > both of them if you want huge pages? Correct, both (one flag in memfd_create and one for mmap) are needed to use huge pages. [1]: https://www.kernel.org/doc/html/latest/filesystems/proc.html#proc-pid-coredump-filter-core-dump-filtering-settings
On 19.11.24 14:29, Dmitry Dolgov wrote: >> I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear how >> that interacts with the MAP_HUGETLB flag for mmap(). Do you need to specify >> both of them if you want huge pages? > Correct, both (one flag in memfd_create and one for mmap) are needed to > use huge pages. I was worried because the FreeBSD man page says MFD_HUGETLB This flag is currently unsupported. It looks like FreeBSD doesn't have MAP_HUGETLB, so maybe this is irrelevant. But you should make sure in your patch that the right set of flags for huge pages is passed.
On 19.11.24 14:29, Dmitry Dolgov wrote: >> I noticed the existing code made inconsistent use of PGShmemHeader * vs. >> void *, which also bled into your patch. I made the attached little patch >> to clean that up a bit. > Right, it was bothering me the whole time, but not strong enough to make > me fix this in the PoC just yet. I committed a bit of this, so check that when you're rebasing your patch set.
On Thu, Nov 21, 2024 at 8:55 PM Peter Eisentraut <peter@eisentraut.org> wrote: > On 19.11.24 14:29, Dmitry Dolgov wrote: > >> I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear how > >> that interacts with the MAP_HUGETLB flag for mmap(). Do you need to specify > >> both of them if you want huge pages? > > Correct, both (one flag in memfd_create and one for mmap) are needed to > > use huge pages. > > I was worried because the FreeBSD man page says > > MFD_HUGETLB This flag is currently unsupported. > > It looks like FreeBSD doesn't have MAP_HUGETLB, so maybe this is irrelevant. > > But you should make sure in your patch that the right set of flags for > huge pages is passed. MFD_HUGETLB does actually work on FreeBSD, but the man page doesn't admit it (guessing an oversight, not sure, will see). And you don't need the corresponding (non-existent) mmap flag. You also have to specify a size eg MFD_HUGETLB | MFD_HUGE_2MB or you get ENOTSUPP, but other than that quirk I see it definitely working with eg procstat -v. That might be because FreeBSD doesn't have a default huge page size concept? On Linux that's a boot time setting, I guess rarely changed. I contemplated that once before, when I wrote a quick demo patch[1] to implement huge_pages=on for FreeBSD (ie explicit rather than transparent). I used a different function, not the Linuxoid one but it's the same under the covers, and I wrote: + /* + * Find the matching page size index, or if huge_page_size wasn't set, + * then skip the smallest size and take the next one after that. + */ Swapping that topic back in, I was left wondering: (1) how to choose between SHM_LARGEPAGE_ALLOC_DEFAULT, a policy that will cause ftruncate() to try to defragment physical memory to fulfil your request and can eat some serious CPU, and SHM_LARGEPAGE_ALLOC_NOWAIT, and (2) if it's the second thing, well Linux is like that in respect of failing fast, but for it to succeed you have to configure nr_hugepages in the OS as a separate administrative step and *that's* when it does any defragmentation required, and that's another concept FreeBSD doesn't have. It's a bit of a weird concept too, I mean those pages are not reserved for you in any way and anyone could nab them, which is undeniably practical but it lacks a few qualities one might hope for in a kernel facility... IDK. Anyway, the Linux-like memfd_create() always does it the _DEFAULT way. EIther way, we can't have identical "try" semantics: it'll actually put some effort into trying, perhaps burning many seconds of CPU. I took a peek at what we're doing for Windows and the man pages tell me that it's like that too. I don't recall hearing any complaints about that, but it's gated on a Windows permission that I assume very few enabled, so "try" probably isn't trying for most systems. Quoting: "Large-page memory regions may be difficult to obtain after the system has been running for a long time because the physical space for each large page must be contiguous, but the memory may have become fragmented. Allocating large pages under these conditions can significantly affect system performance. Therefore, applications should avoid making repeated large-page allocations and instead allocate all large pages one time, at startup." For Windows we also interpret "on" with GetLargePageMinimum(), which sounds like my "second known page size" idea. To make Windows do the thing that this thread wants, I found a thread saying that calling VirtualAlloc(..., MEM_RESET) and then convincing every process to call VirtualUnlock(...) might work: https://groups.google.com/g/microsoft.public.win32.programmer.kernel/c/3SvznY38SSc/m/4Sx_xwon1vsJ I'm not sure what to do about the other Unixen. One option is nothing, no feature, patches welcome. Another is to use shm_open(<made up name>), like DSM segments, except we never need to reopen these ones so we could immediately call shm_unlink() to leave only a very short window to crash and leak a name. It'd be low risk name pollution in a name space that POSIX forgot to provide any way to list. The other idea is non-standard madvise tricks but they seem far too squishy to be part of a "portable" fallback if they even work at all, so it might be better not to have the feature than that I think.
On Fri, Apr 18, 2025 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote: > I contemplated that once before, when I wrote a quick demo patch[1] to > implement huge_pages=on for FreeBSD (ie explicit rather than > transparent). I used a different function, not the Linuxoid one but Oops, I forgot to supply that link[1]. And by the way all that technical mumbo jumbo about FreeBSD was just me writing up why I didn't pull the trigger and add explicit huge_pages support for it. The short version is: you shouldn't try to use that flag at all on FreeBSD yet, as it's a separate research project to add that feature. I care about PostgreSQL/FreeBSD personally and may consider that again as I learn more about virtual memory topics, but actually its transparent super pages seem to do a pretty decent job already and people don't seem to want to turn them off. For an actionable plan that should be portable everywhere, how about this: use shm_open(<tempname>, O_CREAT | O_EXCL, S_IRUSR | S_IWUSR) followed by shm_unlink(<tempname>) to make this work on every Unix (FreeBSD could use its slightly better SHM_ANON as the name and skip the unlink), and redirect to memfd inside #ifdef __linux__. One thing to consider is that shm_open() descriptors are implicitly set to FD_CLOEXEC per POSIX, so I think you need to clear that flag with fcntl() in EXEC_BACKEND builds, and then also set it again in children so that they don't pass the descriptor to subprograms they run with system() etc. memfd_create() needs the same consideration, except its default is the other way: I think you need to supply the MFD_CLOEXEC flag explicitly, unless it's an EXEC_BACKEND build, and use the same fnctl() to clear it in children if it is. To restate that the other way around, in non-EXEC_BACKEND builds shm_open() already does the right thing and memfd_create() needs MFD_CLOEXEC, with no extra steps after that. The only systems I'm aware of that *don't* have shm_open() are (1) Android, but it's Linux so I assume it has memfd_create() (just for fun: you can run PostgreSQL on a phone with termux[2], and you can see that their package supplies a fake shm_open() that redirects to plain open(); I guess didn't realise they could have supplied an ENOSYS dummy and just set dynamic_shared_memory_type=mmap instead, and we'd have done that for them!), and (2) the capability-based research OS projects like Capsicum (and probably the others like it) that rip out all the global namespace Unix APIs for approximately the same reason as Android (PostgreSQL can't run under those yet, but just for fun: I had PostgreSQL mostly working under Capsicum once, and noticed that the problems to be solved had significant overlap with the multithreading project: the global namespace stuff like signals/PIDs and onymous IPC go away, and the only other major thing is absolute paths, many of which are easily made relative to a pgdata fd and handled with openat() in fd.c, but I digress...). [1] https://www.postgresql.org/message-id/CA%2BhUKGLmBWHF6gusP55R7jVS1%3D6T%3DGphbZpUXiOgMMHDUkVCgw%40mail.gmail.com [2] https://github.com/termux/termux-packages/tree/master/packages/postgresql