Re: Changing shared_buffers without restart - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Changing shared_buffers without restart |
Date | |
Msg-id | CA+hUKGJ-RfwSe3=ZS2HRV9rvgrZTJJButfE8Kh5C6Ta2Eb+mPQ@mail.gmail.com |
In response to | Re: Changing shared_buffers without restart (Peter Eisentraut <peter@eisentraut.org>) |
Responses | Re: Changing shared_buffers without restart |
List | pgsql-hackers |
On Thu, Nov 21, 2024 at 8:55 PM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 19.11.24 14:29, Dmitry Dolgov wrote:
> >> I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear how
> >> that interacts with the MAP_HUGETLB flag for mmap(). Do you need to specify
> >> both of them if you want huge pages?
> > Correct, both (one flag in memfd_create and one for mmap) are needed to
> > use huge pages.
>
> I was worried because the FreeBSD man page says
>
> MFD_HUGETLB This flag is currently unsupported.
>
> It looks like FreeBSD doesn't have MAP_HUGETLB, so maybe this is irrelevant.
>
> But you should make sure in your patch that the right set of flags for
> huge pages is passed.

MFD_HUGETLB does actually work on FreeBSD, but the man page doesn't
admit it (guessing an oversight, not sure, will see). And you don't
need the corresponding (non-existent) mmap flag. You also have to
specify a size eg MFD_HUGETLB | MFD_HUGE_2MB or you get ENOTSUPP, but
other than that quirk I see it definitely working with eg procstat -v.
That might be because FreeBSD doesn't have a default huge page size
concept? On Linux that's a boot time setting, I guess rarely changed.
I contemplated that once before, when I wrote a quick demo patch[1] to
implement huge_pages=on for FreeBSD (ie explicit rather than
transparent). I used a different function, not the Linuxoid one but
it's the same under the covers, and I wrote:

+ /*
+  * Find the matching page size index, or if huge_page_size wasn't set,
+  * then skip the smallest size and take the next one after that.
+  */

Swapping that topic back in, I was left wondering: (1) how to choose
between SHM_LARGEPAGE_ALLOC_DEFAULT, a policy that will cause
ftruncate() to try to defragment physical memory to fulfil your
request and can eat some serious CPU, and SHM_LARGEPAGE_ALLOC_NOWAIT,
and (2) if it's the second thing, well Linux is like that in respect
of failing fast, but for it to succeed you have to configure
nr_hugepages in the OS as a separate administrative step and *that's*
when it does any defragmentation required, and that's another concept
FreeBSD doesn't have. It's a bit of a weird concept too, I mean those
pages are not reserved for you in any way and anyone could nab them,
which is undeniably practical but it lacks a few qualities one might
hope for in a kernel facility... IDK.

Anyway, the Linux-like memfd_create() always does it the _DEFAULT way.
Either way, we can't have identical "try" semantics: it'll actually
put some effort into trying, perhaps burning many seconds of CPU.

I took a peek at what we're doing for Windows and the man pages tell
me that it's like that too. I don't recall hearing any complaints
about that, but it's gated on a Windows permission that I assume very
few enabled, so "try" probably isn't trying for most systems.
Quoting:

"Large-page memory regions may be difficult to obtain after the system
has been running for a long time because the physical space for each
large page must be contiguous, but the memory may have become
fragmented. Allocating large pages under these conditions can
significantly affect system performance. Therefore, applications
should avoid making repeated large-page allocations and instead
allocate all large pages one time, at startup."

For Windows we also interpret "on" with GetLargePageMinimum(), which
sounds like my "second known page size" idea.
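The "second known page size" rule from the quoted comment can be sketched
roughly as follows. This is an illustration only, not code from the demo
patch: choose_huge_page_index() is a hypothetical name, and it assumes the
caller passes an array of supported page sizes sorted in ascending order
(for example one filled in by FreeBSD's getpagesizes()), with 0 standing
for "huge_page_size wasn't set". Rejecting an explicit request that equals
the base page size is likewise a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: choose an index into an ascending array of
 * supported page sizes.
 *
 * requested == 0 means huge_page_size wasn't set.  Returns the chosen
 * index, or -1 if no suitable huge page size is available.
 */
int
choose_huge_page_index(const size_t *sizes, int nsizes, size_t requested)
{
    assert(sizes != NULL && nsizes >= 0);

    if (requested != 0)
    {
        /*
         * Explicit setting: look for an exact match.  Index 0 is the
         * base page size (assumption of this sketch), so start at 1.
         */
        for (int i = 1; i < nsizes; i++)
        {
            if (sizes[i] == requested)
                return i;
        }
        return -1;
    }

    /* No setting: skip the smallest size and take the next one. */
    return (nsizes >= 2) ? 1 : -1;
}
```

With sizes of 4kB, 2MB and 1GB, an unset huge_page_size would select the
2MB entry under this rule.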
To make Windows do the thing that this thread wants, I found a thread
saying that calling VirtualAlloc(..., MEM_RESET) and then convincing
every process to call VirtualUnlock(...) might work:

https://groups.google.com/g/microsoft.public.win32.programmer.kernel/c/3SvznY38SSc/m/4Sx_xwon1vsJ

I'm not sure what to do about the other Unixen. One option is
nothing, no feature, patches welcome. Another is to use
shm_open(<made up name>), like DSM segments, except we never need to
reopen these ones so we could immediately call shm_unlink() to leave
only a very short window to crash and leak a name. It'd be low risk
name pollution in a name space that POSIX forgot to provide any way to
list. The other idea is non-standard madvise tricks but they seem far
too squishy to be part of a "portable" fallback if they even work at
all, so it might be better not to have the feature than that I think.
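For what it's worth, the shm_open()-then-shm_unlink() idea could look
something like the sketch below. Everything named here is hypothetical:
create_unnamed_shm(), the "/example.<pid>.<n>" naming scheme and the retry
limit are made up for illustration and are not how DSM segments are named.
It only shows the shape of the approach, with the crash window limited to
the gap between the two calls.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a POSIX shared memory object under a
 * made-up name, unlink the name right away, and return the descriptor.
 * Returns -1 on failure with errno set.
 */
int
create_unnamed_shm(off_t size)
{
    char        name[64];
    int         fd = -1;

    assert(size > 0);

    /* Try another suffix if the made-up name is already taken. */
    for (int attempt = 0; attempt < 100; attempt++)
    {
        snprintf(name, sizeof(name), "/example.%ld.%d",
                 (long) getpid(), attempt);
        fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd >= 0 || errno != EEXIST)
            break;
    }
    if (fd < 0)
        return -1;

    /* A crash before this call would leak the name; after it, nothing. */
    shm_unlink(name);

    /* Size the object; the descriptor keeps it alive without a name. */
    if (ftruncate(fd, size) < 0)
    {
        int         save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }
    return fd;
}
```

The descriptor can then be passed to mmap() with MAP_SHARED, and resized
later with ftruncate(), without the name ever being looked up again.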