Re: Changing shared_buffers without restart - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Changing shared_buffers without restart |
Date | |
Msg-id | CA+hUKGLQhsZ1dEf5Zo6JuPbs6n-qX=cTGy49feKf1iFA_TBP1g@mail.gmail.com |
In response to | Re: Changing shared_buffers without restart (Dmitry Dolgov <9erthalion6@gmail.com>) |
List | pgsql-hackers |
On Mon, Apr 21, 2025 at 9:30 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Yeah, that would work and will allow to avoid MAP_FIXED and mremap, which are
> questionable from portability point of view. This leaves memfd_create, and I'm
> still not completely clear on its portability -- it seems to be specific to
> Linux, but others provide compatible implementation as well.

Something like this should work, roughly based on DSM code except here
we don't really need the name so we unlink it immediately, at the
slight risk of leaking it if the postmaster is killed between those
lines (maybe someone should go and tell POSIX to support the special
name SHM_ANON or some other way to avoid that; I can't see any
portable workaround).  Not tested/compiled, just a sketch:

#ifdef HAVE_MEMFD_CREATE
    /* Anonymous shared memory region. */
    fd = memfd_create("foo", MFD_CLOEXEC | huge_pages_flags);
#else
    /* Standard POSIX insists on a name, which we unlink immediately. */
    do
    {
        char        tmp[80];

        /* POSIX wants a leading slash in the name for portability. */
        snprintf(tmp, sizeof(tmp), "/PostgreSQL.%u",
                 pg_prng_uint32(&pg_global_prng_state));
        /* O_CREAT requires a mode argument; we also need write access. */
        fd = shm_open(tmp, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd >= 0)
            shm_unlink(tmp);
    } while (fd < 0 && errno == EEXIST);
#endif

> Let me experiment with this idea a bit, I would like to make sure there are no
> other limitations we might face.

One thing I'm still wondering about is whether you really need all
this multi-phase barrier stuff, or even need to stop other backends
from running at all while doing the resize.  I guess that's related to
your remapping scheme, but supposing you find the simple
ftruncate()-only approach to be good, my next question is: why isn't
it enough to wait for all backends to agree to stop allocating new
buffers in the range to be truncated, and then let them continue to
run as normal?  As far as they would be concerned, the in-progress
downsize has already happened, though it could be reverted later if
the eviction phase fails.
Then the coordinator could start evicting buffers and truncating the
shared memory object, which are phases/steps, sure, but it's not clear
to me why they need other backends' help.

It sounds like Windows might need a second ProcSignalBarrier poke in
order to call VirtualUnlock() in every backend.  That's based on that
Usenet discussion I lobbed in here the other day; I haven't tried it
myself or fully grokked why it works, and there could well be other
ways, IDK.  Assuming it's the right approach, between the first poke
to make all backends accept the new lower size and the second poke to
unlock the memory, I don't see why they need to wait.  I suppose it
would be the same ProcSignalBarrier, but behave differently based on a
control variable.  I suppose there could also be a third poke, if you
want to consider the operation to be fully complete only once they
have all actually done that unlock step, but it may also be OK not to
worry about that, IDK.

On the other hand, maybe it just feels less risky if you stop the
whole world, or maybe you envisage parallelising the eviction work, or
there is some correctness concern I haven't grokked yet, but what?

> > *You might also want to use fallocate after ftruncate on Linux to
> > avoid SIGBUS on allocation failure on first touch page fault, which
> > raises portability questions since it's unspecified whether you can do
> > that with shm fds and fails on some systems, but let's call that an
> > independent topic as it's not affected by this choice.
>
> I'm afraid it would be strictly necessary to do fallocate, otherwise we're
> back where we were before reservation accounting for huge pages in Linux (lots
> of people were facing unexpected SIGBUS when dealing with cgroups).

Yeah.  FWIW here is where we decided to gate that on __linux__ while
fixing that for DSM:

https://www.postgresql.org/message-id/flat/CAEepm%3D0euOKPaYWz0-gFv9xfG%2B8ptAjhFjiQEX0CCJaYN--sDQ%40mail.gmail.com#c81b941d300f04d382472e6414cec1f4
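
For illustration, here is a minimal standalone sketch of the two pieces
discussed above: the name-then-unlink POSIX fallback, and a resize step
that follows ftruncate() with posix_fallocate() when growing, gated on
__linux__.  The function names anon_shm_create and anon_shm_resize are
hypothetical, random() stands in for pg_prng_uint32(), and none of this
is code from the patch under discussion; it only assumes that
posix_fallocate() works on shm descriptors, which holds on Linux but is
unspecified elsewhere.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a nameless shared memory object using only POSIX calls: pick a
 * random name, create it exclusively, and unlink it straight away.
 * Returns a file descriptor, or -1 with errno set.
 */
static int
anon_shm_create(void)
{
    int         fd;

    do
    {
        char        name[80];

        /* Assumption: random() replaces PostgreSQL's pg_prng_uint32(). */
        snprintf(name, sizeof(name), "/sketch.%ld", (long) random());
        fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd >= 0)
            shm_unlink(name);   /* a crash before this line leaks the name */
    } while (fd < 0 && errno == EEXIST);

    return fd;
}

/*
 * Resize the object to new_size bytes.  When growing on Linux, also reserve
 * the new range up front so that running out of memory is reported here
 * rather than as SIGBUS on the first page fault.
 * Returns 0 on success, or -1 with errno set.
 */
static int
anon_shm_resize(int fd, off_t old_size, off_t new_size)
{
    if (ftruncate(fd, new_size) != 0)
        return -1;

#ifdef __linux__
    if (new_size > old_size)
    {
        int         rc;

        /* posix_fallocate() returns an error number; it does not set errno. */
        do
        {
            rc = posix_fallocate(fd, old_size, new_size - old_size);
        } while (rc == EINTR);

        if (rc != 0)
        {
            errno = rc;
            return -1;
        }
    }
#endif

    return 0;
}
```

A caller would create the descriptor once, then call anon_shm_resize()
with the current and target sizes for each grow or shrink; shrinking is
just the ftruncate() call.  On older glibc versions shm_open() may need
-lrt at link time.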