Re: Changing shared_buffers without restart - Mailing list pgsql-hackers
From | Dmitry Dolgov |
---|---|
Subject | Re: Changing shared_buffers without restart |
Date | |
Msg-id | ywyt5gecsfjzxlvxbvingd2dps2dedpwq5l4xhfmbcci6xta2e@h7ygoxvep725 Whole thread Raw |
In response to | Re: Changing shared_buffers without restart (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>) |
Responses |
Re: Changing shared_buffers without restart
|
List | pgsql-hackers |
> On Fri, Apr 11, 2025 at 08:04:39PM GMT, Ashutosh Bapat wrote: > On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > Yes, you're right, plain dynamic Barrier does not ensure all available > > processes will be synchronized. I was aware about the scenario you > > describe, it's mentioned in commentaries for the resize function. I was > > under the impression this should be enough, but after some more thinking > > I'm not so sure anymore. Let me try to structure it as a list of > > possible corner cases that we need to worry about: > > > > * New backend spawned while we're busy resizing shared memory. Those > > should wait until the resizing is complete and get the new size as well. > > > > * Old backend receives a resize message, but exits before attempting to > > resize. Those should be excluded from coordination. > > Should we detach barrier in on_exit()? Yeah, good point. > > I think a relatively elegant solution is to extend ProcSignalBarrier > > mechanism to track not only pss_barrierGeneration, as a sign that > > everything was processed, but also something like > > pss_barrierReceivedGeneration, indicating that the message was received > > everywhere but not processed yet. That would be enough to allow > > processes to wait until the resize message was received everywhere, then > > use a global Barrier to wait until all processes are finished. It's > > somehow similar to your proposal to use two signals, but has less > > implementation overhead. > > The way it's implemented in v4 still has the disjoint group problem. > Assume backends p1, p2, p3. All three of them are executing > ProcessProcSignalBarrier(). All three of them updated > pss_barrierReceivedGeneration > > /* The message is observed, record that */ > pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration, > shared_gen); > > p1, p2 moved faster and reached following code from ProcessBarrierShmemResize() > if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED) > WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation)); > > Since all the processes have received the barrier message, p1, p2 move > ahead and go through all the next phases and finish resizing even > before p3 gets a chance to call ProcessBarrierShmemResize() and attach > itself to Barrier. This could happen because it processed some other > ProcSignalBarrier message. p1 and p2 won't wait for p3 since it has > not attached itself to the barrier. Once p1, p2 finish, p3 will attach > itself to the barrier and resize buffers again - reinitializing the > shared memory, which might has been already modified by p1 or p2. Boom > - there's memory corruption. It won't reinitialize anything, since this logic is controlled by the ShmemCtrl->NSharedBuffers, if it's already updated nothing will be changed. About the race condition you mention, there is indeed a window between receiving the ProcSignalBarrier and attaching to the global Barrier in resize, but I don't think any process will be able to touch buffer pool while inside this window. Even if it happens that the remapping itself was blazing fast that this window was enough to make one process late (e.g. if it was busy handling some other signal as you mention), as I've showed above it shouldn't be a problem. I can experiment with this case though, maybe there is a way to completely close this window to not thing about even potential scenarios. > > * Shared memory address space is now reserved for future usage, making > > shared memory segments clash (e.g. due to memory allocation) > > impossible. There is a new GUC to control how much space to reserve, > > which is called max_available_memory -- on the assumption that most of > > the time it would make sense to set its value to the total amount of > > memory on the machine. I'm open for suggestions regarding the name. > > With 0006 applied > + /* Clean up some reserved space to resize into */ > + if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1) > ze, m->shmem))); > ... snip ... > + ptr = mremap(m->shmem, m->shmem_size, new_size, 0); > > We unmap the portion of reserved address space where the existing > segment would expand into. As long as we are just expanding this will > work. I am wondering how would this work for shrinking buffers? What > scheme do you have in mind? I didn't like this part originally, and after changes to support hugetlb I think it's worth it just to replace mremap with munmap/mmap. That way there will be no such question, e.g. if a segment is getting shrinked the unmapped area will again become a part of the reserved space. > > * There is one more patch to address hugepages remap. As mentioned in > > this thread above, Linux kernel has certain limitations when it comes > > to mremap for segments allocated with huge pages. To work around it's > > possible to replace mremap with a sequence of unmap and map again, > > relying on the anon file behind the segment to keep the memory > > content. I haven't found any downsides of this approach so far, but it > > makes the anonymous file patch 0007 mandatory. > > In 0008 > if (munmap(m->shmem, m->shmem_size) < 0) > ... snip ... > /* Resize the backing anon file. */ > if(ftruncate(m->segment_fd, new_size) == -1) > ... > /* Reclaim the space */ > ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE, > mmap_flags | MAP_FIXED, m->segment_fd, 0); > > How are we preventing something get mapped into the space after > m->shmem + newsize? We will need to add an unallocated but reserved > addressed space map after m->shmem+newsize right? Nope, the segment is allocated from the reserved space already, with some chunk of it left after the segment's end for resizing purposes. We only take some part of the designated space, the rest is still reserved.
pgsql-hackers by date: