Thread: Re: Changing shared_buffers without restart
> On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> changing shared memory mapping layout. Any feedback is appreciated.

Hi,

Here is a new version of the patch, which contains a proposal about how to
coordinate shared memory resizing between backends. The rest is more or less
the same; feedback about the coordination is appreciated. It's a lot to read,
but the main differences are:

1. Allow decoupling a GUC value change from actually applying it, a sort of
"pending" change. The idea is to let custom logic be triggered from an assign
hook, which then takes responsibility for what happens later and how the change
is applied. This makes it possible to use the regular GUC infrastructure in
cases where a value change requires some complicated processing. I was trying
to keep the change minimally invasive; GUC reporting is still missing.

2. The shared memory resizing patch became more complicated due to the
coordination between backends. The current implementation was chosen from a few
more or less equal alternatives, which evolve along the following lines:

* There should be one "coordinator" process overseeing the change. Having the
postmaster fulfill this role, as in this patch, seems like a natural idea, but
it poses certain challenges since the postmaster doesn't have locking
infrastructure. Another option would be to elect a single backend as
coordinator, handling the postmaster as a special case. If there is ever a
"coordinator" worker in Postgres, it would be useful here.

* The coordinator uses EmitProcSignalBarrier to reach out to all other backends
and trigger the resize process. Backends join a Barrier to synchronize and wait
until everyone is finished.

* Some resizing state is stored in shared memory to handle backends that were
late for some reason or didn't receive the signal. What exactly to store there
is open for discussion.

* Since we want to make sure all processes share the same understanding of what
the NBuffers value is, any failure is mostly a hard stop: rolling back the
change would require coordination as well, which sounds a bit too complicated
for now.

We've tested this change manually so far, although it might be useful to try
out injection points. The testing strategy, which has caught plenty of bugs,
was simply to run a pgbench workload against a running instance and change
shared_buffers on the fly. Some more subtle cases were verified by manually
injecting delays to trigger the expected scenarios.

To reiterate, here is the patch breakdown:

Patches 1-3 prepare the infrastructure and the shared memory layout. They could
be useful even with a multithreaded PostgreSQL, when there will be no need for
shared memory. I assume that in the multithreaded world there will still be a
need for a contiguous chunk of memory to share between threads, and its layout
would be similar to the one with shared memory mappings. Note that patch nr 2
is going away as soon as I get to implementing shared memory address
reservation, but for now it's needed.

Patch 4 is a new addition to handle "pending" GUC changes.

Patch 5 actually does the resizing. It's shared memory specific of course, and
it uses the Linux-specific mremap, which leaves portability questions open.

Patch 6 is somewhat independent, but quite convenient to have. It also uses the
Linux-specific memfd_create call.

I would like to get some feedback on the synchronization part.
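To make the intended flow easier to discuss, here is a rough outline in code.
It is for illustration only: the barrier type, the shared Barrier and the
helper function below are placeholders, not the actual names or code from the
patch.

    #include "postgres.h"

    #include "storage/barrier.h"
    #include "storage/procsignal.h"

    extern Barrier *ShmemResizeBarrier;         /* placeholder, lives in shared memory */
    extern void ResizeLocalShmemMapping(void);  /* placeholder for the remapping step */

    /* Coordinator side: signal every backend, then wait for all of them. */
    static void
    CoordinateShmemResize(void)
    {
        /* PROCSIGNAL_BARRIER_SHMEM_RESIZE is a placeholder barrier type. */
        uint64      generation = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);

        WaitForProcSignalBarrier(generation);
    }

    /* Backend side, invoked from the ProcSignal barrier handler. */
    static bool
    ProcessBarrierShmemResize(void)
    {
        BarrierAttach(ShmemResizeBarrier);

        ResizeLocalShmemMapping();

        /* Nobody proceeds with the new NBuffers until everyone has remapped. */
        BarrierArriveAndWait(ShmemResizeBarrier, 0);
        BarrierDetach(ShmemResizeBarrier);

        return true;
    }

The open question mentioned above is which process runs the coordinator part,
since the postmaster cannot use all of this infrastructure directly.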
While waiting, I'll proceed with implementing shared memory address space reservation, and Ashutosh will continue with buffer eviction to support shared memory reduction.
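As a small standalone illustration of why the address space reservation matters
for the mremap-based resizing (a demo only, not how the patch itself is
structured): growing a mapping in place, without MREMAP_MAYMOVE, only succeeds
if the address range right above it is free, so the idea is to reserve the full
range up front and let the backed part grow into it.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int
    main(void)
    {
        size_t  reserve = 2UL << 30;    /* total reserved address space: 2 GB */
        size_t  initial = 1UL << 30;    /* initially backed size: 1 GB */
        char   *base;

        /* Reserve the whole range without backing memory. */
        base = mmap(NULL, reserve, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED)
        {
            perror("mmap reserve");
            return 1;
        }

        /* Free the upper part of the reservation, so mremap can grow into it. */
        munmap(base + initial, reserve - initial);

        /* Make the initial part usable. */
        if (mprotect(base, initial, PROT_READ | PROT_WRITE) != 0)
        {
            perror("mprotect");
            return 1;
        }
        memset(base, 'x', initial);

        /*
         * Grow in place: without MREMAP_MAYMOVE this only succeeds because the
         * range right above is free; existing pointers into the mapping stay
         * valid, and the contents are preserved.
         */
        if (mremap(base, initial, reserve, 0) == MAP_FAILED)
        {
            perror("mremap");
            return 1;
        }

        printf("grown in place at %p, first byte: %c\n", base, base[0]);
        return 0;
    }

Patch 2's offset-based placement serves a similar purpose until the reservation
is implemented.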
Attachment
- v2-0001-Allow-to-use-multiple-shared-memory-mappings.patch
- v2-0002-Allow-placing-shared-memory-mapping-with-an-offse.patch
- v2-0003-Introduce-multiple-shmem-segments-for-shared-buff.patch
- v2-0004-Introduce-pending-flag-for-GUC-assign-hooks.patch
- v2-0005-Allow-to-resize-shared-memory-without-restart.patch
- v2-0006-Use-anonymous-files-to-back-shared-memory-segment.patch
> On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:
> > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > changing shared memory mapping layout. Any feedback is appreciated.
>
> Hi,
>
> Here is a new version of the patch, which contains a proposal about how to
> coordinate shared memory resizing between backends. The rest is more or less
> the same; feedback about the coordination is appreciated. It's a lot to read,
> but the main differences are:

Just one note: there are still a couple of compilation warnings in the code,
which I haven't addressed yet. Those will go away in the next version.
On Thu, Feb 27, 2025 at 1:58 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:
> > > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > > changing shared memory mapping layout. Any feedback is appreciated.
> >
> > Hi,
> >
> > Here is a new version of the patch, which contains a proposal about how to
> > coordinate shared memory resizing between backends. The rest is more or less
> > the same; feedback about the coordination is appreciated. It's a lot to
> > read, but the main differences are:
>
> Just one note: there are still a couple of compilation warnings in the
> code, which I haven't addressed yet. Those will go away in the next
> version.

PFA the patchset which implements shrinking shared buffers.

0001-0006 are the same as the previous patchset.

0007 fixes compilation warnings from the previous patches - I think those
should be absorbed into their respective patches.

0008 adds TODOs that need some code changes or at least some consideration.
Some of them might point to the causes of the Assertion failures seen with this
patch set.

0009 adds WIP support for shrinking shared buffers - I think this should be
absorbed into 0005.

0010 is a WIP fix for Assertion failures seen from BgBufferSync() - I am still
investigating those.

I am using the attached script to shake the patch well. It runs pgbench and
concurrently resizes shared_buffers. I am seeing Assertion failures when
running the script in both cases, expanding and shrinking the buffers. I am
investigating the failed Assert("strategy_delta >= 0") next.

--
Best Wishes,
Ashutosh Bapat
Attachment
- 0004-Introduce-pending-flag-for-GUC-assign-hooks-20250228.patch
- 0003-Introduce-multiple-shmem-segments-for-share-20250228.patch
- 0005-Allow-to-resize-shared-memory-without-resta-20250228.patch
- 0002-Allow-placing-shared-memory-mapping-with-an-20250228.patch
- 0001-Allow-to-use-multiple-shared-memory-mapping-20250228.patch
- 0006-Use-anonymous-files-to-back-shared-memory-s-20250228.patch
- 0010-WIP-Reinitialize-buffer-sync-strategy-20250228.patch
- 0007-Fix-compilation-failures-in-previous-patche-20250228.patch
- 0009-WIP-Support-shrinking-shared-buffers-20250228.patch
- 0008-Add-TODOs-and-questions-about-previous-comm-20250228.patch
- pgbench-concurrent-resize-buffers.sh
On Tue, Feb 25, 2025 at 3:22 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > changing shared memory mapping layout. Any feedback is appreciated.
>
> Hi,
>
> Here is a new version of the patch, which contains a proposal about how to
> coordinate shared memory resizing between backends. The rest is more or less
> the same; feedback about the coordination is appreciated. It's a lot to read,
> but the main differences are:

Thanks Dmitry for the summary.

> 1. Allow decoupling a GUC value change from actually applying it, a sort of
> "pending" change. The idea is to let custom logic be triggered from an assign
> hook, which then takes responsibility for what happens later and how the
> change is applied. This makes it possible to use the regular GUC
> infrastructure in cases where a value change requires some complicated
> processing. I was trying to keep the change minimally invasive; GUC reporting
> is still missing.
>
> 2. The shared memory resizing patch became more complicated due to the
> coordination between backends. The current implementation was chosen from a
> few more or less equal alternatives, which evolve along the following lines:
>
> * There should be one "coordinator" process overseeing the change. Having the
> postmaster fulfill this role, as in this patch, seems like a natural idea,
> but it poses certain challenges since the postmaster doesn't have locking
> infrastructure. Another option would be to elect a single backend as
> coordinator, handling the postmaster as a special case. If there is ever a
> "coordinator" worker in Postgres, it would be useful here.
>
> * The coordinator uses EmitProcSignalBarrier to reach out to all other
> backends and trigger the resize process. Backends join a Barrier to
> synchronize and wait until everyone is finished.
>
> * Some resizing state is stored in shared memory to handle backends that were
> late for some reason or didn't receive the signal. What exactly to store
> there is open for discussion.
>
> * Since we want to make sure all processes share the same understanding of
> what the NBuffers value is, any failure is mostly a hard stop: rolling back
> the change would require coordination as well, which sounds a bit too
> complicated for now.

I think we should add a way to monitor the progress of resizing; at least
whether resizing is complete and whether the new GUC value is in effect.

> We've tested this change manually so far, although it might be useful to try
> out injection points. The testing strategy, which has caught plenty of bugs,
> was simply to run a pgbench workload against a running instance and change
> shared_buffers on the fly. Some more subtle cases were verified by manually
> injecting delays to trigger the expected scenarios.

I have shared a script with my changes but it's far from being full testing.
We will need to use injection points to test specific scenarios.

--
Best Wishes,
Ashutosh Bapat
Dmitry / Ashutosh,
Thanks for the patch set. I've been doing some testing with it, in particular
to see whether this solution would work with a hugepage-backed buffer pool.

I ran some simple tests (outside of PG) on Linux kernel v6.1, which has this
commit that added some hugepage support to mremap:
https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/

From reading the kernel code and testing, for a hugepage-backed mapping it
seems mremap supports only shrinking but not growing. Further, for shrinking,
what I observed is that after mremap is called the hugepage memory is not
released back to the OS; rather, it's released when the fd is closed (or when
the memory is unmapped, for a mapping created with MAP_ANONYMOUS).
I'm not sure if this behavior is expected, but being able to release memory back to the OS immediately after mremap would be important for use cases such as supporting "serverless" PG instances on the cloud.
I'm no expert in the linux kernel so I could be missing something. It'd be great if you or somebody can comment on these observations and whether this mremap-based solution would work with hugepage bufferpool.
I also attached the test program in case someone can spot I did something wrong.
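For reference, the gist of what the test does is roughly the following (a
simplified sketch, not the attached program itself):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int
    main(void)
    {
        size_t  size = 64 * (2UL << 20);    /* 64 x 2 MB huge pages */
        char   *addr;

        addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (addr == MAP_FAILED)
        {
            perror("mmap");     /* requires vm.nr_hugepages to be configured */
            return 1;
        }

        /* Growing the hugepage-backed mapping fails... */
        if (mremap(addr, size, 2 * size, 0) == MAP_FAILED)
            perror("mremap grow");

        /*
         * ...while shrinking succeeds, but HugePages_Free in /proc/meminfo
         * does not go back up until the mapping is unmapped or the backing
         * fd is closed.
         */
        if (mremap(addr, size, size / 2, 0) == MAP_FAILED)
            perror("mremap shrink");

        getchar();              /* pause here to inspect /proc/meminfo */
        return 0;
    }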
Regards,
Jack Ng
Attachment
> On Thu, Mar 20, 2025 at 04:55:47PM GMT, Ni Ku wrote:
>
> I ran some simple tests (outside of PG) on Linux kernel v6.1, which has this
> commit that added some hugepage support to mremap:
> https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/
>
> From reading the kernel code and testing, for a hugepage-backed mapping it
> seems mremap supports only shrinking but not growing. Further, for shrinking,
> what I observed is that after mremap is called the hugepage memory is not
> released back to the OS; rather, it's released when the fd is closed (or when
> the memory is unmapped, for a mapping created with MAP_ANONYMOUS).
>
> I'm not sure if this behavior is expected, but being able to release memory
> back to the OS immediately after mremap would be important for use cases such
> as supporting "serverless" PG instances on the cloud.
>
> I'm no expert in the linux kernel so I could be missing something. It'd be
> great if you or somebody can comment on these observations and whether this
> mremap-based solution would work with hugepage bufferpool.

Hm, I think you're right. I didn't realize there is such a limitation, but I've
just verified on the latest kernel build and hit the same condition on growing
a hugetlb mapping that you've mentioned above. That's annoying of course, but
I've got another approach I was originally experimenting with -- instead of
mremap, do munmap and mmap with the new size, and rely on the anonymous fd to
keep the memory contents in between. I'm currently reworking the mmap'ing part
of the patch, let me check if this new approach is something we could
universally rely on.
Thanks for your insights and confirmation, Dmitry.
Right, I think the anonymous fd approach would work to keep the memory contents intact between munmap and mmap with the new size, so buffer pool expansion would work.
But it seems shrinking would still be problematic, since that approach requires the anonymous fd to remain open (to preserve the memory contents), and so munmap would not release the memory back to the OS right away (it gets released when the fd is closed). From testing, this is true for hugepage memory at least.
Is there a way around this? Or maybe I misunderstood what you have in mind ;)
Regards,
Jack Ng
> On Fri, Mar 21, 2025 at 04:48:30PM GMT, Ni Ku wrote:
>
> Thanks for your insights and confirmation, Dmitry.
>
> Right, I think the anonymous fd approach would work to keep the memory
> contents intact between munmap and mmap with the new size, so buffer pool
> expansion would work.
>
> But it seems shrinking would still be problematic, since that approach
> requires the anonymous fd to remain open (to preserve the memory contents),
> and so munmap would not release the memory back to the OS right away (it gets
> released when the fd is closed). From testing, this is true for hugepage
> memory at least.
>
> Is there a way around this? Or maybe I misunderstood what you have in mind ;)

The anonymous file will be truncated to its new shrunken size before mapping it
the second time (I think this part is missing in your test example); to my
understanding, after a quick look at do_vmi_align_munmap, this should be enough
to make the memory reclaimable.
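To spell the sequence out, a minimal standalone sketch of what I have in mind
(illustrative only, not the patch code):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        size_t  old_size = 512UL << 20;     /* 512 MB */
        size_t  new_size = 256UL << 20;     /* shrink to 256 MB */
        int     fd;
        char   *addr;

        /* Anonymous file backing the segment; add MFD_HUGETLB for huge pages. */
        fd = memfd_create("shm_seg", 0);
        if (fd < 0 || ftruncate(fd, old_size) < 0)
        {
            perror("memfd_create/ftruncate");
            return 1;
        }

        addr = mmap(NULL, old_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }
        memset(addr, 'x', old_size);

        /*
         * Unmap, truncate the backing file to the new size, then map again.
         * The truncation is what releases the freed memory back to the OS;
         * the file keeps the remaining contents across the remap. (In the
         * real patch the new mapping also has to land at the same address in
         * every backend; here we just take whatever address we get.)
         */
        munmap(addr, old_size);
        if (ftruncate(fd, new_size) < 0)
        {
            perror("ftruncate");
            return 1;
        }
        addr = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }

        printf("remapped at %p, contents preserved: %c\n", addr, addr[0]);
        return 0;
    }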
You're right, Dmitry: truncating the anonymous file before mapping it again does the trick! I see 'HugePages_Free' increase to the expected value right after the ftruncate call for shrinking.
This alternative approach looks very promising. Thanks.
Regards,
Jack Ng