Thread: Re: Changing shared_buffers without restart

Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> changing shared memory mapping layout. Any feedback is appreciated.

Hi,

Here is a new version of the patch, which contains a proposal for how to
coordinate shared memory resizing between backends. The rest is more or less
the same; feedback about the coordination is appreciated. It's a lot to read,
but the main differences are:

1. Decoupling a GUC value change from actually applying it, as a sort of
"pending" change. The idea is to let custom logic be triggered from an assign
hook, which then takes responsibility for what happens later and how the change
is applied. This allows using the regular GUC infrastructure in cases where a
value change requires complicated processing. I tried to keep the change from
being too invasive; GUC reporting is still missing.

2. The shared memory resizing patch became more complicated because of the
coordination between backends. The current implementation was chosen from a few
more or less equal alternatives, which evolve along the following lines:

* There should be one "coordinator" process overseeing the change. Having the
postmaster fulfill this role, as in this patch, seems like a natural idea,
but it poses certain challenges since the postmaster doesn't have locking
infrastructure. Another option would be to elect a single backend as the
coordinator, which would handle the postmaster as a special case. If there is
ever a "coordinator" worker in Postgres, it would be useful here.

* The coordinator uses EmitProcSignalBarrier to reach out to all other backends
and trigger the resize process. Backends join a Barrier to synchronize and wait
until everyone is finished.

* There is some resizing state stored in shared memory, which is there to
handle backends that were for some reason late or didn't receive the signal.
What to store there is open for discussion.

* Since we want to make sure all processes share the same understanding of what
the NBuffers value is, any failure is mostly a hard stop: rolling back the
change would need coordination as well, which sounds too complicated for now.
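
To make the "pending change plus shared resizing state" idea concrete, here is
a minimal single-process sketch. It is not the patch's code: ResizeState,
Backend, request_resize and catch_up are hypothetical names, and a generation
counter is only one assumed answer to the open question of what to store in
shared memory. It shows how a backend that missed the signal could still
converge on the same NBuffers value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the resizing state kept in shared memory. */
typedef struct ResizeState
{
    int      target_nbuffers;   /* value requested by the latest change */
    unsigned generation;        /* bumped once per requested resize */
} ResizeState;

/* Hypothetical per-backend view of the buffer pool size. */
typedef struct Backend
{
    int      nbuffers;          /* value this backend currently uses */
    unsigned seen_generation;   /* last generation this backend applied */
} Backend;

/*
 * Plays the role of the assign hook: the new value is only recorded as
 * pending, nothing is applied to any backend yet.
 */
static void
request_resize(ResizeState *shared, int new_nbuffers)
{
    shared->target_nbuffers = new_nbuffers;
    shared->generation++;
}

/*
 * Called when the signal arrives, or later by a backend that missed it.
 * Returns true if the backend had to apply a pending change.
 */
static bool
catch_up(const ResizeState *shared, Backend *backend)
{
    if (backend->seen_generation == shared->generation)
        return false;

    backend->nbuffers = shared->target_nbuffers;
    backend->seen_generation = shared->generation;
    return true;
}
```

A real implementation would also need locking or atomics around the shared
state and a barrier so that nobody proceeds until all backends have caught up;
both are left out here.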

We've tested this change only manually so far, although it might be useful to
try out injection points. The testing strategy, which has caught plenty of
bugs, was simply to run a pgbench workload against a running instance and
change shared_buffers on the fly. Some more subtle cases were verified by
manually injecting delays to trigger the expected scenarios.

To reiterate, here is the patch breakdown:

Patches 1-3 prepare the infrastructure and shared memory layout. They could be
useful even with multithreaded PostgreSQL, when there will be no need for
shared memory. I assume that in the multithreaded world there will still be a
need for a contiguous chunk of memory to share between threads, and its layout
would be similar to the one with shared memory mappings. Note that patch 2 will
go away as soon as I implement shared memory address reservation, but for now
it's needed.

Patch 4 is a new addition to handle "pending" GUC changes.

Patch 5 actually does the resizing. It's shared memory specific of course, and
uses the Linux-specific mremap, which leaves portability questions open.

Patch 6 is somewhat independent, but quite convenient to have. It also uses a
Linux-specific call, memfd_create.
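
For readers unfamiliar with the two Linux-specific calls mentioned above, here
is a minimal sketch of how memfd_create and mremap can be combined to grow a
shared mapping. It is an illustration under assumptions, not the patch's code:
create_segment and grow_segment are hypothetical names, the mapping uses
regular pages rather than huge pages, and MREMAP_MAYMOVE is used for simplicity
even though a real implementation would likely want the address to stay fixed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create an anonymous in-memory file of `size` bytes and map it shared.
 * Returns the mapping (or MAP_FAILED) and stores the fd in *fd_out.
 */
static void *
create_segment(size_t size, int *fd_out)
{
    int   fd = memfd_create("buffers_sketch", 0);
    void *addr;

    if (fd < 0)
        return MAP_FAILED;
    if (ftruncate(fd, (off_t) size) != 0)
    {
        close(fd);
        return MAP_FAILED;
    }
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
    {
        close(fd);
        return MAP_FAILED;
    }
    *fd_out = fd;
    return addr;
}

/*
 * Grow the backing file first, then extend the mapping with mremap.
 * Returns the (possibly moved) mapping, or MAP_FAILED.
 */
static void *
grow_segment(int fd, void *addr, size_t old_size, size_t new_size)
{
    if (ftruncate(fd, (off_t) new_size) != 0)
        return MAP_FAILED;
    return mremap(addr, old_size, new_size, MREMAP_MAYMOVE);
}
```

The file is extended before the mapping so that touching the new tail of the
mapping does not fault beyond the end of the file.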

I would like to get some feedback on the synchronization part. While waiting,
I'll proceed with implementing shared memory address space reservation, and
Ashutosh will continue with buffer eviction to support shared memory reduction.

Attachment

Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:
> > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > changing shared memory mapping layout. Any feedback is appreciated.
>
> Hi,
>
> Here is a new version of the patch, which contains a proposal about how to
> coordinate shared memory resizing between backends. The rest is more or less
> the same, a feedback about coordination is appreciated. It's a lot to read, but
> the main difference is about:

Just one note: there are still a couple of compilation warnings in the
code, which I haven't addressed yet. Those will go away in the next
version.



Re: Changing shared_buffers without restart

From
Ashutosh Bapat
Date:
On Thu, Feb 27, 2025 at 1:58 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:
> > > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > > changing shared memory mapping layout. Any feedback is appreciated.
> >
> > Hi,
> >
> > Here is a new version of the patch, which contains a proposal about how to
> > coordinate shared memory resizing between backends. The rest is more or less
> > the same, a feedback about coordination is appreciated. It's a lot to read, but
> > the main difference is about:
>
> Just one note, there are still couple of compilation warnings in the
> code, which I haven't addressed yet. Those will go away in the next
> version.

PFA the patchset which implements shrinking shared buffers.
0001-0006 are the same as in the previous patchset.
0007 fixes compilation warnings from the previous patches. I think those
fixes should be absorbed into their respective patches.
0008 adds TODOs that need some code changes or at least need some
consideration. Some of them might point to the causes of the Assertion
failures seen with this patch set.
0009 adds WIP support for shrinking shared buffers. I think this
should be absorbed into 0005.
0010 is a WIP fix for the Assertion failures seen from BgBufferSync(). I am
still investigating those.

I am using the attached script to shake the patch well. It runs
pgbench and concurrently resizes shared_buffers. I am seeing
Assertion failures when running the script in both cases, expanding
and shrinking the buffers. I am investigating the failed
Assert("strategy_delta >= 0") next.

--
Best Wishes,
Ashutosh Bapat

Attachment

Re: Changing shared_buffers without restart

From
Ashutosh Bapat
Date:
On Tue, Feb 25, 2025 at 3:22 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > changing shared memory mapping layout. Any feedback is appreciated.
>
> Hi,
>
> Here is a new version of the patch, which contains a proposal about how to
> coordinate shared memory resizing between backends. The rest is more or less
> the same, a feedback about coordination is appreciated. It's a lot to read, but
> the main difference is about:

Thanks Dmitry for the summary.

>
> 1. Allowing to decouple a GUC value change from actually applying it, sort of a
> "pending" change. The idea is to let a custom logic be triggered on an assign
> hook, and then take responsibility for what happens later and how it's going to
> be applied. This allows to use regular GUC infrastructure in cases where value
> change requires some complicated processing. I was trying to make the change
> not so invasive, plus it's missing GUC reporting yet.
>
> 2. Shared memory resizing patch became more complicated thanks to some
> coordination between backends. The current implementation was chosen from few
> more or less equal alternatives, which are evolving along following lines:
>
> * There should be one "coordinator" process overseeing the change. Having
> postmaster to fulfill this role like in this patch seems like a natural idea,
> but it poses certain challenges since it doesn't have locking infrastructure.
> Another option would be to elect a single backend to be a coordinator, which
> will handle the postmaster as a special case. If there will ever be a
> "coordinator" worker in Postgres, that would be useful here.
>
> * The coordinator uses EmitProcSignalBarrier to reach out to all other backends
> and trigger the resize process. Backends join a Barrier to synchronize and wait
> until everyone is finished.
>
> * There is some resizing state stored in shared memory, which is there to
> handle backends that were for some reason late or didn't receive the signal.
> What to store there is open for discussion.
>
> * Since we want to make sure all processes share the same understanding of what
> NBuffers value is, any failure is mostly a hard stop, since to rollback the
> change coordination is needed as well and sounds a bit too complicated for now.
>

I think we should add a way to monitor the progress of resizing; at
least whether resizing is complete and whether the new GUC value is in
effect.

> We've tested this change manually for now, although it might be useful to try
> out injection points. The testing strategy, which has caught plenty of bugs,
> was simply to run pgbench workload against a running instance and change
> shared_buffers on the fly. Some more subtle cases were verified by manually
> injecting delays to trigger expected scenarios.

I have shared a script with my changes, but it's far from full
testing. We will need to use injection points to test specific
scenarios.

--
Best Wishes,
Ashutosh Bapat



Re: Changing shared_buffers without restart

From
Ni Ku
Date:
Dmitry / Ashutosh,
Thanks for the patch set. I've been doing some testing with it, and in particular I want to see whether this solution would work with a hugepage-backed buffer pool.

I ran some simple tests (outside of PG) on Linux kernel v6.1, which has this commit that added some hugepage support to mremap (https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/).

From reading the kernel code and testing, it seems that for a hugepage-backed mapping mremap supports only shrinking, not growing. Further, for shrinking, what I observed is that after mremap is called the hugepage memory is not released back to the OS; rather, it's released when the fd is closed (or when the memory is unmapped, for a mapping created with MAP_ANONYMOUS).
I'm not sure if this behavior is expected, but being able to release memory back to the OS immediately after mremap would be important for use cases such as supporting "serverless" PG instances on the cloud.

I'm no expert in the Linux kernel, so I could be missing something. It'd be great if you or somebody else could comment on these observations and on whether this mremap-based solution would work with a hugepage-backed buffer pool.

I also attached the test program in case someone can spot something I did wrong.

Regards,

Jack Ng

Attachment

Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Thu, Mar 20, 2025 at 04:55:47PM GMT, Ni Ku wrote:
>
> I ran some simple tests (outside of PG) on linux kernel v6.1, which has
> this commit that added some hugepage support to mremap (
> https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/
> ).
>
> From reading the kernel code and testing, for a hugepage-backed mapping it
> seems mremap supports only shrinking but not growing. Further, for
> shrinking, what I observed is that after mremap is called the hugepage
> memory
> is not released back to the OS, rather it's released when the fd is closed
> (or when the memory is unmapped for a mapping created with MAP_ANONYMOUS).
> I'm not sure if this behavior is expected, but being able to release memory
> back to the OS immediately after mremap would be important for use cases
> such as supporting "serverless" PG instances on the cloud.
>
> I'm no expert in the linux kernel so I could be missing something. It'd be
> great if you or somebody can comment on these observations and whether this
> mremap-based solution would work with hugepage bufferpool.

Hm, I think you're right. I didn't realize there is such a limitation, but I
just verified on the latest kernel build and hit the same condition you've
mentioned above when increasing a hugetlb mapping. That's annoying of
course, but I've got another approach I was originally experimenting
with -- instead of mremap, do munmap and mmap with the new size and rely
on the anonymous fd to keep the memory content in between. I'm currently
reworking the mmap'ing part of the patch; let me check if this new approach
is something we could universally rely on.
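
The munmap-plus-mmap alternative can be sketched roughly as follows. This is a
minimal illustration under assumptions, not the reworked patch: remap_segment
is a hypothetical helper, the fd is assumed to come from memfd_create with
regular pages, and the new mapping is placed wherever the kernel chooses
instead of at a reserved address.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Resize a shared mapping backed by `fd` by dropping it and mapping the
 * file again with the new size. The file, not the mapping, owns the
 * contents, so they survive the window where nothing is mapped.
 * Returns the new mapping, or MAP_FAILED.
 */
static void *
remap_segment(int fd, void *addr, size_t old_size, size_t new_size)
{
    if (munmap(addr, old_size) != 0)
        return MAP_FAILED;

    /* Resize the anonymous file to match the mapping we are about to make. */
    if (ftruncate(fd, (off_t) new_size) != 0)
        return MAP_FAILED;

    return mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

Between munmap and mmap the process has no view of the buffers at all, so in a
multi-backend setting this step would have to happen while everyone is stopped
at the synchronization point.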



Re: Changing shared_buffers without restart

From
Ni Ku
Date:
Thanks for your insights and confirmation, Dmitry.
Right, I think the anonymous fd approach would keep the memory contents intact between the munmap and the mmap with the new size, so buffer pool expansion would work.
But it seems shrinking would still be problematic, since that approach requires the anonymous fd to remain open (to protect the memory contents), and so munmap would not release the memory back to the OS right away (it gets released when the fd is closed). From testing, this is true for hugepage memory at least.
Is there a way around this? Or maybe I misunderstood what you have in mind ;)

Regards,

Jack Ng

Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Fri, Mar 21, 2025 at 04:48:30PM GMT, Ni Ku wrote:
> Thanks for your insights and confirmation, Dmitry.
> Right, I think the anonymous fd approach would work to keep the memory
> contents intact in between munmap and mmap with the new size, so bufferpool
> expansion would work.
> But it seems shrinking would still be problematic, since that approach
> requires the anonymous fd to remain open (for memory content protection),
> and so munmap would not release the memory back to the OS right away (gets
> released when the fd is closed). From testing this is true for hugepage
> memory at least.
> Is there a way around this? Or maybe I misunderstood what you have in mind
> ;)

The anonymous file will be truncated to its new, smaller size before
mapping it a second time (I think this part is missing in your test
example). To my understanding, after a quick look at do_vmi_align_munmap,
this should be enough to make the memory reclaimable.
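
A small sketch of the truncation step, to show how the effect can be observed
while the fd stays open. Everything here is an assumption for illustration:
fill_anon_file, allocated_blocks and truncate_and_count_released are
hypothetical helpers, the file uses regular pages, and st_blocks is used as the
indicator (for hugetlb-backed memory one would watch HugePages_Free in
/proc/meminfo instead).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create an anonymous file and fill `size` bytes so pages get allocated. */
static int
fill_anon_file(size_t size)
{
    char   chunk[4096];
    size_t done = 0;
    int    fd = memfd_create("shrink_sketch", 0);

    if (fd < 0)
        return -1;
    memset(chunk, 'x', sizeof(chunk));
    while (done < size)
    {
        size_t  want = size - done < sizeof(chunk) ? size - done : sizeof(chunk);
        ssize_t n = write(fd, chunk, want);

        if (n <= 0)
        {
            close(fd);
            return -1;
        }
        done += (size_t) n;
    }
    return fd;
}

/* 512-byte blocks currently allocated to the file, or -1 on error. */
static long long
allocated_blocks(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (long long) st.st_blocks;
}

/*
 * Truncate the file to new_size without closing it and report how many
 * blocks were released by the truncation, or -1 on error.
 */
static long long
truncate_and_count_released(int fd, size_t new_size)
{
    long long before = allocated_blocks(fd);
    long long after;

    if (before < 0 || ftruncate(fd, (off_t) new_size) != 0)
        return -1;
    after = allocated_blocks(fd);
    if (after < 0)
        return -1;
    return before - after;
}
```

The exact number of released blocks depends on the filesystem backing the
anonymous file, so the sketch only reports the difference rather than
promising a particular value.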



Re: Changing shared_buffers without restart

From
Ni Ku
Date:
You're right, Dmitry, truncating the anonymous file before mapping it again does the trick! I see 'HugePages_Free' increase to the expected size right after the ftruncate call for shrinking.
This alternative approach looks very promising. Thanks.

Regards,

Jack Ng
