Thread: Re: Changing shared_buffers without restart
> On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> changing shared memory mapping layout. Any feedback is appreciated.

Hi,

Here is a new version of the patch, which contains a proposal for how to
coordinate shared memory resizing between backends. The rest is more or less
the same; feedback about the coordination is appreciated. It's a lot to read,
but the main differences are:

1. Allow decoupling a GUC value change from actually applying it, a sort of
"pending" change. The idea is to let custom logic be triggered by an assign
hook, which then takes responsibility for what happens later and how the
change gets applied. This makes it possible to use the regular GUC
infrastructure in cases where a value change requires some complicated
processing. I was trying to keep the change unintrusive; it's also still
missing GUC reporting.

2. The shared memory resizing patch became more complicated thanks to the
coordination between backends. The current implementation was chosen from a
few more or less equal alternatives, which evolve along the following lines:

* There should be one "coordinator" process overseeing the change. Having the
postmaster fulfill this role, as in this patch, seems like a natural idea, but
it poses certain challenges since the postmaster doesn't have locking
infrastructure. Another option would be to elect a single backend as
coordinator, which would handle the postmaster as a special case. If there is
ever a "coordinator" worker in Postgres, it would be useful here.

* The coordinator uses EmitProcSignalBarrier to reach all other backends and
trigger the resize process. Backends join a Barrier to synchronize and wait
until everyone is finished.

* Some resizing state is stored in shared memory to handle backends that were
late for some reason or didn't receive the signal. What exactly to store there
is open for discussion.

* Since we want to make sure all processes share the same understanding of
what the NBuffers value is, any failure is mostly a hard stop: rolling back
the change would need coordination as well, which sounds a bit too complicated
for now.

We've tested this change manually so far, although it might be useful to try
out injection points. The testing strategy, which has caught plenty of bugs,
was simply to run a pgbench workload against a running instance and change
shared_buffers on the fly. Some more subtle cases were verified by manually
injecting delays to trigger the expected scenarios.

To reiterate, here is the patch breakdown:

Patches 1-3 prepare the infrastructure and shared memory layout. They could be
useful even with multithreaded PostgreSQL, when there will be no need for
shared memory. I assume that in the multithreaded world there will still be a
need for a contiguous chunk of memory to share between threads, and its layout
would be similar to the one used with shared memory mappings. Note that patch
nr 2 goes away as soon as I get to implementing shared memory address
reservation, but for now it's needed.

Patch 4 is a new addition to handle "pending" GUC changes.

Patch 5 does the actual resizing. It's shared memory specific of course, and
uses the Linux-specific mremap, meaning open portability questions.

Patch 6 is somewhat independent, but quite convenient to have. It uses the
Linux-specific call memfd_create.

I would like to get some feedback on the synchronization part.
While waiting, I'll proceed with implementing shared memory address space
reservation, and Ashutosh will continue with buffer eviction to support shared
memory reduction.
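[Editorial note: for illustration, here is a minimal standalone sketch (Linux-only, assumed names and sizes, not the patch itself) of the core mechanism patch 5 relies on: grow a file-backed shared segment in place with mremap, here combined with a memfd as in patch 6 so the backing object can grow too. Without MREMAP_MAYMOVE the mapping keeps its address, which is why adjacent address space must stay free.]

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    size_t old_size = 64UL * 1024 * 1024;
    size_t new_size = 128UL * 1024 * 1024;

    /* A file-backed shared segment; the anon file owns the pages */
    int fd = memfd_create("segment", 0);
    if (fd == -1 || ftruncate(fd, old_size) == -1) { perror("setup"); return 1; }

    char *p = mmap(NULL, old_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 'x', old_size);

    /* Grow the backing object first, then grow the mapping in place.
     * Flags = 0 (no MREMAP_MAYMOVE): the segment keeps its address, so
     * pointers into it stay valid in every process. */
    if (ftruncate(fd, new_size) == -1) { perror("ftruncate"); return 1; }
    char *q = mremap(p, old_size, new_size, 0);
    if (q == MAP_FAILED) { perror("mremap"); return 1; }

    q[new_size - 1] = 'y';                        /* new range is usable */
    printf("old content survives: %c\n", q[old_size - 1]);  /* prints 'x' */
    return 0;
}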
Attachment
- v2-0001-Allow-to-use-multiple-shared-memory-mappings.patch
- v2-0002-Allow-placing-shared-memory-mapping-with-an-offse.patch
- v2-0003-Introduce-multiple-shmem-segments-for-shared-buff.patch
- v2-0004-Introduce-pending-flag-for-GUC-assign-hooks.patch
- v2-0005-Allow-to-resize-shared-memory-without-restart.patch
- v2-0006-Use-anonymous-files-to-back-shared-memory-segment.patch
> On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:
> > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > changing shared memory mapping layout. Any feedback is appreciated.
>
> Hi,
>
> Here is a new version of the patch, which contains a proposal about how to
> coordinate shared memory resizing between backends. The rest is more or less
> the same, a feedback about coordination is appreciated. It's a lot to read, but
> the main difference is about:

Just one note: there are still a couple of compilation warnings in the code
which I haven't addressed yet. Those will go away in the next version.
On Thu, Feb 27, 2025 at 1:58 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:
> > > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > > changing shared memory mapping layout. Any feedback is appreciated.
> >
> > Hi,
> >
> > Here is a new version of the patch, which contains a proposal about how to
> > coordinate shared memory resizing between backends. The rest is more or less
> > the same, a feedback about coordination is appreciated. It's a lot to read, but
> > the main difference is about:
>
> Just one note, there are still couple of compilation warnings in the
> code, which I haven't addressed yet. Those will go away in the next
> version.

PFA the patchset which implements shrinking shared buffers.

0001-0006 are the same as the previous patchset.

0007 fixes compilation warnings from previous patches - I think those should
be absorbed into their respective patches.

0008 adds TODOs that need some code changes or at least need some
consideration. Some of them might point to the causes of the Assertion
failures seen with this patch set.

0009 adds WIP support for shrinking shared buffers - I think this should be
absorbed into 0005.

0010 is a WIP fix for Assertion failures seen from BgBufferSync() - I am still
investigating those.

I am using the attached script to shake the patch well. It runs pgbench and
concurrently resizes the shared_buffers. I am seeing Assertion failures when
running the script in both cases, expanding and shrinking the buffers. I am
investigating the failed Assert("strategy_delta >= 0") next.

--
Best Wishes,
Ashutosh Bapat
Attachment
- 0004-Introduce-pending-flag-for-GUC-assign-hooks-20250228.patch
- 0003-Introduce-multiple-shmem-segments-for-share-20250228.patch
- 0005-Allow-to-resize-shared-memory-without-resta-20250228.patch
- 0002-Allow-placing-shared-memory-mapping-with-an-20250228.patch
- 0001-Allow-to-use-multiple-shared-memory-mapping-20250228.patch
- 0006-Use-anonymous-files-to-back-shared-memory-s-20250228.patch
- 0010-WIP-Reinitialize-buffer-sync-strategy-20250228.patch
- 0007-Fix-compilation-failures-in-previous-patche-20250228.patch
- 0009-WIP-Support-shrinking-shared-buffers-20250228.patch
- 0008-Add-TODOs-and-questions-about-previous-comm-20250228.patch
- pgbench-concurrent-resize-buffers.sh
On Tue, Feb 25, 2025 at 3:22 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > changing shared memory mapping layout. Any feedback is appreciated.
>
> Hi,
>
> Here is a new version of the patch, which contains a proposal about how to
> coordinate shared memory resizing between backends. The rest is more or less
> the same, a feedback about coordination is appreciated. It's a lot to read, but
> the main difference is about:

Thanks Dmitry for the summary.

>
> 1. Allowing to decouple a GUC value change from actually applying it, sort of a
> "pending" change. The idea is to let a custom logic be triggered on an assign
> hook, and then take responsibility for what happens later and how it's going to
> be applied. This allows to use regular GUC infrastructure in cases where value
> change requires some complicated processing. I was trying to make the change
> not so invasive, plus it's missing GUC reporting yet.
>
> 2. Shared memory resizing patch became more complicated thanks to some
> coordination between backends. The current implementation was chosen from few
> more or less equal alternatives, which are evolving along following lines:
>
> * There should be one "coordinator" process overseeing the change. Having
> postmaster to fulfill this role like in this patch seems like a natural idea,
> but it poses certain challenges since it doesn't have locking infrastructure.
> Another option would be to elect a single backend to be a coordinator, which
> will handle the postmaster as a special case. If there will ever be a
> "coordinator" worker in Postgres, that would be useful here.
>
> * The coordinator uses EmitProcSignalBarrier to reach out to all other backends
> and trigger the resize process. Backends join a Barrier to synchronize and wait
> until everyone is finished.
>
> * There is some resizing state stored in shared memory, which is there to
> handle backends that were for some reason late or didn't receive the signal.
> What to store there is open for discussion.
>
> * Since we want to make sure all processes share the same understanding of what
> NBuffers value is, any failure is mostly a hard stop, since to rollback the
> change coordination is needed as well and sounds a bit too complicated for now.
>

I think we should add a way to monitor the progress of resizing; at
least whether resizing is complete and whether the new GUC value is in
effect.

> We've tested this change manually for now, although it might be useful to try
> out injection points. The testing strategy, which has caught plenty of bugs,
> was simply to run pgbench workload against a running instance and change
> shared_buffers on the fly. Some more subtle cases were verified by manually
> injecting delays to trigger expected scenarios.

I have shared a script with my changes but it's far from being full
testing. We will need to use injection points to test specific
scenarios.

--
Best Wishes,
Ashutosh Bapat
I ran some simple tests (outside of PG) on Linux kernel v6.1, which has this
commit that added some hugepage support to mremap:
https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/

From reading the kernel code and testing, for a hugepage-backed mapping it
seems mremap supports only shrinking but not growing. Further, for shrinking,
what I observed is that after mremap is called the hugepage memory is not
released back to the OS; rather, it's released when the fd is closed (or when
the memory is unmapped, for a mapping created with MAP_ANONYMOUS).

I'm no expert in the Linux kernel so I could be missing something. It'd be
great if you or somebody could comment on these observations and whether this
mremap-based solution would work with a hugepage-backed buffer pool.

I also attached the test program in case someone can spot that I did something
wrong.
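[Editorial note: the attached program is not reproduced here; the following is a hedged sketch of the same kind of standalone check (assumes 2MB default huge pages and preallocated pool, e.g. vm.nr_hugepages=8), illustrating the reported behavior: growing a hugetlb mapping fails, shrinking succeeds.]

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HP (2UL * 1024 * 1024)   /* assumed 2MB huge page size */

int main(void)
{
    char *p = mmap(NULL, 2 * HP, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
    p[0] = 1;

    /* Growing a hugetlb mapping: reported to fail on the kernels above */
    if (mremap(p, 2 * HP, 4 * HP, MREMAP_MAYMOVE) == MAP_FAILED)
        perror("mremap grow");

    /* Shrinking succeeds, but per the observation above the freed huge
     * page is not returned to the pool until the mapping goes away */
    if (mremap(p, 2 * HP, HP, 0) == MAP_FAILED)
        perror("mremap shrink");
    return 0;
}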
Regards,
Jack Ng
Attachment
> On Thu, Mar 20, 2025 at 04:55:47PM GMT, Ni Ku wrote:
>
> I ran some simple tests (outside of PG) on linux kernel v6.1, which has
> this commit that added some hugepage support to mremap (
> https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/
> ).
>
> From reading the kernel code and testing, for a hugepage-backed mapping it
> seems mremap supports only shrinking but not growing. Further, for
> shrinking, what I observed is that after mremap is called the hugepage
> memory is not released back to the OS, rather it's released when the fd is
> closed (or when the memory is unmapped for a mapping created with
> MAP_ANONYMOUS). I'm not sure if this behavior is expected, but being able
> to release memory back to the OS immediately after mremap would be
> important for use cases such as supporting "serverless" PG instances on
> the cloud.
>
> I'm no expert in the linux kernel so I could be missing something. It'd be
> great if you or somebody can comment on these observations and whether this
> mremap-based solution would work with hugepage bufferpool.

Hm, I think you're right. I didn't realize there is such a limitation, but I
just verified on the latest kernel build and hit the same condition on growing
a hugetlb mapping that you mention above. That's annoying of course, but I've
got another approach I was originally experimenting with -- instead of mremap,
do munmap and mmap with the new size, and rely on the anonymous fd to keep the
memory contents in between. I'm currently reworking the mmap'ing part of the
patch; let me check if this new approach is something we could universally
rely on.
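[Editorial note: a minimal sketch of the munmap/mmap-with-anon-fd approach described above, with assumed names and sizes. The anonymous file keeps the pages alive while no mapping exists, so the contents survive the remap.]

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    size_t old_size = 64UL * 1024 * 1024;
    size_t new_size = 128UL * 1024 * 1024;

    int fd = memfd_create("segment", 0);      /* anon file owns the pages */
    if (fd == -1 || ftruncate(fd, old_size) == -1) { perror("setup"); return 1; }

    char *p = mmap(NULL, old_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 'x', old_size);

    /* Unmap entirely, grow the anon file, map again: the fd keeps the
     * contents alive in between, so no data is lost. */
    if (munmap(p, old_size) == -1) { perror("munmap"); return 1; }
    if (ftruncate(fd, new_size) == -1) { perror("ftruncate"); return 1; }
    p = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("content preserved: %c\n", p[old_size - 1]);  /* prints 'x' */
    return 0;
}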
Right, I think the anonymous fd approach would work to keep the memory
contents intact between munmap and mmap with the new size, so buffer pool
expansion would work.

But it seems shrinking would still be problematic, since that approach
requires the anonymous fd to remain open (to preserve the memory contents),
and so munmap would not release the memory back to the OS right away (it gets
released when the fd is closed). From testing, this is true for hugepage
memory at least.
Regards,
Jack Ng
> On Fri, Mar 21, 2025 at 04:48:30PM GMT, Ni Ku wrote:
> Thanks for your insights and confirmation, Dmitry.
> Right, I think the anonymous fd approach would work to keep the memory
> contents intact in between munmap and mmap with the new size, so bufferpool
> expansion would work.
> But it seems shrinking would still be problematic, since that approach
> requires the anonymous fd to remain open (for memory content protection),
> and so munmap would not release the memory back to the OS right away (gets
> released when the fd is closed). From testing this is true for hugepage
> memory at least.
> Is there a way around this? Or maybe I misunderstood what you have in mind
> ;)

The anonymous file will be truncated to its new shrunken size before mapping
it a second time (I think this part is missing in your test example). To my
understanding, after a quick look at do_vmi_align_munmap, this should be
enough to make the memory reclaimable.
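[Editorial note: a hedged sketch of that shrink order of operations for a hugetlb-backed segment (assumed names; 2MB huge pages preallocated, MFD_HUGETLB needs Linux >= 4.14): truncate the anon file before remapping, so the truncated-away huge pages go back to the pool.]

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

#define HP (2UL * 1024 * 1024)   /* assumed 2MB huge page size */

int main(void)
{
    int fd = memfd_create("segment", MFD_HUGETLB);
    if (fd == -1 || ftruncate(fd, 4 * HP) == -1) { perror("setup"); return 1; }

    char *p = mmap(NULL, 4 * HP, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    p[0] = 1;
    p[2 * HP] = 1;                /* fault in pages on both sides of the cut */

    /* Shrink: unmap, truncate the anon file *before* remapping, then map
     * the smaller size again; pages beyond the new size are freed. */
    if (munmap(p, 4 * HP) == -1) { perror("munmap"); return 1; }
    if (ftruncate(fd, 2 * HP) == -1) { perror("ftruncate"); return 1; }
    p = mmap(NULL, 2 * HP, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("kept: %d\n", p[0]);   /* contents below the new size survive */
    return 0;
}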
On Fri, Feb 28, 2025 at 5:31 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> I think we should add a way to monitor the progress of resizing; at
> least whether resizing is complete and whether the new GUC value is in
> effect.
>

I further tested this approach by tracing the barrier synchronization using
the attached patch, which adds a bunch of elog() calls. I ran a pgbench load
and simultaneously executed the following commands in a psql connection:

#alter system set shared_buffers to '200MB';
ALTER SYSTEM
#select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

#show shared_buffers;
 shared_buffers
----------------
 200MB
(1 row)

#select count(*) from pg_stat_activity;
 count
-------
     6
(1 row)

#select pg_backend_pid(); - the backend where all these commands were executed
 pg_backend_pid
----------------
         878405
(1 row)

I see the following in the postgresql error logs.

2025-03-12 11:04:53.812 IST [878167] LOG: received SIGHUP, reloading configuration files
2025-03-12 11:04:53.813 IST [878405] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
-- not all backends have reloaded configuration.
2025-03-12 11:04:53.813 IST [878173] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878173] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.813 IST [878173] LOG: reached barrier phase 1
2025-03-12 11:04:53.813 IST [878171] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878172] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878171] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878172] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878340] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878340] STATEMENT: UPDATE pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.813 IST [878340] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878340] STATEMENT: UPDATE pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.813 IST [878338] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878338] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.813 IST [878339] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878339] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.813 IST [878338] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878338] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.813 IST [878339] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878339] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878341] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878337] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878337] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878168] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.814 IST [878172] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878171] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878340] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878340] STATEMENT: UPDATE pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.814 IST [878338] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878338] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.814 IST [878341] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878337] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878173] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878339] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878339] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.814 IST [878172] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878340] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878340] STATEMENT: UPDATE pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.814 IST [878341] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878339] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878339] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.814 IST [878171] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878338] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878338] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.814 IST [878337] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878337] LOG: buffer resizing operation finished at phase 4
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878168] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.814 IST [878168] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.814 IST [878168] LOG: reached barrier phase 1
2025-03-12 11:04:53.814 IST [878168] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878168] LOG: buffer resizing operation finished at phase 3
2025-03-12 11:04:53.815 IST [878169] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.815 IST [878169] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.815 IST [878169] LOG: reached barrier phase 1
2025-03-12 11:04:53.815 IST [878169] LOG: reached barrier phase 2
2025-03-12 11:04:53.815 IST [878169] LOG: buffer resizing operation finished at phase 3
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: attached when barrier was at phase 0
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: reached barrier phase 1
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: reached barrier phase 2
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: buffer resizing operation finished at phase 3
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;

To tell the story in short: pid 173 (for brevity I mention only the last three
digits of each PID) attached to the barrier first and immediately reached
phase 1. 171, 172, 340, 338, 339, 341 and 337 all attached to the barrier in
phase 1. All of these backends completed the phases in a synchronous fashion.
But 168, 169 and 405 were yet to attach to the barrier, since they hadn't
reloaded their configurations yet. Each of these backends then finished all
the phases independently of the others. For your reference:

#select pid, application_name, backend_type from pg_stat_activity where pid in (878169, 878168);
  pid   | application_name |   backend_type
--------+------------------+-------------------
 878168 |                  | checkpointer
 878169 |                  | background writer
(2 rows)

This is because BarrierArriveAndWait() only waits for all the attached
backends. It doesn't wait for backends which are yet to attach. I think what
we want is that *all* the backends execute all the phases synchronously and
wait for the others to finish. If we don't do that, there's a possibility that
some of them will see inconsistent buffer states, or even worse, may not have
the necessary memory mapped and resized - thus causing segfaults. Am I
correct?

I think what needs to be done is that every backend should wait for the other
backends to attach themselves to the barrier before moving to the first phase.
One way I can think of is to use two signal barriers - one to ensure that all
the backends have attached themselves, and a second for the actual resizing.
But then the postmaster needs to wait for all the processes to process the
first signal barrier, and the postmaster cannot wait on anything. Maybe
there's a way to poll, but I didn't find it. Does that mean that we have to
make some other backend the coordinator?

--
Best Wishes,
Ashutosh Bapat
> On Mon, Apr 07, 2025 at 11:50:46AM GMT, Ashutosh Bapat wrote:
> This is because the BarrierArriveAndWait() only waits for all the
> attached backends. It doesn't wait for backends which are yet to
> attach. I think what we want is *all* the backends should execute all
> the phases synchronously and wait for others to finish. If we don't do
> that, there's a possibility that some of them would see inconsistent
> buffer states or even worse may not have necessary memory mapped and
> resized - thus causing segfaults. Am I correct?
>
> I think what needs to be done is that every backend should wait for other
> backends to attach themselves to the barrier before moving to the
> first phase. One way I can think of is we use two signal barriers -
> one to ensure that all the backends have attached themselves and
> second for the actual resizing. But then the postmaster needs to wait for
> all the processes to process the first signal barrier. A postmaster can
> not wait on anything. Maybe there's a way to poll, but I didn't find
> it. Does that mean that we have to make some other backend a coordinator?

Yes, you're right, a plain dynamic Barrier does not ensure all available
processes will be synchronized. I was aware of the scenario you describe; it's
mentioned in the commentary for the resize function. I was under the
impression this should be enough, but after some more thinking I'm not so sure
anymore. Let me try to structure it as a list of possible corner cases that we
need to worry about:

* A new backend is spawned while we're busy resizing shared memory. Those
should wait until the resizing is complete and get the new size as well.

* An old backend receives a resize message, but exits before attempting to
resize. Those should be excluded from coordination.

* A backend is blocked and not responding before or after the
ProcSignalBarrier message was sent. I'm thinking about a failure situation
where one rogue backend is doing something without checking for interrupts. We
need to wait for those to become responsive, and potentially abort the shared
memory resize after some timeout.

* Backends join the barrier in disjoint groups with some time in between,
longer than what it takes to resize shared memory. That means relying only on
the shared dynamic barrier is not enough -- it will only synchronize the
resize procedure within those groups.

Out of those, I think the third poses some problems, e.g. if we're shrinking
the shared memory but one backend is accessing the buffer pool without
checking for interrupts. In the v3 implementation this won't be handled
correctly; other backends will ignore such a rogue process. Independently of
that, we could reason about the logic much more easily if it's guaranteed that
all the processes resizing shared memory wait for each other and start
simultaneously.

It looks like to achieve that we need a slightly different combination of a
global Barrier and the ProcSignalBarrier mechanism. We can't use
ProcSignalBarrier as it is, because processes need to wait for each other and
at the same time finish processing to bump the generation. We also can't use a
simple dynamic Barrier due to the possibility of disjoint groups of processes.
A static Barrier is not easier either, because we would somehow need to know
the exact number of processes, which might change over time.

I think a relatively elegant solution is to extend the ProcSignalBarrier
mechanism to track not only pss_barrierGeneration, as a sign that everything
was processed, but also something like pss_barrierReceivedGeneration,
indicating that the message was received everywhere but not processed yet.
That would be enough to allow processes to wait until the resize message was
received everywhere, then use a global Barrier to wait until all processes are
finished. It's somewhat similar to your proposal to use two signals, but with
less implementation overhead.

This would also allow different solutions regarding error handling. E.g. we
could wait unbounded for all the processes we expect to resize, assuming that
the user will be able to intervene and fix an issue if there is any. Or we
could do a timed wait and abort the resize after some timeout if not all
processes are ready yet. The new v4 version of the patch implements the first
option. On top of that there are the following changes:

* Shared memory address space is now reserved for future usage, making clashes
of shared memory segments (e.g. due to memory allocation) impossible. There is
a new GUC to control how much space to reserve, called max_available_memory --
on the assumption that most of the time it would make sense to set its value
to the total amount of memory on the machine. I'm open to suggestions
regarding the name.

* There is one more patch to address hugepage remapping. As mentioned in this
thread above, the Linux kernel has certain limitations when it comes to mremap
for segments allocated with huge pages. To work around that, it's possible to
replace mremap with a sequence of unmap and map again, relying on the anon
file behind the segment to keep the memory contents. I haven't found any
downsides of this approach so far, but it makes the anonymous file patch 0007
mandatory.
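[Editorial note: a hedged sketch of one way such an address space reservation can work (the patch's exact mechanics may differ; names and sizes are assumed): reserve a large PROT_NONE range up front, which costs no memory, and carve real segments out of it with MAP_FIXED, leaving the rest reserved for resizing.]

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t reserved = 4UL * 1024 * 1024 * 1024;  /* think max_available_memory */
    size_t segment  = 256UL * 1024 * 1024;

    /* Reserve address space only: PROT_NONE + MAP_NORESERVE consumes no
     * memory, but keeps other allocations out of the range. */
    char *base = mmap(NULL, reserved, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) { perror("reserve"); return 1; }

    /* Carve a real shared segment out of the reservation with MAP_FIXED;
     * the space after its end stays reserved for future growth. */
    char *seg = mmap(base, segment, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (seg == MAP_FAILED) { perror("segment"); return 1; }
    seg[0] = 1;

    printf("segment at %p inside reservation at %p\n",
           (void *) seg, (void *) base);
    return 0;
}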
Attachment
- v4-0001-Allow-to-use-multiple-shared-memory-mappings.patch
- v4-0002-Address-space-reservation-for-shared-memory.patch
- v4-0003-Introduce-multiple-shmem-segments-for-shared-buff.patch
- v4-0004-Introduce-pending-flag-for-GUC-assign-hooks.patch
- v4-0005-Introduce-pss_barrierReceivedGeneration.patch
- v4-0006-Allow-to-resize-shared-memory-without-restart.patch
- v4-0007-Use-anonymous-files-to-back-shared-memory-segment.patch
- v4-0008-Support-resize-for-hugetlb.patch
On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> In the new v4 version
> of the patch the first option is implemented.
>

The patches don't apply cleanly using git am, but patch -p1 applies them
cleanly. However, I see the following compilation errors:

RuntimeError: command "ninja" failed with error
[1/1954] Generating src/include/utils/errcodes with a custom command
[2/1954] Generating src/include/storage/lwlocknames_h with a custom command
[3/1954] Generating src/include/utils/wait_event_names with a custom command
[4/1954] Compiling C object src/port/libpgport.a.p/pg_popcount_aarch64.c.o
[5/1954] Compiling C object src/port/libpgport.a.p/pg_numa.c.o
FAILED: src/port/libpgport.a.p/pg_numa.c.o
cc -Isrc/port/libpgport.a.p -Isrc/include -I../../coderoot/pg/src/include -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Werror -g -fno-strict-aliasing -fwrapv -fexcess-precision=standard -D_GNU_SOURCE -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security -Wdeclaration-after-statement -Wno-format-truncation -Wno-stringop-truncation -fPIC -DFRONTEND -MD -MQ src/port/libpgport.a.p/pg_numa.c.o -MF src/port/libpgport.a.p/pg_numa.c.o.d -o src/port/libpgport.a.p/pg_numa.c.o -c ../../coderoot/pg/src/port/pg_numa.c
In file included from ../../coderoot/pg/src/include/storage/spin.h:54,
                 from ../../coderoot/pg/src/include/storage/condition_variable.h:26,
                 from ../../coderoot/pg/src/include/storage/barrier.h:22,
                 from ../../coderoot/pg/src/include/storage/pg_shmem.h:27,
                 from ../../coderoot/pg/src/port/pg_numa.c:26:
../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error "s_lock.h may not be included from frontend code"
   93 | #error "s_lock.h may not be included from frontend code"
      |  ^~~~~
In file included from ../../coderoot/pg/src/port/pg_numa.c:26:
../../coderoot/pg/src/include/storage/pg_shmem.h:66:9: error: unknown type name ‘pg_atomic_uint32’
   66 |         pg_atomic_uint32 NSharedBuffers;
      |         ^~~~~~~~~~~~~~~~
../../coderoot/pg/src/include/storage/pg_shmem.h:68:9: error: unknown type name ‘pg_atomic_uint64’
   68 |         pg_atomic_uint64 Generation;
      |         ^~~~~~~~~~~~~~~~
../../coderoot/pg/src/port/pg_numa.c: In function ‘pg_numa_get_pagesize’:
../../coderoot/pg/src/port/pg_numa.c:117:17: error: too few arguments to function ‘GetHugePageSize’
  117 |                 GetHugePageSize(&os_page_size, NULL);
      |                 ^~~~~~~~~~~~~~~
In file included from ../../coderoot/pg/src/port/pg_numa.c:26:
../../coderoot/pg/src/include/storage/pg_shmem.h:127:13: note: declared here
  127 | extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
      |             ^~~~~~~~~~~~~~~

--
Best Wishes,
Ashutosh Bapat
> On Wed, Apr 09, 2025 at 11:12:18AM GMT, Ashutosh Bapat wrote:
> On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> > In the new v4 version
> > of the patch the first option is implemented.
> >
>
> The patches don't apply cleanly using git am but patch -p1 applies
> them cleanly. However I see following compilation errors
> RuntimeError: command "ninja" failed with error

Because it's relatively meaningless to apply a patch to the tip of master
around release freeze time :) Commit 65c298f61fc has introduced a new usage of
GetHugePageSize, which was modified in my patch. I'm going to address it with
the next rebased version; in the meantime you can always use the specified
base commit to apply the changeset:

base-commit: 5e1915439085014140314979c4dd5e23bd677cac
On Wed, Apr 9, 2025 at 1:15 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Wed, Apr 09, 2025 at 11:12:18AM GMT, Ashutosh Bapat wrote:
> > On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > >
> > > In the new v4 version
> > > of the patch the first option is implemented.
> > >
> >
> > The patches don't apply cleanly using git am but patch -p1 applies
> > them cleanly. However I see following compilation errors
> > RuntimeError: command "ninja" failed with error
>
> Becase it's relatively meaningless to apply a patch to the tip of the
> master around the release freeze time :) Commit 65c298f61fc has
> introduced new usage of GetHugePageSize, which was modified in my patch.
> I'm going to address it with the next rebased version, in the meantime
> you can always use the specified base commit to apply the changeset:
>
> base-commit: 5e1915439085014140314979c4dd5e23bd677cac

There is a higher chance that people will try these patches now than there was
two days ago, and more chance still if they find the patches easy to apply.

../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error
"s_lock.h may not be included from frontend code"

How about this? Why is that happening?

--
Best Wishes,
Ashutosh Bapat
> On Wed, Apr 09, 2025 at 01:20:16PM GMT, Ashutosh Bapat wrote:
> ../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error
> "s_lock.h may not be included from frontend code"
>
> How about this? Why is that happening?

The same -- as you can see, it comes from compiling pg_numa.c, which, as it
seems, is used in frontend code and imports pg_shmem.h. I wanted to reshuffle
the includes in the patch anyway; this is a good excuse to finally do it.
On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Yes, you're right, plain dynamic Barrier does not ensure all available
> processes will be synchronized. I was aware about the scenario you
> describe, it's mentioned in commentaries for the resize function. I was
> under the impression this should be enough, but after some more thinking
> I'm not so sure anymore. Let me try to structure it as a list of
> possible corner cases that we need to worry about:
>
> * New backend spawned while we're busy resizing shared memory. Those
> should wait until the resizing is complete and get the new size as well.
>
> * Old backend receives a resize message, but exits before attempting to
> resize. Those should be excluded from coordination.

Should we detach the barrier in on_exit()?

> * A backend is blocked and not responding before or after the
> ProcSignalBarrier message was sent. I'm thinking about a failure
> situation, when one rogue backend is doing something without checking
> for interrupts. We need to wait for those to become responsive, and
> potentially abort shared memory resize after some timeout.

Right.

> I think a relatively elegant solution is to extend ProcSignalBarrier
> mechanism to track not only pss_barrierGeneration, as a sign that
> everything was processed, but also something like
> pss_barrierReceivedGeneration, indicating that the message was received
> everywhere but not processed yet. That would be enough to allow
> processes to wait until the resize message was received everywhere, then
> use a global Barrier to wait until all processes are finished. It's
> somehow similar to your proposal to use two signals, but has less
> implementation overhead.

The way it's implemented in v4 still has the disjoint group problem. Assume
backends p1, p2, p3. All three of them are executing
ProcessProcSignalBarrier(). All three of them updated
pss_barrierReceivedGeneration:

/* The message is observed, record that */
pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
                    shared_gen);

p1 and p2 moved faster and reached the following code from
ProcessBarrierShmemResize():

if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
    WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));

Since all the processes have received the barrier message, p1 and p2 move
ahead, go through all the next phases, and finish resizing even before p3 gets
a chance to call ProcessBarrierShmemResize() and attach itself to the Barrier.
This could happen because p3 processed some other ProcSignalBarrier message.
p1 and p2 won't wait for p3 since it has not attached itself to the barrier.
Once p1 and p2 finish, p3 will attach itself to the barrier and resize buffers
again - reinitializing the shared memory, which might have already been
modified by p1 or p2. Boom - there's memory corruption.

Either every process has to make sure that all the other extant backends have
attached themselves to the barrier, OR somebody has to ensure that and signal
all the backends to proceed. The implementation does neither.

> * Shared memory address space is now reserved for future usage, making
> shared memory segments clash (e.g. due to memory allocation)
> impossible. There is a new GUC to control how much space to reserve,
> which is called max_available_memory -- on the assumption that most of
> the time it would make sense to set its value to the total amount of
> memory on the machine. I'm open for suggestions regarding the name.

With 0006 applied:

+ /* Clean up some reserved space to resize into */
+ if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
... snip ...
+ ptr = mremap(m->shmem, m->shmem_size, new_size, 0);

We unmap the portion of reserved address space where the existing segment
would expand into. As long as we are just expanding, this will work. I am
wondering how this would work for shrinking buffers? What scheme do you have
in mind?

> * There is one more patch to address hugepages remap. As mentioned in
> this thread above, Linux kernel has certain limitations when it comes
> to mremap for segments allocated with huge pages. To work around it's
> possible to replace mremap with a sequence of unmap and map again,
> relying on the anon file behind the segment to keep the memory
> content. I haven't found any downsides of this approach so far, but it
> makes the anonymous file patch 0007 mandatory.

In 0008:

if (munmap(m->shmem, m->shmem_size) < 0)
... snip ...
/* Resize the backing anon file. */
if (ftruncate(m->segment_fd, new_size) == -1)
...
/* Reclaim the space */
ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
           mmap_flags | MAP_FIXED, m->segment_fd, 0);

How are we preventing something from getting mapped into the space after
m->shmem + new_size? We will need to add an unallocated but reserved address
space mapping after m->shmem + new_size, right?

--
Best Wishes,
Ashutosh Bapat
> On Fri, Apr 11, 2025 at 08:04:39PM GMT, Ashutosh Bapat wrote:
> On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> > * Old backend receives a resize message, but exits before attempting to
> > resize. Those should be excluded from coordination.
>
> Should we detach barrier in on_exit()?

Yeah, good point.

> > I think a relatively elegant solution is to extend ProcSignalBarrier
> > mechanism to track not only pss_barrierGeneration, as a sign that
> > everything was processed, but also something like
> > pss_barrierReceivedGeneration, indicating that the message was received
> > everywhere but not processed yet. That would be enough to allow
> > processes to wait until the resize message was received everywhere, then
> > use a global Barrier to wait until all processes are finished. It's
> > somehow similar to your proposal to use two signals, but has less
> > implementation overhead.
>
> The way it's implemented in v4 still has the disjoint group problem.
> Assume backends p1, p2, p3. All three of them are executing
> ProcessProcSignalBarrier(). [...] Once p1, p2 finish, p3 will attach
> itself to the barrier and resize buffers again - reinitializing the
> shared memory, which might has been already modified by p1 or p2. Boom
> - there's memory corruption.

It won't reinitialize anything, since this logic is controlled by
ShmemCtrl->NSharedBuffers; if it's already updated, nothing will be changed.

About the race condition you mention: there is indeed a window between
receiving the ProcSignalBarrier and attaching to the global Barrier in resize,
but I don't think any process will be able to touch the buffer pool while
inside this window. Even if it happens that the remapping itself was so
blazing fast that this window was enough to make one process late (e.g. if it
was busy handling some other signal, as you mention), as I've shown above it
shouldn't be a problem.

I can experiment with this case though; maybe there is a way to completely
close this window, so as to not think about even potential scenarios.

> > * Shared memory address space is now reserved for future usage, making
> > shared memory segments clash (e.g. due to memory allocation)
> > impossible. There is a new GUC to control how much space to reserve,
> > which is called max_available_memory -- on the assumption that most of
> > the time it would make sense to set its value to the total amount of
> > memory on the machine. I'm open for suggestions regarding the name.
>
> With 0006 applied
> + /* Clean up some reserved space to resize into */
> + if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
> ... snip ...
> + ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
>
> We unmap the portion of reserved address space where the existing
> segment would expand into. As long as we are just expanding this will
> work. I am wondering how would this work for shrinking buffers? What
> scheme do you have in mind?

I didn't like this part originally, and after the changes to support hugetlb I
think it's worth it to just replace mremap with munmap/mmap. That way there
will be no such question, e.g. if a segment is getting shrunk the unmapped
area will again become part of the reserved space.

> > * There is one more patch to address hugepages remap. As mentioned in
> > this thread above, Linux kernel has certain limitations when it comes
> > to mremap for segments allocated with huge pages. To work around it's
> > possible to replace mremap with a sequence of unmap and map again,
> > relying on the anon file behind the segment to keep the memory
> > content. I haven't found any downsides of this approach so far, but it
> > makes the anonymous file patch 0007 mandatory.
>
> In 0008
> if (munmap(m->shmem, m->shmem_size) < 0)
> ... snip ...
> /* Resize the backing anon file. */
> if (ftruncate(m->segment_fd, new_size) == -1)
> ...
> /* Reclaim the space */
> ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
>            mmap_flags | MAP_FIXED, m->segment_fd, 0);
>
> How are we preventing something get mapped into the space after
> m->shmem + newsize? We will need to add an unallocated but reserved
> addressed space map after m->shmem+newsize right?

Nope, the segment is allocated from the reserved space already, with some
chunk of it left after the segment's end for resizing purposes. We only take
some part of the designated space; the rest is still reserved.
On Fri, Apr 11, 2025 at 8:31 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > > I think a relatively elegant solution is to extend ProcSignalBarrier
> > > mechanism to track not only pss_barrierGeneration, as a sign that
> > > everything was processed, but also something like
> > > pss_barrierReceivedGeneration, indicating that the message was received
> > > everywhere but not processed yet. [...]
> >
> > The way it's implemented in v4 still has the disjoint group problem.
> > [...] Once p1, p2 finish, p3 will attach
> > itself to the barrier and resize buffers again - reinitializing the
> > shared memory, which might has been already modified by p1 or p2. Boom
> > - there's memory corruption.
>
> It won't reinitialize anything, since this logic is controlled by the
> ShmemCtrl->NSharedBuffers, if it's already updated nothing will be
> changed.

Ah, I see it now:

if (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
{

Thanks for the clarification.

However, when we put back the patches to shrink buffers, we will evict the
extra buffers and shrink - if all the processes haven't participated in the
barrier by then, some of them may try to access those buffers - re-installing
them, and then bad things can happen.

> About the race condition you mention, there is indeed a window between
> receiving the ProcSignalBarrier and attaching to the global Barrier in
> resize, but I don't think any process will be able to touch buffer pool
> while inside this window. Even if it happens that the remapping itself
> was blazing fast that this window was enough to make one process late
> (e.g. if it was busy handling some other signal as you mention), as I've
> showed above it shouldn't be a problem.
>
> I can experiment with this case though, maybe there is a way to
> completely close this window to not thing about even potential
> scenarios.

The window may be small today, but we have to make this future proof. Multiple
ProcSignalBarrier messages may be processed in a single call to
ProcessProcSignalBarrier(), and if each of those takes as long as buffer
resizing, the window will get bigger and bigger. So we have to close this
window.

> > > * Shared memory address space is now reserved for future usage, making
> > > shared memory segments clash (e.g. due to memory allocation)
> > > impossible. There is a new GUC to control how much space to reserve,
> > > which is called max_available_memory -- on the assumption that most of
> > > the time it would make sense to set its value to the total amount of
> > > memory on the machine. I'm open for suggestions regarding the name.
> >
> > With 0006 applied
> > + /* Clean up some reserved space to resize into */
> > + if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
> > ... snip ...
> > + ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
> >
> > We unmap the portion of reserved address space where the existing
> > segment would expand into. As long as we are just expanding this will
> > work. I am wondering how would this work for shrinking buffers? What
> > scheme do you have in mind?
>
> I didn't like this part originally, and after changes to support hugetlb
> I think it's worth it just to replace mremap with munmap/mmap. That way
> there will be no such question, e.g. if a segment is getting shrinked
> the unmapped area will again become a part of the reserved space.

I might have not noticed it, but are we putting two mappings, one reserved and
one allocated, in the same address space, so that when the allocated mapping
shrinks or expands, the reserved mapping continues to prohibit any other
mapping from appearing there? I looked at some of the previous emails, but
didn't find anything that describes how the reserved mapped space is managed.

--
Best Wishes,
Ashutosh Bapat
> On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote:
>
> However, when we put back the patches to shrink buffers, we will evict
> the extra buffers, and shrink - if all the processes haven't
> participated in the barrier by then, some of them may try to access
> those buffers - re-installing them and then bad things can happen.

As I've mentioned above, I don't see how a process could try to access a
buffer if it's on the path between receiving the ProcSignalBarrier and
attaching to the global shmem Barrier, even if we shrink buffers. AFAICT
interrupt handlers should not touch buffers, and otherwise the process doesn't
have any point within this window where it might do so. Do you have some
particular scenario in mind?

> I might have not noticed it, but are we putting two mappings one
> reserved and one allocated in the same address space, so that when the
> allocated mapping shrinks or expands, the reserved mapping continues
> to prohibit any other mapping from appearing there? I looked at some
> of the previous emails, but didn't find anything that describes how
> the reserved mapped space is managed.

I thought so, but this turns out to be incorrect. I have just done a small
experiment -- it looks like when reserving some space, mapping and then
unmapping a small segment from it leaves a non-mapped gap. That would mean
that for shrinking, the newly available space has to be reserved again.
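[Editorial note: a hedged sketch of that experiment (assumed names and sizes): after shrinking a segment carved out of a PROT_NONE reservation, the freed range is truly unmapped, so it has to be re-reserved explicitly.]

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t reserved = 1UL * 1024 * 1024 * 1024;
    size_t old_size = 256UL * 1024 * 1024;
    size_t new_size = 128UL * 1024 * 1024;

    char *base = mmap(NULL, reserved, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) { perror("reserve"); return 1; }

    char *seg = mmap(base, old_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (seg == MAP_FAILED) { perror("segment"); return 1; }

    /* Shrink the segment: this leaves [new_size, old_size) truly
     * unmapped, not reserved -- anything could be mapped there now. */
    if (munmap(seg + new_size, old_size - new_size) == -1)
    { perror("munmap"); return 1; }

    /* So re-reserve the gap with PROT_NONE to keep the range protected. */
    if (mmap(seg + new_size, old_size - new_size, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED,
             -1, 0) == MAP_FAILED)
    { perror("re-reserve"); return 1; }

    puts("gap re-reserved");
    return 0;
}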
On Mon, Apr 14, 2025 at 12:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote: > > > > However, when we put back the patches to shrink buffers, we will evict > > the extra buffers and shrink; if all the processes haven't > > participated in the barrier by then, some of them may try to access > > those buffers, re-installing them, and then bad things can happen. > > As I've mentioned above, I don't see how a process could try to access a > buffer if it's on the path between receiving the ProcSignalBarrier and > attaching to the global shmem Barrier, even if we shrink buffers. > AFAICT interrupt handlers should not touch buffers, and otherwise the > process doesn't have any point within this window where it might do > this. Do you have some particular scenario in mind? ProcessProcSignalBarrier() is not within an interrupt handler, but it responds to a flag set by an interrupt handler. After calling pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration, shared_gen); it will enter the loop while (flags != 0), where it may process many barriers before processing PROCSIGNAL_BARRIER_SHMEM_RESIZE. Nothing stops the other barrier processing code from touching buffers. Right now it's just smgrrelease that gets called in the other barrier, but that's not guaranteed in the future. > > > I might not have noticed it, but are we putting two mappings, one > > reserved and one allocated, in the same address space, so that when the > > allocated mapping shrinks or expands, the reserved mapping continues > > to prohibit any other mapping from appearing there? I looked at some > > of the previous emails, but didn't find anything that describes how > > the reserved mapped space is managed. > > I thought so, but this turns out to be incorrect. I've just done a small > experiment -- it looks like after reserving some space, mapping and then > unmapping a small segment from it leaves a non-mapped gap. That would > mean that for shrinking, the newly available space has to be reserved again. Right. That's what I thought, but I didn't see the corresponding code. So we have to keep track of two mappings for every segment, one for allocation and one for reserving space, and resize those two while shrinking and expanding buffers. Am I correct? -- Best Wishes, Ashutosh Bapat
Hi Dmitry, On Mon, Apr 14, 2025 at 12:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote: > > > > However, when we put back the patches to shrink buffers, we will evict > > the extra buffers and shrink; if all the processes haven't > > participated in the barrier by then, some of them may try to access > > those buffers, re-installing them, and then bad things can happen. > > As I've mentioned above, I don't see how a process could try to access a > buffer if it's on the path between receiving the ProcSignalBarrier and > attaching to the global shmem Barrier, even if we shrink buffers. > AFAICT interrupt handlers should not touch buffers, and otherwise the > process doesn't have any point within this window where it might do > this. Do you have some particular scenario in mind? > > > I might not have noticed it, but are we putting two mappings, one > > reserved and one allocated, in the same address space, so that when the > > allocated mapping shrinks or expands, the reserved mapping continues > > to prohibit any other mapping from appearing there? I looked at some > > of the previous emails, but didn't find anything that describes how > > the reserved mapped space is managed. > > I thought so, but this turns out to be incorrect. I've just done a small > experiment -- it looks like after reserving some space, mapping and then > unmapping a small segment from it leaves a non-mapped gap. That would > mean that for shrinking, the newly available space has to be reserved again. In an offlist chat Thomas Munro mentioned that just ftruncate() would be enough to resize the shared memory without touching address maps using mmap and munmap(). The ftruncate man page [1] seems to concur with him: If the effect of ftruncate() is to decrease the size of a memory mapped file or a shared memory object and whole pages beyond the new end were previously mapped, then the whole pages beyond the new end shall be discarded. References to discarded pages shall result in the generation of a SIGBUS signal. If the effect of ftruncate() is to increase the size of a memory object, it is unspecified whether the contents of any mapped pages between the old end-of-file and the new are flushed to the underlying object. When shrinking memory, ftruncate() will release the extra pages, and accessing memory beyond the file size causes a SIGBUS even if the actual address map is larger than the mapped file. The expanded memory is allocated as it is written to, and those pages also become visible in the underlying object. I played with the attached small program under a debugger, observing pmap and /proc/<pid>/status after every memory operation. The address map always shows the full 300MB:

00007fffd2200000 307200K rw-s- memfd:mmap_fd_exp (deleted)

Immediately after mmap(): RssShmem: 0 kB
After the first memset: RssShmem: 307200 kB
After ftruncate to 100MB (no memset needed to see the effect on RssShmem): RssShmem: 102400 kB
After ftruncate to 200MB (requires memset to see the effect on RssShmem): RssShmem: 102400 kB
After memsetting up to 200MB: RssShmem: 204800 kB

All the observations concur with the man page. [1] https://man7.org/linux/man-pages/man3/ftruncate.3p.html#:~:text=If%20the%20effect%20of%20ftruncate,generation%20of%20a%20SIGBUS%20signal. -- Best Wishes, Ashutosh Bapat
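A minimal reconstruction of that experiment (this is not the actual attachment; error handling is omitted and names are illustrative), which can be stepped through under a debugger while watching pmap and /proc/<pid>/status:

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define MB (1024 * 1024UL)

int
main(void)
{
	size_t		size = 300 * MB;
	int			fd = memfd_create("mmap_fd_exp", MFD_CLOEXEC);
	char	   *p;

	ftruncate(fd, size);
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	memset(p, 1, size);			/* RssShmem: 307200 kB */
	ftruncate(fd, 100 * MB);	/* RssShmem drops to 102400 kB immediately */
	ftruncate(fd, 200 * MB);	/* RssShmem unchanged until pages are touched */
	memset(p, 1, 200 * MB);		/* RssShmem: 204800 kB */

	pause();					/* inspect pmap and /proc/<pid>/status here */
	return 0;
}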
On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote: TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via changing shared memory mapping layout. Any feedback is appreciated.
Hi Dmitry,
I am sorry that I have not participated in the discussion in this thread from the very beginning, although I am also very interested in dynamic shared buffer resizing and even proposed my own implementation of it: https://github.com/knizhnik/postgres/pull/2 based on memory ballooning and using `madvise`. And it really works (returns unused memory to the system).
This PoC allowed me to understand the main drawbacks of this approach:
1. Performance of the Postgres CLOCK page eviction algorithm depends on the number of shared buffers. My first naive attempt, just marking unused buffers as invalid, caused significant performance degradation with
pgbench -c 32 -j 4 -T 100 -P1 -M prepared -S
(here shared_buffers is the maximal shared buffers size and `available_buffers` is the used part):
| shared_buffers | available_buffers | TPS  |
| -------------- | ----------------- | ---- |
| 128MB          | -1                | 280k |
| 1GB            | -1                | 324k |
| 2GB            | -1                | 358k |
| 32GB           | -1                | 350k |
| 2GB            | 128Mb             | 130k |
| 2GB            | 1Gb               | 311k |
| 32GB           | 128Mb             | 13k  |
| 32GB           | 1Gb               | 140k |
| 32GB           | 2Gb               | 348k |
My first thought was to replace the clock with an LRU based on a double-linked list. Since there is no lockless double-linked-list implementation, it needs some global lock, and this lock can become a bottleneck. The standard solution is partitioning: use N LRU lists instead of 1, just like the partitioned hash table used by the buffer manager to look up buffers. Actually, we can use the same partition locks to protect the LRU lists. But it is not clear what to do with ring buffers (strategies). So I decided not to perform such a revolution in bufmgr, but to optimize the clock to skip reserved buffers more efficiently: just add a skip_count field to the buffer descriptor (see the sketch after this list). And it helps! Now the worst case shared_buffers/available_buffers = 32GB/128MB shows the same performance, 280k TPS, as shared_buffers=128MB without ballooning.
2. There are several data structures in Postgres whose size depends on the number of buffers. In my patch I used a dynamic shared buffer size in some cases, but if a structure has to be allocated in shared memory then the maximal size still has to be used. We have the buffers themselves (8 kB per buffer), then the main BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's CkptBufferIds (20 B), and the hashmap on the buffer cache (24 B + 8 B/entry). 128 bytes per 8 kB buffer may not seem like a large overhead (~1%), but it may be quite noticeable with size differences larger than 2 orders of magnitude: e.g. to support scaling from 0.5 GB to 128 GB, with 128 bytes/buffer we'd have ~2 GiB of static overhead on only 0.5 GiB of actual buffers.
3. `madvise` is not portable.
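A hypothetical sketch of the skip_count idea from point 1 above (simplified, not the actual patch; assumes at least one available buffer):

#include <stdint.h>

typedef struct BufferDescSketch
{
	int			buf_id;
	uint32_t	skip_count;		/* 0 for a normal buffer; for the first buffer
								 * of a ballooned run, the length of that run */
} BufferDescSketch;

/* clock sweep that leaps over a ballooned run in one step */
static int
clock_sweep_next(BufferDescSketch *descs, int nbuffers, int *hand)
{
	for (;;)
	{
		BufferDescSketch *buf = &descs[*hand];
		int			victim;

		if (buf->skip_count > 0)
		{
			/* jump past the whole unavailable run at once */
			*hand = (int) ((*hand + buf->skip_count) % nbuffers);
			continue;
		}

		victim = *hand;
		*hand = (*hand + 1) % nbuffers;
		/* the real algorithm would check usage_count and pins here */
		return victim;
	}
}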
Certainly you have moved much further in your proposal compared with my PoC (including huge pages support). But it is still not quite clear to me how you are going to solve the problem of large memory overhead in the case of ~100x variation of shared buffers size.
> On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote: > > In an offlist chat Thomas Munro mentioned that just ftruncate() would > be enough to resize the shared memory without touching address maps > using mmap and munmap(). > > The ftruncate man page seems to concur with him: > > If the effect of ftruncate() is to decrease the size of a memory > mapped file or a shared memory object and whole pages beyond the > new end were previously mapped, then the whole pages beyond the > new end shall be discarded. > > References to discarded pages shall result in the generation of a > SIGBUS signal. > > If the effect of ftruncate() is to increase the size of a memory > object, it is unspecified whether the contents of any mapped pages > between the old end-of-file and the new are flushed to the > underlying object. > > When shrinking memory, ftruncate() will release the extra pages, and > accessing memory beyond the file size causes a SIGBUS even if the > actual address map is larger than the mapped file. The expanded memory > is allocated as it is written to, and those pages also become visible > in the underlying object. Thanks for sharing. I need to do more thorough tests, but after a quick look I'm not sure about that. ftruncate will take care of the memory, but AFAICT the memory mapping will stay the same, is that what you mean? In that case if the segment got increased, the memory still can't be used because it's beyond the mapping end (at least in my test that's what happened). If the segment got shrunk, the memory couldn't be reclaimed, because, well, there is already a mapping. Or do I miss something? > > > I might not have noticed it, but are we putting two mappings, one > > > reserved and one allocated, in the same address space, so that when the > > > allocated mapping shrinks or expands, the reserved mapping continues > > > to prohibit any other mapping from appearing there? I looked at some > > > of the previous emails, but didn't find anything that describes how > > > the reserved mapped space is managed. > > > > I thought so, but this turns out to be incorrect. I've just done a small > > experiment -- it looks like after reserving some space, mapping and then > > unmapping a small segment from it leaves a non-mapped gap. That would > > mean that for shrinking, the newly available space has to be reserved again. > > Right. That's what I thought, but I didn't see the corresponding code. > So we have to keep track of two mappings for every segment, one for > allocation and one for reserving space, and resize those two while > shrinking and expanding buffers. Am I correct? Not necessarily, depending on what we want. Again, I'll do a bit more testing, but after a quick check it seems that it's possible to "plug" the gap with a new reservation mapping, then reallocate it to another mapping or unmap both reservations (main and the "gap" one) at once. That would mean that for the current functionality we don't need to track the reservation in any way beyond just the start and end of the "main" reserved space. The only consequence I can imagine is possible fragmentation of the reserved space in case of frequent increases/decreases of a segment with ever-decreasing size. But since it's only reserved space, which will not really be used, it's probably not going to be a problem.
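A sketch of how that might look (illustrative names, error handling elided): segments are carved out of one large PROT_NONE reservation, and on shrink the hole is immediately re-reserved so nothing else can land there:

#include <stddef.h>
#include <sys/mman.h>

/* reserve address space only; no backing memory is committed */
static void *
reserve_space(size_t size)
{
	return mmap(NULL, size, PROT_NONE,
				MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/* carve a real segment out of the reservation at a fixed address */
static void *
carve_segment(void *addr, size_t size, int fd)
{
	return mmap(addr, size, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_FIXED, fd, 0);
}

/* shrink a segment; plain munmap would leave an unreserved gap */
static int
shrink_segment(void *addr, size_t old_size, size_t new_size)
{
	char	   *hole = (char *) addr + new_size;
	size_t		hole_size = old_size - new_size;

	if (munmap(hole, hole_size) < 0)
		return -1;
	/* plug the gap with a fresh reservation */
	if (mmap(hole, hole_size, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED,
			 -1, 0) == MAP_FAILED)
		return -1;
	return 0;
}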
> On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote: > > 1. Performance of the Postgres CLOCK page eviction algorithm depends on the number > of shared buffers. My first naive attempt, just marking unused buffers as > invalid, caused significant performance degradation Thanks for sharing! Right, but it concerns the case when the number of shared buffers is high, independently of whether it was changed online or with a restart, correct? In that case it's out of scope for this patch. > 2. There are several data structures in Postgres whose size depends on the number > of buffers. > In my patch I used a dynamic shared buffer size in some cases, but if a > structure has to be allocated in shared memory then the maximal size still has > to be used. We have the buffers themselves (8 kB per buffer), then the main > BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's > CkptBufferIds (20 B), and the hashmap on the buffer cache (24 B + 8 B/entry). > 128 bytes per 8 kB buffer may not seem like a large overhead (~1%), but it may be > quite noticeable with size differences larger than 2 orders of magnitude: > e.g. to support scaling from 0.5 GB to 128 GB, with 128 bytes/buffer we'd > have ~2 GiB of static overhead on only 0.5 GiB of actual buffers. Not sure what you mean by using a maximal size, can you elaborate? In the current patch those structures are allocated as before, except each goes into a separate segment -- without any extra memory overhead as far as I see. > 3. `madvise` is not portable. The current implementation doesn't rely on madvise so far (it might for shared memory shrinking), but yeah, there are plenty of other not very portable things (MAP_FIXED, memfd_create). All of that is mentioned in the corresponding patches as a limitation.
I also have a related question about how ftruncate() is used in the patch.
In my testing I also see that when using ftruncate to shrink a shared segment, the memory is freed immediately after the call, even if other processes still have that memory mapped, and they will hit SIGBUS if they try to access that memory again, as the man page says.
So am I correct to think that, to support the buffer pool shrinking case, it would not be safe to call ftruncate in AnonymousShmemResize as-is, since at that point other processes may still be using pages that belong to the truncated memory?
It appears that for shrinking we should only call ftruncate when we're sure no process will access those pages again (e.g., all processes have handled the resize interrupt signal barrier). I suppose this can be done by the resize coordinator after synchronizing with all the other processes.
But in that case it seems we cannot use the postmaster as the coordinator, because I see some code comments saying the postmaster does not have waiting infrastructure... (maybe even if the postmaster had waiting infrastructure we wouldn't want to use it anyway, since it can be blocked for a long time and wouldn't be able to serve other requests).
Regards,
Jack Ng
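For what it's worth, the ordering Jack describes might look roughly like this on the coordinator side (a sketch only; EvictExtraBuffers, shmem_fd and BufferRegionSize are hypothetical names, while the ProcSignal calls are the existing API):

static void
CoordinateShmemShrink(int newNBuffers, int oldNBuffers)
{
	uint64		generation;

	/* 1. evict buffers that fall beyond the new size (hypothetical helper) */
	EvictExtraBuffers(newNBuffers, oldNBuffers);

	/* 2. wait until every backend has agreed to stop using those buffers */
	generation = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
	WaitForProcSignalBarrier(generation);

	/* 3. only now discard the pages; no backend can fault them back in */
	if (ftruncate(shmem_fd, BufferRegionSize(newNBuffers)) < 0)
		elog(FATAL, "could not shrink shared memory segment: %m");
}

Step 2 is exactly what the postmaster cannot do today, which is why a non-postmaster coordinator (or a dedicated worker) seems required for the shrinking case.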
On 18/04/2025 12:26 am, Dmitry Dolgov wrote: >> On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote: >> >> 1. Performance of the Postgres CLOCK page eviction algorithm depends on the number >> of shared buffers. My first naive attempt, just marking unused buffers as >> invalid, caused significant performance degradation > Thanks for sharing! > > Right, but it concerns the case when the number of shared buffers is > high, independently of whether it was changed online or with a > restart, correct? In that case it's out of scope for this patch. > >> 2. There are several data structures in Postgres whose size depends on the number >> of buffers. >> In my patch I used a dynamic shared buffer size in some cases, but if a >> structure has to be allocated in shared memory then the maximal size still has >> to be used. We have the buffers themselves (8 kB per buffer), then the main >> BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's >> CkptBufferIds (20 B), and the hashmap on the buffer cache (24 B + 8 B/entry). >> 128 bytes per 8 kB buffer may not seem like a large overhead (~1%), but it may be >> quite noticeable with size differences larger than 2 orders of magnitude: >> e.g. to support scaling from 0.5 GB to 128 GB, with 128 bytes/buffer we'd >> have ~2 GiB of static overhead on only 0.5 GiB of actual buffers. > Not sure what you mean by using a maximal size, can you elaborate? > > In the current patch those structures are allocated as before, except > each goes into a separate segment -- without any extra memory overhead > as far as I see. Thank you for the explanation. I am sorry that I did not investigate your patch precisely before writing: it seemed to me that you were placing only the content of shared buffers in a separate segment. Now I see that I was wrong, and this is actually the main difference from the memory ballooning approach I have used. Since you are allocating buffer descriptors and the hash table in the same segment, there is no extra memory overhead. The only drawback is that we lose the content of shared buffers in case of resize. It may be sad, but it doesn't look like there is a better alternative. But there are still some dependencies on shared buffers size which are not addressed in this PR. I am not sure how critical they are or whether it is possible to do something here, but at least I want to enumerate them:

1. Checkpointer: the maximal number of checkpointer requests depends on NBuffers, so if we start with small shared buffers and then upscale, it may cause too frequent checkpoints:

Size CheckpointerShmemSize(void)
...
size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));

CheckpointerShmemInit(void)
CheckpointerShmem->max_requests = NBuffers;

2. XLOG: the number of xlog buffers is calculated from the number of shared buffers:

XLOGChooseNumBuffers(void)
{
...
xbuffers = NBuffers / 32;

This should not cause errors, but may not be so efficient if, once again, we start with tiny shared buffers.

3. AIO: AIO max concurrency is also calculated based on the number of shared buffers:

AioChooseMaxConcurrency(void)
{
...
max_proportional_pins = NBuffers / max_backends;

For small shared buffers (e.g. 1MB) there will be no concurrency at all. So none of these issues causes an error, just some inefficient behavior. But if we want to start with very small shared buffers and then increase them on demand, it can be a problem.
In all three of these cases NBuffers is used not just to calculate some threshold value, but also to determine the size of a structure in shared memory. The straightforward solution is to place them in the same segment as shared buffers, but I am not sure how difficult that will be to implement.
On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote: > > > > In an offlist chat Thomas Munro mentioned that just ftruncate() would > > be enough to resize the shared memory without touching address maps > > using mmap and munmap(). > > > > The ftruncate man page seems to concur with him: > > > > If the effect of ftruncate() is to decrease the size of a memory > > mapped file or a shared memory object and whole pages beyond the > > new end were previously mapped, then the whole pages beyond the > > new end shall be discarded. > > > > References to discarded pages shall result in the generation of a > > SIGBUS signal. > > > > If the effect of ftruncate() is to increase the size of a memory > > object, it is unspecified whether the contents of any mapped pages > > between the old end-of-file and the new are flushed to the > > underlying object. > > > > When shrinking memory, ftruncate() will release the extra pages, and > > accessing memory beyond the file size causes a SIGBUS even if the > > actual address map is larger than the mapped file. The expanded memory > > is allocated as it is written to, and those pages also become visible > > in the underlying object. > > Thanks for sharing. I need to do more thorough tests, but after a quick > look I'm not sure about that. ftruncate will take care of the memory, > but AFAICT the memory mapping will stay the same, is that what you mean? > In that case if the segment got increased, the memory still can't be > used because it's beyond the mapping end (at least in my test that's > what happened). If the segment got shrunk, the memory couldn't be > reclaimed, because, well, there is already a mapping. Or do I miss > something? I was imagining that you might map some maximum possible size at the beginning to reserve the address space permanently, and then adjust the virtual memory object's size with ftruncate as required to provide backing. Doesn't that achieve the goal with fewer steps, using only portable* POSIX stuff, and keeping all pointers stable? I understand that pointer stability may not be required (I can see roughly how that argument is constructed), but isn't it still better to avoid having to prove that and deal with various other problems completely? Is there a downside/cost to having a large mapping that is only partially backed? I suppose choosing that number might offend you, but at least there is an obvious upper bound: physical memory size. *You might also want to use fallocate after ftruncate on Linux to avoid SIGBUS on allocation failure on first-touch page fault, which raises portability questions since it's unspecified whether you can do that with shm fds and fails on some systems, but let's call that an independent topic as it's not affected by this choice.
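For regular pages, that scheme might look like this (a sketch, not the patch; memfd_create and fallocate are Linux-specific, and the names are illustrative):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

static int	shmem_fd;
static char *shmem_base;

/* map the maximum size once; pointers stay stable forever */
static int
shmem_init(size_t max_size, size_t initial_size)
{
	shmem_fd = memfd_create("pg_shmem_sketch", MFD_CLOEXEC);
	if (shmem_fd < 0 || ftruncate(shmem_fd, initial_size) < 0)
		return -1;
	/* fallocate guards against SIGBUS on first-touch allocation failure */
	if (fallocate(shmem_fd, 0, 0, initial_size) < 0)
		return -1;
	shmem_base = mmap(NULL, max_size, PROT_READ | PROT_WRITE,
					  MAP_SHARED, shmem_fd, 0);
	return shmem_base == MAP_FAILED ? -1 : 0;
}

/* grow or shrink the backing object; the mapping itself never moves */
static int
shmem_resize(size_t new_size)
{
	if (ftruncate(shmem_fd, new_size) < 0)
		return -1;
	return fallocate(shmem_fd, 0, 0, new_size);
}

Touching pages beyond the object's current size raises SIGBUS, which is what enforces the logical boundary without any remapping.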
Hi, On April 18, 2025 11:17:21 AM GMT+02:00, Thomas Munro <thomas.munro@gmail.com> wrote: > Doesn't that achieve the goal with fewer steps, using only > portable* POSIX stuff, and keeping all pointers stable? I understand > that pointer stability may not be required (I can see roughly how that > argument is constructed), but isn't it still better to avoid having to > prove that and deal with various other problems completely? I think we should flat out reject any approach that does not maintain pointer stability. It would restrict future optimizations a lot if we can't rely on that (e.g. not materializing tuples when transporting them from worker to leader; pointer-based data structures in shared buffers). Greetings, Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Apr 18, 2025 at 9:17 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Thanks for sharing. I need to do more thorough tests, but after a quick > > look I'm not sure about that. ftruncate will take care of the memory, > > but AFAICT the memory mapping will stay the same, is that what you mean? > > In that case if the segment got increased, the memory still can't be > > used because it's beyond the mapping end (at least in my test that's > > what happened). If the segment got shrunk, the memory couldn't be > > reclaimed, because, well, there is already a mapping. Or do I miss > > something? > > I was imagining that you might map some maximum possible size at the > beginning to reserve the address space permanently, and then adjust > the virtual memory object's size with ftruncate as required to provide > backing. Doesn't that achieve the goal with fewer steps, using only > portable* POSIX stuff, and keeping all pointers stable? I understand > that pointer stability may not be required (I can see roughly how that > argument is constructed), but isn't it still better to avoid having to > prove that and deal with various other problems completely? Is there > a downside/cost to having a large mapping that is only partially > backed? I suppose choosing that number might offend you, but at least > there is an obvious upper bound: physical memory size. TIL that mmap(size, fd) will actually extend a hugetlb memfd as a side effect on Linux, as if you had called ftruncate on it (I expected fully allocated huge pages up to the object's size, just not magical size changes beyond that when I merely asked to map it). That doesn't happen for regular page size, or for any page size on my local OS's shm objects, and doesn't seem to fit mmap's job description given an fd*, but maybe I'm just confused. Anyway, a workaround seems to be to start out with PROT_NONE and MAP_NORESERVE, then mprotect(PROT_READ | PROT_WRITE) new regions after extending with ftruncate(), at least in simple tests... (*Hmm, wild uninformed speculation: perhaps the size-setting behaviour needed when hugetlbfs is used secretly to implement MAP_ANONYMOUS is being exposed also when a hugetlbfs fd is given explicitly to mmap, generating this bizarro side effect?)
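In code, the workaround might amount to only a couple of changes relative to the sketch upthread (again illustrative; it assumes mprotect after ftruncate behaves as in these simple tests):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/mman.h>

static int	fd;
static char *base;

static int
shmem_reserve_huge(size_t max_size)
{
	/* PROT_NONE + MAP_NORESERVE: nothing is backed yet, and mmap cannot
	 * silently extend the hugetlb memfd as described above */
	fd = memfd_create("pg_shmem_huge", MFD_CLOEXEC | MFD_HUGETLB);
	base = mmap(NULL, max_size, PROT_NONE, MAP_SHARED | MAP_NORESERVE, fd, 0);
	return base == MAP_FAILED ? -1 : 0;
}

static int
shmem_grow_huge(size_t new_size)
{
	/* extend the object first, then open up the newly backed region */
	if (ftruncate(fd, new_size) < 0)
		return -1;
	return mprotect(base, new_size, PROT_READ | PROT_WRITE);
}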
> On Fri, Apr 18, 2025 at 09:17:21PM GMT, Thomas Munro wrote: > I was imagining that you might map some maximum possible size at the > beginning to reserve the address space permanently, and then adjust > the virtual memory object's size with ftruncate as required to provide > backing. Doesn't that achieve the goal with fewer steps, using only > portable* POSIX stuff, and keeping all pointers stable? Ah, I see what you folks mean. So in the latest patch there is a single large shared memory area reserved with PROT_NONE + MAP_NORESERVE. This area is logically divided between shmem segments, and each segment is mmap'd out of it and can be resized within these logical boundaries. Now the suggestion is to have one reserved area for each segment, and instead of really mmap'ing something out of it, manage the memory via ftruncate. Yeah, that would work and will allow us to avoid MAP_FIXED and mremap, which are questionable from a portability point of view. This leaves memfd_create, and I'm still not completely clear on its portability -- it seems to be specific to Linux, but other systems provide compatible implementations as well. Let me experiment with this idea a bit, I would like to make sure there are no other limitations we might face. > I understand that pointer stability may not be required Just to clarify, the current patch maintains this property (stable pointers), which I also see as mandatory for any possible implementation. > *You might also want to use fallocate after ftruncate on Linux to > avoid SIGBUS on allocation failure on first-touch page fault, which > raises portability questions since it's unspecified whether you can do > that with shm fds and fails on some systems, but let's call that an > independent topic as it's not affected by this choice. I'm afraid it would be strictly necessary to do fallocate, otherwise we're back where we were before reservation accounting for huge pages in Linux (lots of people were facing unexpected SIGBUS when dealing with cgroups). > TIL that mmap(size, fd) will actually extend a hugetlb memfd as a side > effect on Linux, as if you had called ftruncate on it (I expected fully > allocated huge pages up to the object's size, just not magical size > changes beyond that when I merely asked to map it). That doesn't > happen for regular page size, or for any page size on my local OS's > shm objects, and doesn't seem to fit mmap's job description given an > fd*, but maybe I'm just confused. Anyway, a workaround seems to be > to start out with PROT_NONE and MAP_NORESERVE, then mprotect(PROT_READ > | PROT_WRITE) new regions after extending with ftruncate(), at least > in simple tests... Right, it's similar to the currently implemented space reservation, which also goes with PROT_NONE and MAP_NORESERVE. I assume it boils down to the way memory reservation accounting in Linux works.
> On Thu, Apr 17, 2025 at 07:05:36PM GMT, Ni Ku wrote: > I also have a related question about how ftruncate() is used in the patch. > In my testing I also see that when using ftruncate to shrink a shared > segment, the memory is freed immediately after the call, even if other > processes still have that memory mapped, and they will hit SIGBUS if they > try to access that memory again, as the man page says. > > So am I correct to think that, to support the buffer pool shrinking case, it > would not be safe to call ftruncate in AnonymousShmemResize as-is, since at > that point other processes may still be using pages that belong to the > truncated memory? > It appears that for shrinking we should only call ftruncate when we're sure > no process will access those pages again (e.g., all processes have handled > the resize interrupt signal barrier). I suppose this can be done by the > resize coordinator after synchronizing with all the other processes. > But in that case it seems we cannot use the postmaster as the coordinator, > because I see some code comments saying the postmaster does not have > waiting infrastructure... (maybe even if the postmaster had waiting > infrastructure we wouldn't want to use it anyway, since it can be blocked > for a long time and wouldn't be able to serve other requests) There is already a coordination infrastructure, implemented in patch 0006, which will take care of this and prevent access to the shared memory until everything is resized.
> On Fri, Apr 18, 2025 at 10:06:23AM GMT, Konstantin Knizhnik wrote: > The only drawback is that we lose the content of shared buffers in case > of resize. It may be sad, but it doesn't look like there is a better > alternative. No, why would we lose the content? If we do mremap, it will leave the content as it is. If we do munmap/mmap with an anonymous backing file, it will also keep the content in memory. The same goes for the other proposal about using ftruncate/fallocate only; both will leave the content untouched unless told to do otherwise. > But there are still some dependencies on shared buffers size which are not > addressed in this PR. > I am not sure how critical they are or whether it is possible to do something here, > but at least I want to enumerate them: Right, I'm aware of those (except the AIO one, which was added after the first version of the patch), and didn't address them yet for the same reason you've mentioned -- they're not hard errors, rather inefficiencies. But thanks for the reminder, I keep those in the back of my mind, and when the rest of the design has settled down, I'll try to address them as well.
On Mon, Apr 21, 2025 at 9:30 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Yeah, that would work and will allow us to avoid MAP_FIXED and mremap, which are > questionable from a portability point of view. This leaves memfd_create, and I'm > still not completely clear on its portability -- it seems to be specific to > Linux, but other systems provide compatible implementations as well. Something like this should work, roughly based on the DSM code, except here we don't really need the name so we unlink it immediately, at the slight risk of leaking it if the postmaster is killed between those lines (maybe someone should go and tell POSIX to support the special name SHM_ANON or some other way to avoid that; I can't see any portable workaround). Not tested/compiled, just a sketch:

#ifdef HAVE_MEMFD_CREATE
	/* Anonymous shared memory region. */
	fd = memfd_create("foo", MFD_CLOEXEC | huge_pages_flags);
#else
	/* Standard POSIX insists on a name, which we unlink immediately. */
	do
	{
		char		tmp[80];

		snprintf(tmp, sizeof(tmp), "PostgreSQL.%u",
				 pg_prng_uint32(&pg_global_prng_state));
		/* O_RDWR and a mode are required along with O_CREAT */
		fd = shm_open(tmp, O_RDWR | O_CREAT | O_EXCL, 0600);
		if (fd >= 0)
			shm_unlink(tmp);
	} while (fd < 0 && errno == EEXIST);
#endif

> Let me experiment with this idea a bit, I would like to make sure there are no > other limitations we might face. One thing I'm still wondering about is whether you really need all this multi-phase barrier stuff, or even need to stop other backends from running at all while doing the resize. I guess that's related to your remapping scheme, but supposing you find the simple ftruncate()-only approach to be good, my next question is: why isn't it enough to wait for all backends to agree to stop allocating new buffers in the range to be truncated, and then let them continue to run as normal? As far as they would be concerned, the in-progress downsize has already happened, though it could be reverted later if the eviction phase fails. Then the coordinator could start evicting buffers and truncating the shared memory object, which are phases/steps, sure, but it's not clear to me why they need other backends' help. It sounds like Windows might need a second ProcSignalBarrier poke in order to call VirtualUnlock() in every backend. That's based on that Usenet discussion I lobbed in here the other day; I haven't tried it myself or fully grokked why it works, and there could well be other ways, IDK. Assuming it's the right approach, between the first poke to make all backends accept the new lower size and the second poke to unlock the memory, I don't see why they need to wait. I suppose it would be the same ProcSignalBarrier, but behaving differently based on control variables. I suppose there could also be a third poke, if you want to consider the operation to be fully complete only once they have all actually done that unlock step, but it may also be OK not to worry about that, IDK. On the other hand, maybe it just feels less risky if you stop the whole world, or maybe you envisage parallelising the eviction work, or there is some correctness concern I haven't grokked yet, but what? > > *You might also want to use fallocate after ftruncate on Linux to > > avoid SIGBUS on allocation failure on first-touch page fault, which > > raises portability questions since it's unspecified whether you can do > > that with shm fds and fails on some systems, but let's call that an > > independent topic as it's not affected by this choice.
> > I'm afraid it would be strictly necessary to do fallocate, otherwise we're back where we were before reservation accounting for huge pages in Linux (lots of people were facing unexpected SIGBUS when dealing with cgroups). Yeah. FWIW here is where we decided to gate that on __linux__ while fixing that for DSM: https://www.postgresql.org/message-id/flat/CAEepm%3D0euOKPaYWz0-gFv9xfG%2B8ptAjhFjiQEX0CCJaYN--sDQ%40mail.gmail.com#c81b941d300f04d382472e6414cec1f4