Thread: Re: Changing shared_buffers without restart

Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> changing shared memory mapping layout. Any feedback is appreciated.

Hi,

Here is a new version of the patch, which contains a proposal about how to
coordinate shared memory resizing between backends. The rest is more or less
the same; feedback about the coordination part is especially appreciated. It's
a lot to read, but the main differences are:

1. Allowing a GUC value change to be decoupled from actually applying it, a
sort of "pending" change. The idea is to let custom logic be triggered from an
assign hook, which then takes responsibility for what happens later and how the
change gets applied. This allows the regular GUC infrastructure to be used in
cases where a value change requires some complicated processing. I was trying
to keep the change not too invasive; note that GUC reporting is still missing.
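
To illustrate the shape of it (a sketch with simplified, made-up names, not the
actual patch code), the assign hook only records the request and defers the
real work:

/*
 * Hypothetical sketch: the assign hook does not resize anything itself,
 * it only remembers the requested value.  The heavy lifting is deferred
 * and performed later by the coordinated resize machinery.
 */
static int pending_nbuffers = -1;   /* -1 means "no pending change" */

static void
assign_shared_buffers(int newval, void *extra)
{
    if (newval == NBuffers)
        return;                     /* nothing to do */

    /* Remember the request; applying it is somebody else's job. */
    pending_nbuffers = newval;
}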

2. The shared memory resizing patch became more complicated due to the
coordination needed between backends. The current implementation was chosen
from a few more or less equal alternatives, which evolve along the following
lines:

* There should be one "coordinator" process overseeing the change. Having
postmaster to fulfill this role like in this patch seems like a natural idea,
but it poses certain challenges since it doesn't have locking infrastructure.
Another option would be to elect a single backend to be a coordinator, which
will handle the postmaster as a special case. If there will ever be a
"coordinator" worker in Postgres, that would be useful here.

* The coordinator uses EmitProcSignalBarrier to reach out to all other backends
and trigger the resize process. Backends join a Barrier to synchronize and wait
until everyone is finished (a rough sketch of this flow follows the list).

* There is some resizing state stored in shared memory, which is there to
handle backends that were for some reason late or didn't receive the signal.
What to store there is open for discussion.

* Since we want to make sure all processes share the same understanding of
what the NBuffers value is, almost any failure is a hard stop: rolling back the
change would require coordination as well, which sounds a bit too complicated
for now.
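
To make the flow above more concrete, here is roughly how it looks (a
simplified sketch; CoordinateShmemResize, PROCSIGNAL_BARRIER_SHMEM_RESIZE and
AnonymousShmemResize are made-up names, and the real patch differs in details
like error handling and where the Barrier lives):

/* Coordinator side: tell everyone, then wait until the signal is processed. */
void
CoordinateShmemResize(void)
{
    uint64      generation;

    /* hypothetical new barrier type for shared memory resize */
    generation = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
    WaitForProcSignalBarrier(generation);
}

/* Backend side, called when absorbing the ProcSignalBarrier message. */
static bool
ProcessBarrierShmemResize(Barrier *barrier)
{
    BarrierAttach(barrier);

    /* Wait until everyone who is participating has arrived... */
    BarrierArriveAndWait(barrier, 0);

    /* ...then every backend remaps its own view of the segments... */
    AnonymousShmemResize();         /* made-up name for the remapping part */

    /* ...and waits until all the others are finished as well. */
    BarrierArriveAndWait(barrier, 0);

    BarrierDetach(barrier);
    return true;
}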

We've tested this change manually for now, although it might be useful to try
out injection points. The testing strategy, which has caught plenty of bugs,
was simply to run pgbench workload against a running instance and change
shared_buffers on the fly. Some more subtle cases were verified by manually
injecting delays to trigger expected scenarios.

To reiterate, here is the patch breakdown:

Patches 1-3 prepare the infrastructure and shared memory layout. They could be
useful even for a multithreaded PostgreSQL, where there would be no need for
shared memory. I assume that in the multithreaded world there will still be a
need for a contiguous chunk of memory shared between threads, and its layout
would be similar to the one used here for shared memory mappings. Note that
patch nr 2 is going away as soon as I get to implementing shared memory address
reservation, but for now it's needed.

Patch 4 is a new addition to handle "pending" GUC changes.

Patch 5 actually does the resizing. It's shared memory specific of course, and
it uses the Linux-specific mremap, which leaves portability questions open.

Patch 6 is somewhat independent, but quite convenient to have. It also uses the
Linux-specific call memfd_create.
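
For reference, both Linux-specific calls can be exercised outside of Postgres
with a few lines of standalone code (just an illustration of the calls, not the
patch itself):

/* Standalone Linux demo: an anonymous file via memfd_create, mapped and
 * then grown via mremap.  The patch keeps the address fixed by reserving
 * space; here we simply allow the kernel to move the mapping. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    size_t  old_size = 16 * 1024 * 1024;
    size_t  new_size = 32 * 1024 * 1024;
    int     fd = memfd_create("shm_demo", 0);
    char   *p;

    if (fd < 0 || ftruncate(fd, old_size) != 0)
        { perror("memfd_create/ftruncate"); return 1; }

    p = mmap(NULL, old_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        { perror("mmap"); return 1; }
    memset(p, 'x', old_size);           /* pretend this is the buffer pool */

    /* Grow the backing file, then the mapping itself. */
    if (ftruncate(fd, new_size) != 0)
        { perror("ftruncate"); return 1; }
    p = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
    if (p == MAP_FAILED)
        { perror("mremap"); return 1; }

    printf("resized mapping, first byte still '%c'\n", p[0]);
    return 0;
}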

I would like to get some feedback on the synchronization part. While waiting
I'll proceed with implementing shared memory address space reservation, and
Ashutosh will continue with buffer eviction to support shared memory reduction.

Attachment

Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:
> > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > changing shared memory mapping layout. Any feedback is appreciated.
>
> Hi,
>
> Here is a new version of the patch, which contains a proposal about how to
> coordinate shared memory resizing between backends. The rest is more or less
> the same, a feedback about coordination is appreciated. It's a lot to read, but
> the main difference is about:

Just one note: there are still a couple of compilation warnings in the
code, which I haven't addressed yet. Those will go away in the next
version.



Re: Changing shared_buffers without restart

From
Ashutosh Bapat
Date:
On Thu, Feb 27, 2025 at 1:58 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:
> > > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > > changing shared memory mapping layout. Any feedback is appreciated.
> >
> > Hi,
> >
> > Here is a new version of the patch, which contains a proposal about how to
> > coordinate shared memory resizing between backends. The rest is more or less
> > the same, a feedback about coordination is appreciated. It's a lot to read, but
> > the main difference is about:
>
> Just one note, there are still couple of compilation warnings in the
> code, which I haven't addressed yet. Those will go away in the next
> version.

PFA the patchset which implements shrinking shared buffers.
0001-0006 are the same as in the previous patchset
0007 fixes compilation warnings from previous patches - I think those
should be absorbed into their respective patches
0008 adds TODOs that need some code changes or at least need some
consideration. Some of them might point to the causes of Assertion
failures seen with this patch set.
0009 adds WIP support for shrinking shared buffers - I think this
should be absorbed into 0005
0010 WIP fix for Assertion failures seen from BgBufferSync() - I am
still investigating those.

I am using the attached script to shake the patch well. It runs
pgbench and concurrently resizes the shared_buffers. I am seeing
assertion failures when running the script in both cases, expanding
and shrinking the buffers. I am investigating the failed
Assert("strategy_delta >= 0") next.

--
Best Wishes,
Ashutosh Bapat

Attachment

Re: Changing shared_buffers without restart

From
Ashutosh Bapat
Date:
On Tue, Feb 25, 2025 at 3:22 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > changing shared memory mapping layout. Any feedback is appreciated.
>
> Hi,
>
> Here is a new version of the patch, which contains a proposal about how to
> coordinate shared memory resizing between backends. The rest is more or less
> the same, a feedback about coordination is appreciated. It's a lot to read, but
> the main difference is about:

Thanks Dmitry for the summary.

>
> 1. Allowing to decouple a GUC value change from actually applying it, sort of a
> "pending" change. The idea is to let a custom logic be triggered on an assign
> hook, and then take responsibility for what happens later and how it's going to
> be applied. This allows to use regular GUC infrastructure in cases where value
> change requires some complicated processing. I was trying to make the change
> not so invasive, plus it's missing GUC reporting yet.
>
> 2. Shared memory resizing patch became more complicated thanks to some
> coordination between backends. The current implementation was chosen from few
> more or less equal alternatives, which are evolving along following lines:
>
> * There should be one "coordinator" process overseeing the change. Having
> postmaster to fulfill this role like in this patch seems like a natural idea,
> but it poses certain challenges since it doesn't have locking infrastructure.
> Another option would be to elect a single backend to be a coordinator, which
> will handle the postmaster as a special case. If there will ever be a
> "coordinator" worker in Postgres, that would be useful here.
>
> * The coordinator uses EmitProcSignalBarrier to reach out to all other backends
> and trigger the resize process. Backends join a Barrier to synchronize and wait
> until everyone is finished.
>
> * There is some resizing state stored in shared memory, which is there to
> handle backends that were for some reason late or didn't receive the signal.
> What to store there is open for discussion.
>
> * Since we want to make sure all processes share the same understanding of what
> NBuffers value is, any failure is mostly a hard stop, since to rollback the
> change coordination is needed as well and sounds a bit too complicated for now.
>

I think we should add a way to monitor the progress of resizing; at
least whether resizing is complete and whether the new GUC value is in
effect.

> We've tested this change manually for now, although it might be useful to try
> out injection points. The testing strategy, which has caught plenty of bugs,
> was simply to run pgbench workload against a running instance and change
> shared_buffers on the fly. Some more subtle cases were verified by manually
> injecting delays to trigger expected scenarios.

I have shared a script with my changes but it's far from being full
testing. We will need to use injection points to test specific
scenarios.

--
Best Wishes,
Ashutosh Bapat



Re: Changing shared_buffers without restart

From
Ni Ku
Date:
Dmitry / Ashutosh,
Thanks for the patch set. I've been doing some testing with it, and in particular want to see whether this solution would work with a hugepage-backed buffer pool.

I ran some simple tests (outside of PG) on linux kernel v6.1, which has this commit that added some hugepage support to mremap (
https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/).

From reading the kernel code and testing, for a hugepage-backed mapping it seems mremap supports only shrinking but not growing. Further, for shrinking, what I observed is that after mremap is called the hugepage memory
is not released back to the OS, rather it's released when the fd is closed (or when the memory is unmapped for a mapping created with MAP_ANONYMOUS).
I'm not sure if this behavior is expected, but being able to release memory back to the OS immediately after mremap would be important for use cases such as supporting "serverless" PG instances on the cloud.

I'm no expert in the linux kernel so I could be missing something. It'd be great if you or somebody can comment on these observations and whether this mremap-based solution would work with hugepage bufferpool.

I also attached the test program in case someone can spot whether I did something wrong.
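
For reference, the core of the growth check is roughly the following (a simplified sketch; it assumes 2MB huge pages have already been reserved via vm.nr_hugepages):

/* Try to grow a hugepage-backed anonymous mapping in place with mremap.
 * On the kernels discussed above this is expected to fail, while
 * shrinking succeeds. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int
main(void)
{
    size_t  huge = 2 * 1024 * 1024;          /* assume 2MB huge pages */
    size_t  old_size = 4 * huge;
    size_t  new_size = 8 * huge;
    char   *p = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    if (p == MAP_FAILED)
        { perror("mmap(MAP_HUGETLB)"); return 1; }
    memset(p, 'x', old_size);                /* actually fault the pages in */

    if (mremap(p, old_size, new_size, 0) == MAP_FAILED)
        perror("mremap grow");               /* what I observed */
    else
        printf("mremap grow succeeded\n");
    return 0;
}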

Regards,

Jack Ng

On Tue, Mar 18, 2025 at 11:02 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Tue, Feb 25, 2025 at 3:22 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > changing shared memory mapping layout. Any feedback is appreciated.
>
> Hi,
>
> Here is a new version of the patch, which contains a proposal about how to
> coordinate shared memory resizing between backends. The rest is more or less
> the same, a feedback about coordination is appreciated. It's a lot to read, but
> the main difference is about:

Thanks Dmitry for the summary.

>
> 1. Allowing to decouple a GUC value change from actually applying it, sort of a
> "pending" change. The idea is to let a custom logic be triggered on an assign
> hook, and then take responsibility for what happens later and how it's going to
> be applied. This allows to use regular GUC infrastructure in cases where value
> change requires some complicated processing. I was trying to make the change
> not so invasive, plus it's missing GUC reporting yet.
>
> 2. Shared memory resizing patch became more complicated thanks to some
> coordination between backends. The current implementation was chosen from few
> more or less equal alternatives, which are evolving along following lines:
>
> * There should be one "coordinator" process overseeing the change. Having
> postmaster to fulfill this role like in this patch seems like a natural idea,
> but it poses certain challenges since it doesn't have locking infrastructure.
> Another option would be to elect a single backend to be a coordinator, which
> will handle the postmaster as a special case. If there will ever be a
> "coordinator" worker in Postgres, that would be useful here.
>
> * The coordinator uses EmitProcSignalBarrier to reach out to all other backends
> and trigger the resize process. Backends join a Barrier to synchronize and wait
> until everyone is finished.
>
> * There is some resizing state stored in shared memory, which is there to
> handle backends that were for some reason late or didn't receive the signal.
> What to store there is open for discussion.
>
> * Since we want to make sure all processes share the same understanding of what
> NBuffers value is, any failure is mostly a hard stop, since to rollback the
> change coordination is needed as well and sounds a bit too complicated for now.
>

I think we should add a way to monitor the progress of resizing; at
least whether resizing is complete and whether the new GUC value is in
effect.

> We've tested this change manually for now, although it might be useful to try
> out injection points. The testing strategy, which has caught plenty of bugs,
> was simply to run pgbench workload against a running instance and change
> shared_buffers on the fly. Some more subtle cases were verified by manually
> injecting delays to trigger expected scenarios.

I have shared a script with my changes but it's far from being full
testing. We will need to use injection points to test specific
scenarios.

--
Best Wishes,
Ashutosh Bapat




Attachment

Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Thu, Mar 20, 2025 at 04:55:47PM GMT, Ni Ku wrote:
>
> I ran some simple tests (outside of PG) on linux kernel v6.1, which has
> this commit that added some hugepage support to mremap (
> https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/
> ).
>
> From reading the kernel code and testing, for a hugepage-backed mapping it
> seems mremap supports only shrinking but not growing. Further, for
> shrinking, what I observed is that after mremap is called the hugepage
> memory
> is not released back to the OS, rather it's released when the fd is closed
> (or when the memory is unmapped for a mapping created with MAP_ANONYMOUS).
> I'm not sure if this behavior is expected, but being able to release memory
> back to the OS immediately after mremap would be important for use cases
> such as supporting "serverless" PG instances on the cloud.
>
> I'm no expert in the linux kernel so I could be missing something. It'd be
> great if you or somebody can comment on these observations and whether this
> mremap-based solution would work with hugepage bufferpool.

Hm, I think you're right. I didn't realize there was such a limitation, but I
just verified on the latest kernel build and hit the same condition on growing
a hugetlb mapping that you've mentioned above. That's annoying of course, but
I've got another approach I was originally experimenting with -- instead of
mremap, do munmap and mmap with the new size and rely on the anonymous fd to
keep the memory content in between. I'm currently reworking the mmap'ing part
of the patch, let me check if this new approach is something we could
universally rely on.



Re: Changing shared_buffers without restart

From
Ni Ku
Date:
Thanks for your insights and confirmation, Dmitry.
Right, I think the anonymous fd approach would work to keep the memory contents intact in between munmap and mmap with the new size, so buffer pool expansion would work.
But it seems shrinking would still be problematic, since that approach requires the anonymous fd to remain open (to preserve the memory contents), and so munmap would not release the memory back to the OS right away (it gets released when the fd is closed). From testing, this is true for hugepage memory at least.
Is there a way around this? Or maybe I misunderstood what you have in mind ;)

Regards,

Jack Ng

On Thu, Mar 20, 2025 at 6:21 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Thu, Mar 20, 2025 at 04:55:47PM GMT, Ni Ku wrote:
>
> I ran some simple tests (outside of PG) on linux kernel v6.1, which has
> this commit that added some hugepage support to mremap (
> https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/
> ).
>
> From reading the kernel code and testing, for a hugepage-backed mapping it
> seems mremap supports only shrinking but not growing. Further, for
> shrinking, what I observed is that after mremap is called the hugepage
> memory
> is not released back to the OS, rather it's released when the fd is closed
> (or when the memory is unmapped for a mapping created with MAP_ANONYMOUS).
> I'm not sure if this behavior is expected, but being able to release memory
> back to the OS immediately after mremap would be important for use cases
> such as supporting "serverless" PG instances on the cloud.
>
> I'm no expert in the linux kernel so I could be missing something. It'd be
> great if you or somebody can comment on these observations and whether this
> mremap-based solution would work with hugepage bufferpool.

Hm, I think you're right. I didn't realize there was such a limitation, but I
just verified on the latest kernel build and hit the same condition on growing
a hugetlb mapping that you've mentioned above. That's annoying of course, but
I've got another approach I was originally experimenting with -- instead of
mremap, do munmap and mmap with the new size and rely on the anonymous fd to
keep the memory content in between. I'm currently reworking the mmap'ing part
of the patch, let me check if this new approach is something we could
universally rely on.

Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Fri, Mar 21, 2025 at 04:48:30PM GMT, Ni Ku wrote:
> Thanks for your insights and confirmation, Dmitry.
> Right, I think the anonymous fd approach would work to keep the memory
> contents intact in between munmap and mmap with the new size, so bufferpool
> expansion would work.
> But it seems shrinking would still be problematic, since that approach
> requires the anonymous fd to remain open (for memory content protection),
> and so munmap would not release the memory back to the OS right away (gets
> released when the fd is closed). From testing this is true for hugepage
> memory at least.
> Is there a way around this? Or maybe I misunderstood what you have in mind
> ;)

The anonymous file will be truncated to its new, shrunken size before mapping
it a second time (I think this part is missing in your test example); to my
understanding, after a quick look at do_vmi_align_munmap, this should be
enough to make the memory reclaimable.
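
To spell out the sequence I have in mind (a sketch with made-up names, error
handling omitted):

#include <sys/mman.h>
#include <unistd.h>

/* Sketch: shrink a mapping backed by an anonymous file (fd). */
static void
shrink_segment(void *base, size_t old_size, size_t new_size, int fd)
{
    /* Drop the whole old mapping... */
    munmap(base, old_size);

    /*
     * ...truncate the backing anonymous file to the new size -- this is
     * what actually lets the kernel reclaim the pages beyond new_size...
     */
    ftruncate(fd, new_size);

    /* ...and map the file again at the same address with the new size. */
    mmap(base, new_size, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 0);
}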



Re: Changing shared_buffers without restart

From
Ni Ku
Date:
You're right Dmitry, truncating the anonymous file before mapping it again does the trick! I see 'HugePages_Free' increase to the expected value right after the ftruncate call when shrinking.
This alternative approach looks very promising. Thanks.

Regards,

Jack Ng

On Fri, Mar 21, 2025 at 5:31 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Fri, Mar 21, 2025 at 04:48:30PM GMT, Ni Ku wrote:
> Thanks for your insights and confirmation, Dmitry.
> Right, I think the anonymous fd approach would work to keep the memory
> contents intact in between munmap and mmap with the new size, so bufferpool
> expansion would work.
> But it seems shrinking would still be problematic, since that approach
> requires the anonymous fd to remain open (for memory content protection),
> and so munmap would not release the memory back to the OS right away (gets
> released when the fd is closed). From testing this is true for hugepage
> memory at least.
> Is there a way around this? Or maybe I misunderstood what you have in mind
> ;)

The anonymous file will be truncated to its new, shrunken size before mapping
it a second time (I think this part is missing in your test example); to my
understanding, after a quick look at do_vmi_align_munmap, this should be
enough to make the memory reclaimable.

Re: Changing shared_buffers without restart

From
Ashutosh Bapat
Date:
On Fri, Feb 28, 2025 at 5:31 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> I think we should add a way to monitor the progress of resizing; at
> least whether resizing is complete and whether the new GUC value is in
> effect.
>

I further tested this approach by tracing the barrier synchronization
using the attached patch, which adds a bunch of elog() calls.
I ran a pgbench load and simultaneously
executed the following commands in a psql connection:

#alter system set shared_buffers to '200MB';
ALTER SYSTEM
#select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

#show shared_buffers;
 shared_buffers
----------------
 200MB
(1 row)

#select count(*) from pg_stat_activity;
 count
-------
     6
(1 row)

#select pg_backend_pid(); - the backend where all these commands were executed
 pg_backend_pid
----------------
         878405
(1 row)

I see the following in the PostgreSQL server logs.

2025-03-12 11:04:53.812 IST [878167] LOG: received SIGHUP, reloading
configuration files
2025-03-12 11:04:53.813 IST [878405] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0

-- not all backends have reloaded configuration.

2025-03-12 11:04:53.813 IST [878173] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878173] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.813 IST [878173] LOG: reached barrier phase 1
2025-03-12 11:04:53.813 IST [878171] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878172] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878171] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878172] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878340] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878340] STATEMENT: UPDATE
pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.813 IST [878340] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878340] STATEMENT: UPDATE
pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.813 IST [878338] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878338] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.813 IST [878339] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878339] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.813 IST [878338] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878338] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.813 IST [878339] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878339] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878341] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878337] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878337] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878168] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:53.814 IST [878172] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878171] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878340] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878340] STATEMENT: UPDATE
pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.814 IST [878338] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878338] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.814 IST [878341] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878337] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878173] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878339] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878339] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.814 IST [878172] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878340] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878340] STATEMENT: UPDATE
pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.814 IST [878341] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878339] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878339] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.814 IST [878171] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878338] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878338] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.814 IST [878337] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878337] LOG: buffer resizing operation
finished at phase 4
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878168] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.814 IST [878168] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.814 IST [878168] LOG: reached barrier phase 1
2025-03-12 11:04:53.814 IST [878168] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878168] LOG: buffer resizing operation
finished at phase 3
2025-03-12 11:04:53.815 IST [878169] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.815 IST [878169] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.815 IST [878169] LOG: reached barrier phase 1
2025-03-12 11:04:53.815 IST [878169] LOG: reached barrier phase 2
2025-03-12 11:04:53.815 IST [878169] LOG: buffer resizing operation
finished at phase 3
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: attached when barrier was at phase 0
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: reached barrier phase 1
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: reached barrier phase 2
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: buffer resizing operation
finished at phase 3
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;

To tell the story in short: pid 173 (for the sake of brevity I am just
mentioning the last three digits of each PID) attached to the barrier first
and immediately reached phase 1. 171, 172, 340, 338, 339, 341 and 337
all attached to the barrier in phase 1. All of these backends completed the
phases in a synchronized fashion. But 168, 169 and 405 were yet to attach
to the barrier since they hadn't reloaded their configuration yet. Each
of these backends then finished all the phases independently of the others.

For your reference
#select pid, application_name, backend_type from pg_stat_activity
where pid in (878169, 878168);
  pid   | application_name |   backend_type
--------+------------------+-------------------
 878168 |                  | checkpointer
 878169 |                  | background writer
(2 rows)

This is because BarrierArriveAndWait() only waits for the already
attached backends. It doesn't wait for backends which are yet to
attach. I think what we want is that *all* the backends execute all
the phases synchronously and wait for the others to finish. If we don't do
that, there's a possibility that some of them will see inconsistent
buffer states or, even worse, may not have the necessary memory mapped and
resized - thus causing segfaults. Am I correct?

I think what needs to be done is that every backend should wait for the other
backends to attach themselves to the barrier before moving to the
first phase. One way I can think of is to use two signal barriers -
one to ensure that all the backends have attached themselves, and a
second for the actual resizing. But then the postmaster needs to wait for
all the processes to process the first signal barrier, and the postmaster
cannot wait on anything. Maybe there's a way to poll, but I didn't find
it. Does that mean that we have to make some other backend the coordinator?

--
Best Wishes,
Ashutosh Bapat



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Mon, Apr 07, 2025 at 11:50:46AM GMT, Ashutosh Bapat wrote:
> This is because the BarrierArriveAndWait() only waits for all the
> attached backends. It doesn't wait for backends which are yet to
> attach. I think what we want is *all* the backends should execute all
> the phases synchronously and wait for others to finish. If we don't do
> that, there's a possibility that some of them would see inconsistent
> buffer states or even worse may not have necessary memory mapped and
> resized - thus causing segfaults. Am I correct?
>
> I think what needs to be done is that every backend should wait for other
> backends to attach themselves to the barrier before moving to the
> first phase. One way I can think of is we use two signal barriers -
> one to ensure that all the backends have attached themselves and
> second for the actual resizing. But then the postmaster needs to wait for
> all the processes to process the first signal barrier. A postmaster can
> not wait on anything. Maybe there's a way to poll, but I didn't find
> it. Does that mean that we have to make some other backend a coordinator?

Yes, you're right, a plain dynamic Barrier does not ensure all available
processes will be synchronized. I was aware of the scenario you
describe; it's mentioned in the comments for the resize function. I was
under the impression this should be enough, but after some more thinking
I'm not so sure anymore. Let me try to structure it as a list of
possible corner cases that we need to worry about:

* New backend spawned while we're busy resizing shared memory. Those
  should wait until the resizing is complete and get the new size as well.

* Old backend receives a resize message, but exits before attempting to
  resize. Those should be excluded from coordination.

* A backend is blocked and not responding before or after the
  ProcSignalBarrier message was sent. I'm thinking about a failure
  situation, when one rogue backend is doing something without checking
  for interrupts. We need to wait for those to become responsive, and
  potentially abort shared memory resize after some timeout.

* Backends join the barrier in disjoint groups with some time in
  between, which is longer than what it takes to resize shared memory.
  That means that relying only on the shared dynamic barrier is not
  enough -- it will only synchronize the resize procedure within those
  groups.

Out of those I think the third poses some problems, e.g. if we're shrinking
the shared memory, but one backend is accessing the buffer pool without
checking for interrupts. In the v3 implementation this won't be handled
correctly, other backends will ignore such a rogue process. Independently
of that, we could reason about the logic much more easily if it were
guaranteed that all the processes resizing shared memory wait for each
other and start simultaneously.

It looks like to achieve that we need a slightly different combination of a
global Barrier and the ProcSignalBarrier mechanism. We can't use
ProcSignalBarrier as it is, because processes need to wait for each
other, and at the same time finish processing to bump the generation. We
also can't use a simple dynamic Barrier due to the possibility of disjoint
groups of processes. A static Barrier is not easier either, because we
would somehow need to know the exact number of processes, which might
change over time.

I think a relatively elegant solution is to extend the ProcSignalBarrier
mechanism to track not only pss_barrierGeneration, as a sign that
everything was processed, but also something like
pss_barrierReceivedGeneration, indicating that the message was received
everywhere but not yet processed. That would be enough to allow
processes to wait until the resize message has been received everywhere,
then use a global Barrier to wait until all processes are finished. It's
somewhat similar to your proposal to use two signals, but has less
implementation overhead.
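
Roughly like this (a sketch only; AbsorbShmemResizeBarrier is a made-up name,
and the slot iteration and waiting are simplified compared to the actual
patch):

/*
 * Sketch: each ProcSignal slot gets a second generation counter,
 * pss_barrierReceivedGeneration, bumped as soon as the message is seen.
 */
static void
AbsorbShmemResizeBarrier(uint64 shared_gen)
{
    /* The message is observed, record that before doing the actual work. */
    pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
                        shared_gen);
}

/* Wait until the given generation has been received (not necessarily
 * processed) by every slot, so nobody can lag behind in a disjoint group. */
static void
WaitForProcSignalBarrierReceived(uint64 generation)
{
    for (int i = 0; i < NumProcSignalSlots; i++)
    {
        ProcSignalSlot *slot = &ProcSignal->psh_slot[i];

        while (pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration) < generation)
            pg_usleep(1000L);   /* the real code would sleep on a latch */
    }
}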

This would also allow different solutions regarding error handling. E.g.
we could do unbounded waiting for all the processes we expect to resize,
assuming that the user will be able to intervene and fix an issue if
there is one. Or we could do a timed wait, and abort the resize after
some timeout if not all processes are ready yet. In the new v4 version
of the patch the first option is implemented.

On top of that, there are the following changes:

* Shared memory address space is now reserved for future usage, making
  clashes between shared memory segments (e.g. due to memory allocation)
  impossible. There is a new GUC to control how much space to reserve,
  which is called max_available_memory -- on the assumption that most of
  the time it would make sense to set its value to the total amount of
  memory on the machine. I'm open to suggestions regarding the name.

* There is one more patch to address remapping of huge pages. As mentioned
  in this thread above, the Linux kernel has certain limitations when it
  comes to mremap for segments allocated with huge pages. To work around
  that, it's possible to replace mremap with a sequence of unmap and map
  again, relying on the anon file behind the segment to keep the memory
  content. I haven't found any downsides to this approach so far, but it
  makes the anonymous file patch 0007 mandatory.

Attachment

Re: Changing shared_buffers without restart

From
Ashutosh Bapat
Date:
On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> In the new v4 version
> of the patch the first option is implemented.
>

The patches don't apply cleanly using git am, but patch -p1 applies
them cleanly. However, I see the following compilation errors:

RuntimeError: command "ninja" failed with error
[1/1954] Generating src/include/utils/errcodes with a custom command
[2/1954] Generating src/include/storage/lwlocknames_h with a custom command
[3/1954] Generating src/include/utils/wait_event_names with a custom command
[4/1954] Compiling C object src/port/libpgport.a.p/pg_popcount_aarch64.c.o
[5/1954] Compiling C object src/port/libpgport.a.p/pg_numa.c.o
FAILED: src/port/libpgport.a.p/pg_numa.c.o
cc -Isrc/port/libpgport.a.p -Isrc/include
-I../../coderoot/pg/src/include -fdiagnostics-color=always
-D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Werror -g
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -D_GNU_SOURCE
-Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3
-Wcast-function-type -Wshadow=compatible-local -Wformat-security
-Wdeclaration-after-statement -Wno-format-truncation
-Wno-stringop-truncation -fPIC -DFRONTEND -MD -MQ
src/port/libpgport.a.p/pg_numa.c.o -MF
src/port/libpgport.a.p/pg_numa.c.o.d -o
src/port/libpgport.a.p/pg_numa.c.o -c
../../coderoot/pg/src/port/pg_numa.c
In file included from ../../coderoot/pg/src/include/storage/spin.h:54,
                 from
../../coderoot/pg/src/include/storage/condition_variable.h:26,
                 from ../../coderoot/pg/src/include/storage/barrier.h:22,
                 from ../../coderoot/pg/src/include/storage/pg_shmem.h:27,
                 from ../../coderoot/pg/src/port/pg_numa.c:26:
../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error
"s_lock.h may not be included from frontend code"
   93 | #error "s_lock.h may not be included from frontend code"
      |  ^~~~~
In file included from ../../coderoot/pg/src/port/pg_numa.c:26:
../../coderoot/pg/src/include/storage/pg_shmem.h:66:9: error: unknown
type name ‘pg_atomic_uint32’
   66 |         pg_atomic_uint32        NSharedBuffers;
      |         ^~~~~~~~~~~~~~~~
../../coderoot/pg/src/include/storage/pg_shmem.h:68:9: error: unknown
type name ‘pg_atomic_uint64’
   68 |         pg_atomic_uint64        Generation;
      |         ^~~~~~~~~~~~~~~~
../../coderoot/pg/src/port/pg_numa.c: In function ‘pg_numa_get_pagesize’:
../../coderoot/pg/src/port/pg_numa.c:117:17: error: too few arguments
to function ‘GetHugePageSize’
  117 |                 GetHugePageSize(&os_page_size, NULL);
      |                 ^~~~~~~~~~~~~~~
In file included from ../../coderoot/pg/src/port/pg_numa.c:26:
../../coderoot/pg/src/include/storage/pg_shmem.h:127:13: note: declared here
  127 | extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
      |             ^~~~~~~~~~~~~~~

--
Best Wishes,
Ashutosh Bapat



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Wed, Apr 09, 2025 at 11:12:18AM GMT, Ashutosh Bapat wrote:
> On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> > In the new v4 version
> > of the patch the first option is implemented.
> >
>
> The patches don't apply cleanly using git am but patch -p1 applies
> them cleanly. However I see following compilation errors
> RuntimeError: command "ninja" failed with error

Because it's relatively meaningless to apply a patch to the tip of the
master around the release freeze time :) Commit 65c298f61fc has
introduced a new usage of GetHugePageSize, which was modified in my patch.
I'm going to address it with the next rebased version; in the meantime
you can always use the specified base commit to apply the changeset:

    base-commit: 5e1915439085014140314979c4dd5e23bd677cac



Re: Changing shared_buffers without restart

From
Ashutosh Bapat
Date:
On Wed, Apr 9, 2025 at 1:15 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Wed, Apr 09, 2025 at 11:12:18AM GMT, Ashutosh Bapat wrote:
> > On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > >
> > > In the new v4 version
> > > of the patch the first option is implemented.
> > >
> >
> > The patches don't apply cleanly using git am but patch -p1 applies
> > them cleanly. However I see following compilation errors
> > RuntimeError: command "ninja" failed with error
>
> Because it's relatively meaningless to apply a patch to the tip of the
> master around the release freeze time :) Commit 65c298f61fc has
> introduced a new usage of GetHugePageSize, which was modified in my patch.
> I'm going to address it with the next rebased version; in the meantime
> you can always use the specified base commit to apply the changeset:
>
>     base-commit: 5e1915439085014140314979c4dd5e23bd677cac

There is a higher chance that people will try these patches now than
there was two days ago, and a better chance still if they find the
patches easy to apply.

../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error
"s_lock.h may not be included from frontend code"

How about this? Why is that happening?

--
Best Wishes,
Ashutosh Bapat



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Wed, Apr 09, 2025 at 01:20:16PM GMT, Ashutosh Bapat wrote:
> ../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error
> "s_lock.h may not be included from frontend code"
>
> How about this? Why is that happening?

The same -- as you can see, it comes from compiling pg_numa.c, which it
seems is used in frontend code and imports pg_shmem.h. I wanted to
reshuffle the includes in the patch anyway; that would be a good excuse to
finally do this.



Re: Changing shared_buffers without restart

From
Ashutosh Bapat
Date:
On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Yes, you're right, plain dynamic Barrier does not ensure all available
> processes will be synchronized. I was aware about the scenario you
> describe, it's mentioned in commentaries for the resize function. I was
> under the impression this should be enough, but after some more thinking
> I'm not so sure anymore. Let me try to structure it as a list of
> possible corner cases that we need to worry about:
>
> * New backend spawned while we're busy resizing shared memory. Those
>   should wait until the resizing is complete and get the new size as well.
>
> * Old backend receives a resize message, but exits before attempting to
>   resize. Those should be excluded from coordination.

Should we detach the barrier in on_exit()?
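
Something like this, just to illustrate what I mean (a sketch; it assumes the
resize Barrier lives in ShmemCtrl and that we track whether this backend has
attached):

static bool attached_to_resize_barrier = false;

static void
ShmemResizeDetachOnExit(int code, Datum arg)
{
    /* Don't leave the resize barrier waiting for a process that is gone. */
    if (attached_to_resize_barrier)
        BarrierDetach(&ShmemCtrl->Barrier);
}

/* Registered once, e.g. during backend startup. */
void
RegisterShmemResizeExitCallback(void)
{
    on_shmem_exit(ShmemResizeDetachOnExit, (Datum) 0);
}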

>
> * A backend is blocked and not responding before or after the
>   ProcSignalBarrier message was sent. I'm thinking about a failure
>   situation, when one rogue backend is doing something without checking
>   for interrupts. We need to wait for those to become responsive, and
>   potentially abort shared memory resize after some timeout.

Right.

>
> I think a relatively elegant solution is to extend ProcSignalBarrier
> mechanism to track not only pss_barrierGeneration, as a sign that
> everything was processed, but also something like
> pss_barrierReceivedGeneration, indicating that the message was received
> everywhere but not processed yet. That would be enough to allow
> processes to wait until the resize message was received everywhere, then
> use a global Barrier to wait until all processes are finished.  It's
> somehow similar to your proposal to use two signals, but has less
> implementation overhead.

The way it's implemented in v4 still has the disjoint group problem.
Assume backends p1, p2, p3. All three of them are executing
ProcessProcSignalBarrier(). All three of them updated
pss_barrierReceivedGeneration

/* The message is observed, record that */
pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
shared_gen);

p1 and p2 moved faster and reached the following code from ProcessBarrierShmemResize():
if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
  WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));

Since all the processes have received the barrier message, p1 and p2 move
ahead, go through all the next phases and finish resizing even
before p3 gets a chance to call ProcessBarrierShmemResize() and attach
itself to the Barrier. This could happen because p3 was busy processing
some other ProcSignalBarrier message. p1 and p2 won't wait for p3 since it
has not attached itself to the barrier. Once p1 and p2 finish, p3 will
attach itself to the barrier and resize buffers again - reinitializing the
shared memory, which might have already been modified by p1 or p2. Boom
- there's memory corruption.

Either every process has to make sure that all the other extant
backends have attached themselves to the barrier OR somebody has to
ensure that and signal all the backends to proceed. The implementation
doesn't do either.

>
> * Shared memory address space is now reserved for future usage, making
>   shared memory segments clash (e.g. due to memory allocation)
>   impossible.  There is a new GUC to control how much space to reserve,
>   which is called max_available_memory -- on the assumption that most of
>   the time it would make sense to set its value to the total amount of
>   memory on the machine. I'm open for suggestions regarding the name.

With 0006 applied
+ /* Clean up some reserved space to resize into */
+ if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
ze, m->shmem)));
... snip ...
+ ptr = mremap(m->shmem, m->shmem_size, new_size, 0);

We unmap the portion of reserved address space that the existing
segment would expand into. As long as we are just expanding, this will
work. I am wondering how this would work for shrinking buffers? What
scheme do you have in mind?

>
> * There is one more patch to address hugepages remap. As mentioned in
>   this thread above, Linux kernel has certain limitations when it comes
>   to mremap for segments allocated with huge pages. To work around it's
>   possible to replace mremap with a sequence of unmap and map again,
>   relying on the anon file behind the segment to keep the memory
>   content. I haven't found any downsides of this approach so far, but it
>   makes the anonymous file patch 0007 mandatory.

In 0008
if (munmap(m->shmem, m->shmem_size) < 0)
... snip ...
/* Resize the backing anon file. */
if(ftruncate(m->segment_fd, new_size) == -1)
...
/* Reclaim the space */
ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
mmap_flags | MAP_FIXED, m->segment_fd, 0);

How are we preventing something from getting mapped into the space after
m->shmem + newsize? We will need to add an unallocated but reserved
address space mapping after m->shmem + newsize, right?

--
Best Wishes,
Ashutosh Bapat



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Fri, Apr 11, 2025 at 08:04:39PM GMT, Ashutosh Bapat wrote:
> On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> > Yes, you're right, plain dynamic Barrier does not ensure all available
> > processes will be synchronized. I was aware about the scenario you
> > describe, it's mentioned in commentaries for the resize function. I was
> > under the impression this should be enough, but after some more thinking
> > I'm not so sure anymore. Let me try to structure it as a list of
> > possible corner cases that we need to worry about:
> >
> > * New backend spawned while we're busy resizing shared memory. Those
> >   should wait until the resizing is complete and get the new size as well.
> >
> > * Old backend receives a resize message, but exits before attempting to
> >   resize. Those should be excluded from coordination.
>
> Should we detach barrier in on_exit()?

Yeah, good point.

> > I think a relatively elegant solution is to extend ProcSignalBarrier
> > mechanism to track not only pss_barrierGeneration, as a sign that
> > everything was processed, but also something like
> > pss_barrierReceivedGeneration, indicating that the message was received
> > everywhere but not processed yet. That would be enough to allow
> > processes to wait until the resize message was received everywhere, then
> > use a global Barrier to wait until all processes are finished.  It's
> > somehow similar to your proposal to use two signals, but has less
> > implementation overhead.
>
> The way it's implemented in v4 still has the disjoint group problem.
> Assume backends p1, p2, p3. All three of them are executing
> ProcessProcSignalBarrier(). All three of them updated
> pss_barrierReceivedGeneration
>
> /* The message is observed, record that */
> pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
> shared_gen);
>
> p1, p2 moved faster and reached following code from ProcessBarrierShmemResize()
> if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
>   WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));
>
> Since all the processes have received the barrier message, p1, p2 move
> ahead and go through all the next phases and finish resizing even
> before p3 gets a chance to call ProcessBarrierShmemResize() and attach
> itself to Barrier. This could happen because it processed some other
> ProcSignalBarrier message. p1 and p2 won't wait for p3 since it has
> not attached itself to the barrier. Once p1, p2 finish, p3 will attach
> itself to the barrier and resize buffers again - reinitializing the
> shared memory, which might have already been modified by p1 or p2. Boom
> - there's memory corruption.

It won't reinitialize anything, since this logic is controlled by
ShmemCtrl->NSharedBuffers; if it's already updated, nothing will be
changed.

About the race condition you mention, there is indeed a window between
receiving the ProcSignalBarrier and attaching to the global Barrier in
resize, but I don't think any process will be able to touch the buffer
pool while inside this window. Even if the remapping itself was so
blazingly fast that this window was enough to make one process late
(e.g. if it was busy handling some other signal, as you mention), as I've
shown above it shouldn't be a problem.

I can experiment with this case though; maybe there is a way to
completely close this window, so that we don't have to think about even
potential scenarios.

> > * Shared memory address space is now reserved for future usage, making
> >   shared memory segments clash (e.g. due to memory allocation)
> >   impossible.  There is a new GUC to control how much space to reserve,
> >   which is called max_available_memory -- on the assumption that most of
> >   the time it would make sense to set its value to the total amount of
> >   memory on the machine. I'm open for suggestions regarding the name.
>
> With 0006 applied
> + /* Clean up some reserved space to resize into */
> + if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
> ze, m->shmem)));
> ... snip ...
> + ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
>
> We unmap the portion of reserved address space where the existing
> segment would expand into. As long as we are just expanding this will
> work. I am wondering how would this work for shrinking buffers? What
> scheme do you have in mind?

I didn't like this part originally, and after the changes to support hugetlb
I think it's worth it just to replace mremap with munmap/mmap. That way
there will be no such question, e.g. if a segment is getting shrunk
the unmapped area will again become a part of the reserved space.

> > * There is one more patch to address hugepages remap. As mentioned in
> >   this thread above, Linux kernel has certain limitations when it comes
> >   to mremap for segments allocated with huge pages. To work around it's
> >   possible to replace mremap with a sequence of unmap and map again,
> >   relying on the anon file behind the segment to keep the memory
> >   content. I haven't found any downsides of this approach so far, but it
> >   makes the anonymous file patch 0007 mandatory.
>
> In 0008
> if (munmap(m->shmem, m->shmem_size) < 0)
> ... snip ...
> /* Resize the backing anon file. */
> if(ftruncate(m->segment_fd, new_size) == -1)
> ...
> /* Reclaim the space */
> ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
> mmap_flags | MAP_FIXED, m->segment_fd, 0);
>
> How are we preventing something get mapped into the space after
> m->shmem + newsize? We will need to add an unallocated but reserved
> addressed space map after m->shmem+newsize right?

Nope, the segment is allocated from the reserved space already, with
some chunk of it left after the segment's end for resizing purposes. We
only take some part of the designated space, the rest is still reserved.



Re: Changing shared_buffers without restart

From
Ashutosh Bapat
Date:
On Fri, Apr 11, 2025 at 8:31 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > > I think a relatively elegant solution is to extend ProcSignalBarrier
> > > mechanism to track not only pss_barrierGeneration, as a sign that
> > > everything was processed, but also something like
> > > pss_barrierReceivedGeneration, indicating that the message was received
> > > everywhere but not processed yet. That would be enough to allow
> > > processes to wait until the resize message was received everywhere, then
> > > use a global Barrier to wait until all processes are finished.  It's
> > > somehow similar to your proposal to use two signals, but has less
> > > implementation overhead.
> >
> > The way it's implemented in v4 still has the disjoint group problem.
> > Assume backends p1, p2, p3. All three of them are executing
> > ProcessProcSignalBarrier(). All three of them updated
> > pss_barrierReceivedGeneration
> >
> > /* The message is observed, record that */
> > pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
> > shared_gen);
> >
> > p1, p2 moved faster and reached following code from ProcessBarrierShmemResize()
> > if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
> >   WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));
> >
> > Since all the processes have received the barrier message, p1, p2 move
> > ahead and go through all the next phases and finish resizing even
> > before p3 gets a chance to call ProcessBarrierShmemResize() and attach
> > itself to Barrier. This could happen because it processed some other
> > ProcSignalBarrier message. p1 and p2 won't wait for p3 since it has
> > not attached itself to the barrier. Once p1, p2 finish, p3 will attach
> > itself to the barrier and resize buffers again - reinitializing the
> > shared memory, which might have already been modified by p1 or p2. Boom
> > - there's memory corruption.
>
> It won't reinitialize anything, since this logic is controlled by the
> ShmemCtrl->NSharedBuffers, if it's already updated nothing will be
> changed.

Ah, I see it now
if(pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
{

Thanks for the clarification.

However, when we put back the patches to shrink buffers, we will evict
the extra buffers, and shrink - if all the processes haven't
participated in the barrier by then, some of them may try to access
those buffers - re-installing them and then bad things can happen.

>
> About the race condition you mention, there is indeed a window between
> receiving the ProcSignalBarrier and attaching to the global Barrier in
> resize, but I don't think any process will be able to touch buffer pool
> while inside this window. Even if it happens that the remapping itself
> was blazing fast that this window was enough to make one process late
> (e.g. if it was busy handling some other signal as you mention), as I've
> showed above it shouldn't be a problem.
>
> I can experiment with this case though; maybe there is a way to
> completely close this window, so that we don't have to think about even
> potential scenarios.

The window may be small today but we have to make this future proof.
Multiple ProcSignalBarrier messages may be processed in a single call
to ProcessProcSignalBarrier() and if each of those takes as long as
buffer resizing, the window will get bigger and bigger. So we have to
close this window.

>
> > > * Shared memory address space is now reserved for future usage, making
> > >   shared memory segments clash (e.g. due to memory allocation)
> > >   impossible.  There is a new GUC to control how much space to reserve,
> > >   which is called max_available_memory -- on the assumption that most of
> > >   the time it would make sense to set its value to the total amount of
> > >   memory on the machine. I'm open for suggestions regarding the name.
> >
> > With 0006 applied
> > + /* Clean up some reserved space to resize into */
> > + if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
> > ze, m->shmem)));
> > ... snip ...
> > + ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
> >
> > We unmap the portion of reserved address space where the existing
> > segment would expand into. As long as we are just expanding this will
> > work. I am wondering how would this work for shrinking buffers? What
> > scheme do you have in mind?
>
> I didn't like this part originally, and after changes to support hugetlb
> I think it's worth it just to replace mremap with munmap/mmap. That way
> there will be no such question, e.g. if a segment is getting shrunk
> the unmapped area will again become a part of the reserved space.
>

I might not have noticed it, but are we putting two mappings, one
reserved and one allocated, in the same address space, so that when the
allocated mapping shrinks or expands, the reserved mapping continues
to prohibit any other mapping from appearing there? I looked at some
of the previous emails, but didn't find anything that describes how
the reserved mapped space is managed.

--
Best Wishes,
Ashutosh Bapat



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote:
>
> However, when we put back the patches to shrink buffers, we will evict
> the extra buffers, and shrink - if all the processes haven't
> participated in the barrier by then, some of them may try to access
> those buffers - re-installing them and then bad things can happen.

As I've mentioned above, I don't see how a process could try to access a
buffer, if it's on the path between receiving the ProcSignalBarrier and
attaching to the global shmem Barrier, even if we shrink buffers.
AFAICT interrupt handlers should not touch buffers, and otherwise the
process doesn't have any point within this window where it might do
this. Do you have some particular scenario in mind?

> I might have not noticed it, but are we putting two mappings one
> reserved and one allocated in the same address space, so that when the
> allocated mapping shrinks or expands, the reserved mapping continues
> to prohibit any other mapping from appearing there? I looked at some
> of the previous emails, but didn't find anything that describes how
> the reserved mapped space is managed.

I thought so, but this turns out to be incorrect. I've just done a small
experiment -- it looks like when reserving some space, mapping and then
unmapping a small segment from it leaves a non-mapped gap. That would
mean that for shrinking, the newly available space has to be reserved again.



Re: Changing shared_buffers without restart

From
Ashutosh Bapat
Date:
On Mon, Apr 14, 2025 at 12:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote:
> >
> > However, when we put back the patches to shrink buffers, we will evict
> > the extra buffers, and shrink - if all the processes haven't
> > participated in the barrier by then, some of them may try to access
> > those buffers - re-installing them and then bad things can happen.
>
> As I've mentioned above, I don't see how a process could try to access a
> buffer, if it's on the path between receiving the ProcSignalBarrier and
> attaching to the global shmem Barrier, even if we shrink buffers.
> AFAICT interrupt handles should not touch buffers, and otherwise the
> process doesn't have any point withing this window where it might do
> this. Do you have some particular scenario in mind?

ProcessProcSignalBarrier() is not within an interrupt handler but it
responds to a flag set by an interrupt handler. After calling
pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
shared_gen); it will enter the loop

while (flags != 0)
where it may process many barriers before processing
PROCSIGNAL_BARRIER_SHMEM_RESIZE. Nothing stops the other barrier
processing code from touching buffers. Right now it's just smgrrelease
that gets called for the other barrier, but that's not guaranteed in
the future.

>
> > I might have not noticed it, but are we putting two mappings one
> > reserved and one allocated in the same address space, so that when the
> > allocated mapping shrinks or expands, the reserved mapping continues
> > to prohibit any other mapping from appearing there? I looked at some
> > of the previous emails, but didn't find anything that describes how
> > the reserved mapped space is managed.
>
> I though so, but this turns out to be incorrect. Just have done a small
> experiment -- looks like when reserving some space, mapping and
> unmapping a small segment from it leaves a non-mapped gap. That would
> mean for shrinking the new available space has to be reserved again.

Right. That's what I thought, but I didn't see the corresponding code.
So we have to keep track of two mappings for every segment, one for
allocation and one for reserving space, and resize both while
shrinking and expanding buffers. Am I correct?


--
Best Wishes,
Ashutosh Bapat



Re: Changing shared_buffers without restart

From
Ashutosh Bapat
Date:
Hi Dmitry,

On Mon, Apr 14, 2025 at 12:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote:
> >
> > However, when we put back the patches to shrink buffers, we will evict
> > the extra buffers, and shrink - if all the processes haven't
> > participated in the barrier by then, some of them may try to access
> > those buffers - re-installing them and then bad things can happen.
>
> As I've mentioned above, I don't see how a process could try to access a
> buffer, if it's on the path between receiving the ProcSignalBarrier and
> attaching to the global shmem Barrier, even if we shrink buffers.
> AFAICT interrupt handles should not touch buffers, and otherwise the
> process doesn't have any point withing this window where it might do
> this. Do you have some particular scenario in mind?
>
> > I might have not noticed it, but are we putting two mappings one
> > reserved and one allocated in the same address space, so that when the
> > allocated mapping shrinks or expands, the reserved mapping continues
> > to prohibit any other mapping from appearing there? I looked at some
> > of the previous emails, but didn't find anything that describes how
> > the reserved mapped space is managed.
>
> I though so, but this turns out to be incorrect. Just have done a small
> experiment -- looks like when reserving some space, mapping and
> unmapping a small segment from it leaves a non-mapped gap. That would
> mean for shrinking the new available space has to be reserved again.

In an offlist chat, Thomas Munro mentioned that just ftruncate() would
be enough to resize the shared memory, without touching the address maps
with mmap() and munmap().

The ftruncate man page [1] seems to concur with him:

       If the effect of ftruncate() is to decrease the size of a memory
       mapped file or a shared memory object and whole pages beyond the
       new end were previously mapped, then the whole pages beyond the
       new end shall be discarded.

       References to discarded pages shall result in the generation of a
       SIGBUS signal.

       If the effect of ftruncate() is to increase the size of a memory
       object, it is unspecified whether the contents of any mapped pages
       between the old end-of-file and the new are flushed to the
       underlying object.

When shrinking memory, ftruncate() will release the extra pages and
also cause a SIGBUS fault when memory beyond the new size of the file
is accessed, even if the actual address mapping is larger than the
mapped file. When expanding, the memory is allocated as it is written
to, and those pages also become visible in the underlying object.

I played with the attached small program under a debugger, observing pmap
and /proc/<pid>/status after every memory operation. The address map
always shows the full 300MB mapping:
00007fffd2200000 307200K rw-s- memfd:mmap_fd_exp (deleted)

Immediately after mmap()
RssShmem:              0 kB

after first memset
RssShmem:         307200 kB

after ftruncate to 100MB (we don't need to wait for memset() to see
the effect on RssShmem)
RssShmem:         102400 kB

after ftruncate to 200MB (requires memset to see effect on RssShmem)
RssShmem:         102400 kB

after memsetting upto 200MB
RssShmem:         204800 kB

All the observations concur with the man page.
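
(The attached program is not included in this archive. A minimal sketch of
such a test, reconstructed from the pmap output above -- the memfd name and
the sizes come from there, everything else is assumed -- might look like this:)

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define MB (1024UL * 1024UL)

int
main(void)
{
    size_t  full = 300 * MB;
    int     fd = memfd_create("mmap_fd_exp", 0);
    char   *p;

    if (fd < 0 || ftruncate(fd, full) < 0)
        exit(1);

    p = mmap(NULL, full, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        exit(1);

    printf("pid %d\n", (int) getpid());   /* for pmap / /proc/<pid>/status */

    memset(p, 1, full);           /* RssShmem grows to ~300MB */
    ftruncate(fd, 100 * MB);      /* RssShmem drops to ~100MB immediately */
    ftruncate(fd, 200 * MB);      /* no RssShmem change yet... */
    memset(p, 1, 200 * MB);       /* ...until the new pages are written */

    pause();                      /* inspect the process state here */
    return 0;
}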

[1]
https://man7.org/linux/man-pages/man3/ftruncate.3p.html#:~:text=If%20the%20effect%20of%20ftruncate,generation%20of%20a%20SIGBUS%20signal.


--
Best Wishes,
Ashutosh Bapat



Re: Changing shared_buffers without restart

From
Konstantin Knizhnik
Date:


On 25/02/2025 11:52 am, Dmitry Dolgov wrote:
On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
changing shared memory mapping layout. Any feedback is appreciated.

Hi Dmitry,

I am sorry that I have not participated in the discussion in this thread from the very beginning, although I am also very interested in dynamic shared buffer resizing and even proposed my own implementation of it: https://github.com/knizhnik/postgres/pull/2 based on memory ballooning and using `madvise`. And it really works (returns unused memory to the system).
This PoC allowed me to understand the main drawbacks of this approach:

1. Performance of the Postgres CLOCK page eviction algorithm depends on the number of shared buffers. My first naive attempt just to mark unused buffers as invalid caused a significant degradation of performance:

pgbench -c 32 -j 4 -T 100 -P1 -M prepared -S

(here shared_buffers is the maximal shared buffers size and `available_buffers` is the used part):

| shared_buffers | available_buffers | TPS  |
| -------------- | ----------------- | ---- |
| 128MB          | -1                | 280k |
| 1GB            | -1                | 324k |
| 2GB            | -1                | 358k |
| 32GB           | -1                | 350k |
| 2GB            | 128Mb             | 130k |
| 2GB            | 1Gb               | 311k |
| 32GB           | 128Mb             | 13k  |
| 32GB           | 1Gb               | 140k |
| 32GB           | 2Gb               | 348k |

My first thought was to replace the clock with an LRU based on a double-linked list. As there is no lockless
double-linked list implementation, it needs some global lock, and this lock can become a bottleneck. The standard
solution is partitioning: use N LRU lists instead of 1, just like the partitioned hash table used by the buffer
manager to look up buffers. Actually, we could use the same partition locks to protect the LRU lists. But it is not
clear what to do with ring buffers (strategies). So I decided not to perform such a revolution in bufmgr, but to
optimize the clock to skip reserved buffers more efficiently: just add a skip_count field to the buffer descriptor
(see the sketch below). And it helps! Now the worst case shared_buffers/available_buffers = 32Gb/128Mb
shows the same performance, 280k TPS, as shared_buffers=128Mb without ballooning.
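
(A purely illustrative sketch of that idea -- not the actual patch, all names
and the exact bookkeeping are assumptions -- just to show how a skip_count
lets the clock hand jump over a ballooned range in one step:)

#include <stdbool.h>
#include <stdint.h>

typedef struct BufferDescSketch
{
    int         buf_id;
    bool        ballooned;      /* excluded from the usable pool */
    uint32_t    skip_count;     /* ballooned buffers following this one */
    uint32_t    usage_count;
} BufferDescSketch;

/* Assumes at least one non-ballooned buffer exists. */
static int
clock_sweep_sketch(BufferDescSketch *descs, int nbuffers, int *hand)
{
    for (;;)
    {
        BufferDescSketch *buf = &descs[*hand];

        if (buf->ballooned)
        {
            /* jump over the whole ballooned run at once */
            *hand = (*hand + 1 + (int) buf->skip_count) % nbuffers;
            continue;
        }
        if (buf->usage_count == 0)
            return buf->buf_id;             /* found a victim */
        buf->usage_count--;                 /* give it another chance */
        *hand = (*hand + 1) % nbuffers;
    }
}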

2. There are several data structures in Postgres whose size depends on the number of buffers.
In my patch I used a dynamic shared buffer size in some cases, but if a structure has to be allocated in shared memory, then the maximal size still has to be used. We have the buffers themselves (8 kB per buffer), then the main BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's CkptBufferIds (20 B), and the hashmap on the buffer cache (24 B + 8 B/entry).
About 128 bytes per 8 kB buffer seems a large overhead (~1%), but it may be quite noticeable with size differences larger than 2 orders of magnitude:
e.g. to support scaling from 0.5GB to 128GB, with 128 bytes/buffer we'd have ~2GiB of static overhead (128GB / 8kB = 16M buffers, times ~128 B each) on only 0.5GiB of actual buffers.


3. `madvise` is not portable.

Certainly you have moved much further in your proposal compared with my PoC (including huge pages support).
But it is still not quite clear to me how you are going to solve the problem of large memory overhead in case of ~100x variation of the shared buffers size.








Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote:
>
> In an offlist chat Thomas Munro mentioned that just ftruncate() would
> be enough to resize the shared memory without touching address maps
> using mmap and munmap().
>
> ftruncate man page seems to concur with him
>
>        If the effect of ftruncate() is to decrease the size of a memory
>        mapped file or a shared memory object and whole pages beyond the
>        new end were previously mapped, then the whole pages beyond the
>        new end shall be discarded.
>
>        References to discarded pages shall result in the generation of a
>        SIGBUS signal.
>
>        If the effect of ftruncate() is to increase the size of a memory
>        object, it is unspecified whether the contents of any mapped pages
>        between the old end-of-file and the new are flushed to the
>        underlying object.
>
> ftruncate() when shrinking memory will release the extra pages and
> also would cause segmentation fault when memory outside the size of
> file is accessed even if the actual address map is larger than the
> mapped file. The expanded memory is allocated as it is written to, and
> those pages also become visible in the underlying object.

Thanks for sharing. I need to do more thorough tests, but after a quick
look I'm not sure about that. ftruncate will take care of the memory,
but AFAICT the memory mapping will stay the same, is that what you mean?
In that case, if the segment got increased, the memory still can't be
used because it's beyond the mapping end (at least that's what happened
in my test). If the segment got shrunk, the memory couldn't be
reclaimed, because, well, there is already a mapping. Or am I missing
something?

> > > I might have not noticed it, but are we putting two mappings one
> > > reserved and one allocated in the same address space, so that when the
> > > allocated mapping shrinks or expands, the reserved mapping continues
> > > to prohibit any other mapping from appearing there? I looked at some
> > > of the previous emails, but didn't find anything that describes how
> > > the reserved mapped space is managed.
> >
> > I though so, but this turns out to be incorrect. Just have done a small
> > experiment -- looks like when reserving some space, mapping and
> > unmapping a small segment from it leaves a non-mapped gap. That would
> > mean for shrinking the new available space has to be reserved again.
>
> Right. That's what I thought. But I didn't see the corresponding code.
> So we have to keep track of two mappings for every segment - 1 for
> allocation and one for reserving space and resize those two while
> shrinking and expanding buffers. Am I correct?

Not necessarily, depending on what we want. Again, I'll do a bit more testing,
but after a quick check it seems that it's possible to "plug" the gap with a
new reservation mapping, then either reallocate it to another mapping or unmap
both reservations (the main one and the "gap" one) at once. That would mean
that for the current functionality we don't need to track the reservation in
any way beyond the start and the end of the "main" reserved space. The only
consequence I can imagine is possible fragmentation of the reserved space in
case of frequent increases and decreases of a segment's size. But since it's
only reserved space, which will never really be used, it's probably not going
to be a problem.
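
(For illustration, a standalone sketch of that reserve/carve/unmap/re-reserve
cycle -- Linux-specific, using plain anonymous mappings instead of the patch's
memfd-backed segments, so it's an approximation rather than the actual code:)

#include <stdlib.h>
#include <sys/mman.h>

#define RESERVED (1UL << 30)        /* 1GB of reserved address space */
#define SEGMENT  (64UL << 20)       /* a 64MB segment carved out of it */

int
main(void)
{
    /* reserve address space without backing it */
    char   *base = mmap(NULL, RESERVED, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    char   *seg;

    if (base == MAP_FAILED)
        exit(1);

    /* carve a usable segment out of the reservation */
    seg = mmap(base, SEGMENT, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (seg == MAP_FAILED)
        exit(1);

    /* shrinking: unmapping the segment leaves a non-mapped gap ... */
    munmap(seg, SEGMENT);

    /* ... which can be plugged again with a fresh reservation mapping */
    if (mmap(base, SEGMENT, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED,
             -1, 0) == MAP_FAILED)
        exit(1);

    return 0;
}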



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote:
>
> 1. Performance of Postgres CLOCK page eviction algorithm depends on number
> of shared buffers. My first native attempt just to mark unused buffers as
> invalid cause significant degrade of performance

Thanks for sharing!

Right, but it concerns the case when the number of shared buffers is
high, independently of whether it was changed online or with a
restart, correct? In that case it's out of scope for this patch.

> 2. There are several data structures i Postgres which size depends on number
> of buffers.
> In my patch I used in some cases dynamic shared buffer size, but if this
> structure has to be allocated in shared memory then still maximal size has
> to be used. We have the buffers themselves (8 kB per buffer), then the main
> BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's
> CkptBufferIds (20 B), and the hashmap on the buffer cache (24B+8B/entry).
> 128 bytes per 8kb bytes seems to  large overhead (~1%) but but it may be
> quote noticeable with size differences larger than 2 orders of magnitude:
> E.g. to support scaling to from 0.5Gb to 128GB , with 128 bytes/buffer we'd
> have ~2GiB of static overhead on only 0.5GiB of actual buffers.

Not sure what you mean by using a maximal size, can you elaborate?

In the current patch those structures are allocated as before, except
each goes into a separate segment -- without any extra memory overhead
as far as I see.

> 3. `madvise` is not portable.

The current implementation doesn't rely on madvise so far (it might for
shared memory shrinking), but yeah there are plenty of other not very
portable things (MAP_FIXED, memfd_create). All of that is mentioned in
the corresponding patches as a limitation.




Re: Changing shared_buffers without restart

From
Konstantin Knizhnik
Date:
On 18/04/2025 12:26 am, Dmitry Dolgov wrote:
>> On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote:
>>
>> 1. Performance of Postgres CLOCK page eviction algorithm depends on number
>> of shared buffers. My first native attempt just to mark unused buffers as
>> invalid cause significant degrade of performance
> Thanks for sharing!
>
> Right, but it concerns the case when the number of shared buffers is
> high, independently from whether it was changed online or with a
> restart, correct? In that case it's out of scope for this patch.
>
>> 2. There are several data structures i Postgres which size depends on number
>> of buffers.
>> In my patch I used in some cases dynamic shared buffer size, but if this
>> structure has to be allocated in shared memory then still maximal size has
>> to be used. We have the buffers themselves (8 kB per buffer), then the main
>> BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's
>> CkptBufferIds (20 B), and the hashmap on the buffer cache (24B+8B/entry).
>> 128 bytes per 8kb bytes seems to  large overhead (~1%) but but it may be
>> quote noticeable with size differences larger than 2 orders of magnitude:
>> E.g. to support scaling to from 0.5Gb to 128GB , with 128 bytes/buffer we'd
>> have ~2GiB of static overhead on only 0.5GiB of actual buffers.
> Not sure what do you mean by using a maximal size, can you elaborate.
>
> In the current patch those structures are allocated as before, except
> each goes into a separate segment -- without any extra memory overhead
> as far as I see.

Thank you for the explanation. I am sorry that I had not investigated your
patch precisely before writing: it seemed to me that you were placing only
the content of shared buffers in a separate segment.
Now I see that I was wrong, and this is actually the main difference from
the memory ballooning approach I have used. Since you are allocating the
buffer descriptors and the hash table in the same segment, there is no
extra memory overhead.
The only drawback is that we are losing the content of shared buffers in
case of a resize. That may be sad, but it doesn't look like there is a
better alternative.

But there are still some dependencies on the shared buffers size which are
not addressed in this PR.
I am not sure how critical they are, or whether it is possible to do
something here, but at least I want to enumerate them:

1. Checkpointer: the maximal number of checkpointer requests depends on
NBuffers. So if we start with small shared buffers and then upscale, it
may cause too frequent checkpoints:

Size
CheckpointerShmemSize(void)
...
         size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));

CheckpointerShmemInit(void)
         CheckpointerShmem->max_requests = NBuffers;

2. XLOG: number of xlog buffers is calculated depending on number of 
shared buffers:

XLOGChooseNumBuffers(void)
{
...
      xbuffers = NBuffers / 32;

Should not cause errors, but may not be so efficient if, once again,
we start with tiny shared buffers.

3. AIO: AIO max concurrency is also calculated based on number of shared 
buffers:

AioChooseMaxConcurrency(void)
{
...

     max_proportional_pins = NBuffers / max_backends;

For small shared buffers (e.g. 1MB), there will be no concurrency at all.

So none of these issues causes an error, just some inefficient behavior.
But if we want to start with very small shared buffers and then increase
them on demand, it can be a problem.

In all three cases NBuffers is used not just to calculate some threshold
value, but also to determine the size of a structure in shared memory.
The straightforward solution is to place them in the same segment as
shared buffers, but I am not sure how difficult that will be to implement.







Re: Changing shared_buffers without restart

From
Thomas Munro
Date:
On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote:
> >
> > In an offlist chat Thomas Munro mentioned that just ftruncate() would
> > be enough to resize the shared memory without touching address maps
> > using mmap and munmap().
> >
> > ftruncate man page seems to concur with him
> >
> >        If the effect of ftruncate() is to decrease the size of a memory
> >        mapped file or a shared memory object and whole pages beyond the
> >        new end were previously mapped, then the whole pages beyond the
> >        new end shall be discarded.
> >
> >        References to discarded pages shall result in the generation of a
> >        SIGBUS signal.
> >
> >        If the effect of ftruncate() is to increase the size of a memory
> >        object, it is unspecified whether the contents of any mapped pages
> >        between the old end-of-file and the new are flushed to the
> >        underlying object.
> >
> > ftruncate() when shrinking memory will release the extra pages and
> > also would cause segmentation fault when memory outside the size of
> > file is accessed even if the actual address map is larger than the
> > mapped file. The expanded memory is allocated as it is written to, and
> > those pages also become visible in the underlying object.
>
> Thanks for sharing. I need to do more thorough tests, but after a quick
> look I'm not sure about that. ftruncate will take care about the memory,
> but AFAICT the memory mapping will stay the same, is that what you mean?
> In that case if the segment got increased, the memory still can't be
> used because it's beyond the mapping end (at least in my test that's
> what happened). If the segment got shrinked, the memory couldn't be
> reclaimed, because, well, there is already a mapping. Or do I miss
> something?

I was imagining that you might map some maximum possible size at the
beginning to reserve the address space permanently, and then adjust
the virtual memory object's size with ftruncate as required to provide
backing.  Doesn't that achieve the goal with fewer steps, using only
portable* POSIX stuff, and keeping all pointers stable?  I understand
that pointer stability may not be required (I can see roughly how that
argument is constructed), but isn't it still better to avoid having to
prove that and deal with various other problems completely?  Is there
a downside/cost to having a large mapping that is only partially
backed?  I suppose choosing that number might offend you but at least
there is an obvious upper bound: physical memory size.

*You might also want to use fallocate after ftruncate on Linux to
avoid SIGBUS on allocation failure on first touch page fault, which
raises portability questions since it's unspecified whether you can do
that with shm fds and fails on some systems, but let's call that an
independent topic as it's not affected by this choice.



Re: Changing shared_buffers without restart

From
Andres Freund
Date:
Hi,

On April 18, 2025 11:17:21 AM GMT+02:00, Thomas Munro <thomas.munro@gmail.com> wrote:
> Doesn't that achieve the goal with fewer steps, using only
>portable* POSIX stuff, and keeping all pointers stable?  I understand
>that pointer stability may not be required (I can see roughly how that
>argument is constructed), but isn't it still better to avoid having to
>prove that and deal with various other problems completely?

I think we should flat out reject any approach that does not maintain pointer stability.  It would restrict future
optimizations a lot if we can't rely on that (e.g. not materializing tuples when transporting them from worker to
leader; pointering data structures in shared buffers).

Greetings,

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Changing shared_buffers without restart

From
Thomas Munro
Date:
On Fri, Apr 18, 2025 at 9:17 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Thanks for sharing. I need to do more thorough tests, but after a quick
> > look I'm not sure about that. ftruncate will take care about the memory,
> > but AFAICT the memory mapping will stay the same, is that what you mean?
> > In that case if the segment got increased, the memory still can't be
> > used because it's beyond the mapping end (at least in my test that's
> > what happened). If the segment got shrinked, the memory couldn't be
> > reclaimed, because, well, there is already a mapping. Or do I miss
> > something?
>
> I was imagining that you might map some maximum possible size at the
> beginning to reserve the address space permanently, and then adjust
> the virtual memory object's size with ftruncate as required to provide
> backing.  Doesn't that achieve the goal with fewer steps, using only
> portable* POSIX stuff, and keeping all pointers stable?  I understand
> that pointer stability may not be required (I can see roughly how that
> argument is constructed), but isn't it still better to avoid having to
> prove that and deal with various other problems completely?  Is there
> a downside/cost to having a large mapping that is only partially
> backed?  I suppose choosing that number might offend you but at least
> there is an obvious upper bound: physical memory size.

TIL that mmap(size, fd) will actually extend a hugetlb memfd as a side
effect on Linux, as if you had called ftruncate on it (I expected fully
allocated huge pages up to the object's size, just not magical size
changes beyond that when I merely asked to map it).  That doesn't
happen for regular page sizes, or for any page size on my local OS's
shm objects, and doesn't seem to fit mmap's job description given an
fd*, but maybe I'm just confused.  Anyway, a workaround seems to be
to start out with PROT_NONE and MAP_NORESERVE, then mprotect(PROT_READ
| PROT_WRITE) new regions after extending with ftruncate(), at least
in simple tests (see the sketch below)...

(*Hmm, wild uninformed speculation: perhaps the size-setting behaviour
needed when hugetlbfs is used secretly to implement MAP_ANONYMOUS is
being exposed also when a hugetlbfs fd is given explicitly to mmap,
generating this bizarro side effect?)
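
(For reference, a rough sketch of that workaround -- untested here, regular
pages only, hugetlb flags and error reporting omitted, so treat the details
as assumptions:)

#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define MAX_SIZE (1UL << 30)        /* reserve 1GB of address space up front */
#define NEW_SIZE (256UL << 20)      /* then grow the segment to 256MB */

int
main(void)
{
    int     fd = memfd_create("shm_resize_sketch", 0);
    char   *base;

    if (fd < 0)
        exit(1);

    /* reserve the full address range, but provide no access and no backing */
    base = mmap(NULL, MAX_SIZE, PROT_NONE, MAP_SHARED | MAP_NORESERVE, fd, 0);
    if (base == MAP_FAILED)
        exit(1);

    /* provide backing for the first NEW_SIZE bytes ... */
    if (ftruncate(fd, NEW_SIZE) < 0)
        exit(1);

    /* ... and only then make that range accessible */
    if (mprotect(base, NEW_SIZE, PROT_READ | PROT_WRITE) < 0)
        exit(1);

    base[0] = 1;                    /* safe to touch now */
    return 0;
}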



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Fri, Apr 18, 2025 at 09:17:21PM GMT, Thomas Munro wrote:
> I was imagining that you might map some maximum possible size at the
> beginning to reserve the address space permanently, and then adjust
> the virtual memory object's size with ftruncate as required to provide
> backing.  Doesn't that achieve the goal with fewer steps, using only
> portable* POSIX stuff, and keeping all pointers stable?

Ah, I see what you folks mean. So in the latest patch there is a single large
shared memory area reserved with PROT_NONE + MAP_NORESERVE. This area is
logically divided between shmem segments, and each segment is mmap'd out of it
and can be resized within these logical boundaries. Now the suggestion is to
have one reserved area for each segment, and instead of really mmap'ing
something out of it, manage memory via ftruncate.

Yeah, that would work and will allow us to avoid MAP_FIXED and mremap, which are
questionable from a portability point of view. This leaves memfd_create, and I'm
still not completely clear on its portability -- it seems to be specific to
Linux, but others provide compatible implementations as well.

Let me experiment with this idea a bit, I would like to make sure there are no
other limitations we might face.

> I understand that pointer stability may not be required

Just to clarify, the current patch maintains this property (stable pointers),
which I also see as mandatory for any possible implementation.

> *You might also want to use fallocate after ftruncate on Linux to
> avoid SIGBUS on allocation failure on first touch page fault, which
> raises portability questions since it's unspecified whether you can do
> that with shm fds and fails on some systems, but it let's call that an
> independent topic as it's not affected by this choice.

I'm afraid it would be strictly necessary to do fallocate, otherwise we're
back where we were before reservation accounting for huge pages in Linux (lots
of people were facing unexpected SIGBUS when dealing with cgroups).

> TIL that mmap(size, fd) will actually extend a hugetlb memfd as a side
> effect on Linux, as if you had called ftruncate on it (fully allocated
> huge pages I expected up to the object's size, just not magical size
> changes beyond that when I merely asked to map it).  That doesn't
> happen for regular page size, or for any page size on my local OS's
> shm objects and doesn't seem to fit mmap's job description given an
> fd*, but maybe I'm just confused.  Anyway, a  workaround seems to be
> to start out with PROT_NONE and MAP_NORESERVE, then mprotect(PROT_READ
> | PROT_WRITE) new regions after extending with ftruncate(), at least
> in simple tests...

Right, it's similar to the currently implemented space reservation, which also
goes with PROT_NONE and MAP_NORESERVE. I assume it boils down to the way
memory reservation accounting works in Linux.



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Thu, Apr 17, 2025 at 07:05:36PM GMT, Ni Ku wrote:
> I also have a related question about how ftruncate() is used in the patch.
> In my testing I also see that when using ftruncate to shrink a shared
> segment, the memory is freed immediately after the call, even if other
> processes still have that memory mapped, and they will hit SIGBUS if they
> try to access that memory again as the manpage says.
>
> So am I correct to think that, to support the bufferpool shrinking case, it
> would not be safe to call ftruncate in AnonymousShmemResize as-is, since at
> that point other processes may still be using pages that belong to the
> truncated memory?
> It appears that for shrinking we should only call ftruncate when we're sure
> no process will access those pages again (eg, all processes have handled
> the resize interrupt signal barrier). I suppose this can be done by the
> resize coordinator after synchronizing with all the other processes.
> But in that case it seems we cannot use the postmaster as the coordinator
> then? b/c I see some code comments saying the postmaster does not have
> waiting infrastructure... (maybe even if the postmaster has waiting infra
> we don't want to use it anyway since it can be blocked for a long time and
> won't be able to serve other requests).

There is already a coordination infrastructure, implemented in the patch
0006, which will take care of this and prevent access to the shared
memory until everything is resized.



Re: Changing shared_buffers without restart

From
Dmitry Dolgov
Date:
> On Fri, Apr 18, 2025 at 10:06:23AM GMT, Konstantin Knizhnik wrote:
> The only drawback is that we are loosing content of shared buffers in case
> of resize. It may be sadly, but not looks like there is no better
> alternative.

No, why would we lose the content? If we do mremap, it will leave the
content as it is. If we do munmap/mmap with an anonymous backing file,
it will also keep the content in memory. The same goes for the other
proposal of using only ftruncate/fallocate: both will leave the content
untouched unless told to do otherwise.

> But there are still some dependencies on shared buffers size which are not
> addressed in this PR.
> I am not sure how critical they are and is it possible to do something here,
> but at least I want to enumerate them:

Right, I'm aware of those (except the AIO one, which was added after
the first version of the patch), and didn't address them yet for the
same reason you've mentioned -- they're not hard errors, rather
inefficiencies. But thanks for the reminder, I keep those in the back of
my mind, and when the rest of the design is settled, I'll try to
address them as well.



Re: Changing shared_buffers without restart

From
Thomas Munro
Date:
On Mon, Apr 21, 2025 at 9:30 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Yeah, that would work and will allow to avoid MAP_FIXED and mremap, which are
> questionable from portability point of view. This leaves memfd_create, and I'm
> still not completely clear on it's portability -- it seems to be specific to
> Linux, but others provide compatible implementation as well.

Something like this should work, roughly based on DSM code except here
we don't really need the name so we unlink it immediately, at the
slight risk of leaking it if the postmaster is killed between those
lines (maybe someone should go and tell POSIX to support the special
name SHM_ANON or some other way to avoid that; I can't see any
portable workaround).  Not tested/compiled, just a sketch:

#ifdef HAVE_MEMFD_CREATE
  /* Anonymous shared memory region. */
  fd = memfd_create("foo", MFD_CLOEXEC | huge_pages_flags);
#else
  /* Standard POSIX insists on a name, which we unlink immediately. */
  do
  {
      char tmp[80];

      snprintf(tmp, sizeof(tmp), "/PostgreSQL.%u",
               pg_prng_uint32(&pg_global_prng_state));
      fd = shm_open(tmp, O_RDWR | O_CREAT | O_EXCL, 0600);
      if (fd >= 0)
          shm_unlink(tmp);
  } while (fd < 0 && errno == EEXIST);
#endif

> Let me experiment with this idea a bit, I would like to make sure there are no
> other limitations we might face.

One thing I'm still wondering about is whether you really need all
this multi-phase barrier stuff, or even need to stop other backends
from running at all while doing the resize.  I guess that's related to
your remapping scheme, but supposing you find the simple
ftruncate()-only approach to be good, my next question is:  why isn't
it enough to wait for all backends to agree to stop allocating new
buffers in the range to be truncated, and then left them continue to
run as normal?  As far as they would be concerned, the in-progress
downsize has already happened, though it could be reverted later if
the eviction phase fails.  Then the coordinator could start evicting
buffers and truncating the shared memory object, which are
phases/steps, sure, but it's not clear to me why they need other
backends' help.

It sounds like Windows might need a second ProcSignalBarrier poke in
order to call VirtualUnlock() in every backend.  That's based on that
Usenet discussion I lobbed in here the other day; I haven't tried it
myself or fully grokked why it works, and there could well be other
ways, IDK.  Assuming it's the right approach, between the first poke
to make all backends accept the new lower size and the second poke to
unlock the memory, I don't see why they need to wait.  I suppose it
would be the same ProcSignalBarrier, but behave differently based on a
control variables.  I suppose there could also be a third poke, if you
want to consider the operation to be fully complete only once they
have all actually done that unlock step, but it may also be OK not to
worry about that, IDK.

On the other hand, maybe it just feels less risky if you stop the
whole world, or maybe you envisage parallelising the eviction work, or
there is some correctness concern I haven't grokked yet, but what?

> > *You might also want to use fallocate after ftruncate on Linux to
> > avoid SIGBUS on allocation failure on first touch page fault, which
> > raises portability questions since it's unspecified whether you can do
> > that with shm fds and fails on some systems, but it let's call that an
> > independent topic as it's not affected by this choice.
>
> I'm afraid it would be strictly neccessary to do fallocate, otherwise we're
> back where we were before reservation accounting for huge pages in Linux (lot's
> of people were facing unexpected SIGBUS when dealing with cgroups).

Yeah.  FWIW here is where we decided to gate that on __linux__ while
fixing that for DSM:


https://www.postgresql.org/message-id/flat/CAEepm%3D0euOKPaYWz0-gFv9xfG%2B8ptAjhFjiQEX0CCJaYN--sDQ%40mail.gmail.com#c81b941d300f04d382472e6414cec1f4