Re: Changing shared_buffers without restart - Mailing list pgsql-hackers

From Dmitry Dolgov
Subject Re: Changing shared_buffers without restart
Msg-id eqs6v4rsboazl67xz3wxc6xjkgrpfybitpl45y3lmb2br67wbj@o7czebb3rlgd
In response to Re: Changing shared_buffers without restart  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
List pgsql-hackers
> On Mon, Apr 07, 2025 at 11:50:46AM GMT, Ashutosh Bapat wrote:
> This is because the BarrierArriveAndWait() only waits for all the
> attached backends. It doesn't wait for backends which are yet to
> attach. I think what we want is *all* the backends should execute all
> the phases synchronously and wait for others to finish. If we don't do
> that, there's a possibility that some of them would see inconsistent
> buffer states or even worse may not have necessary memory mapped and
> resized - thus causing segfaults. Am I correct?
>
> I think what needs to be done is that every backend should wait for other
> backends to attach themselves to the barrier before moving to the
> first phase. One way I can think of is we use two signal barriers -
> one to ensure that all the backends have attached themselves and
> second for the actual resizing. But then the postmaster needs to wait for
> all the processes to process the first signal barrier. A postmaster can
> not wait on anything. Maybe there's a way to poll, but I didn't find
> it. Does that mean that we have to make some other backend a coordinator?

Yes, you're right, a plain dynamic Barrier does not ensure that all
available processes will be synchronized. I was aware of the scenario you
describe, it's mentioned in the comments for the resize function. I was
under the impression this should be enough, but after some more thinking
I'm not so sure anymore. Let me try to structure it as a list of
possible corner cases that we need to worry about:

* New backend spawned while we're busy resizing shared memory. Those
  should wait until the resizing is complete and get the new size as well.

* Old backend receives a resize message, but exits before attempting to
  resize. Those should be excluded from coordination.

* A backend is blocked and not responding before or after the
  ProcSignalBarrier message was sent. I'm thinking about a failure
  situation, when one rogue backend is doing something without checking
  for interrupts. We need to wait for those to become responsive, and
  potentially abort shared memory resize after some timeout.

* Backends join the barrier in disjoint groups with some time in
  between, which is longer than what it takes to resize shared memory.
  That means that relying only on the shared dynamic barrier is not
  enough -- it will only synchronize the resize procedure within those
  groups.

Out of those I think the third poses some problems, e.g. if we are
shrinking the shared memory, but one backend is accessing the buffer pool
without checking for interrupts. In the v3 implementation this won't be
handled correctly, other backends will ignore such a rogue process.
Independently of that, we could reason about the logic much more easily
if it's guaranteed that all the processes resizing shared memory wait
for each other and start simultaneously.

It looks like to achieve that we need a slightly different combination
of a global Barrier and the ProcSignalBarrier mechanism. We can't use
ProcSignalBarrier as it is, because processes need to wait for each
other, and at the same time finish processing to bump the generation. We
also can't use a simple dynamic Barrier due to the possibility of
disjoint groups of processes. A static Barrier is not easier either,
because we would somehow need to know the exact number of processes,
which might change over time.

I think a relatively elegant solution is to extend the ProcSignalBarrier
mechanism to track not only pss_barrierGeneration, as a sign that
everything was processed, but also something like
pss_barrierReceivedGeneration, indicating that the message was received
everywhere but not processed yet. That would be enough to allow
processes to wait until the resize message was received everywhere, then
use a global Barrier to wait until all processes are finished. It's
somewhat similar to your proposal to use two signals, but has less
implementation overhead.
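
To make the idea a bit more concrete, here is a minimal standalone sketch.
It is not code from the patch: it simulates backends with POSIX threads,
and DemoSlot, backend_main and run_resize_demo are hypothetical names. The
two counters stand in for pss_barrierReceivedGeneration and
pss_barrierGeneration. Unlike the real system, the number of participants
is fixed here, so a plain pthread barrier can play the role of the global
Barrier.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NBACKENDS 4

/* Hypothetical per-backend slot, loosely modelled on a ProcSignal slot. */
typedef struct
{
    _Atomic uint64_t received_generation;   /* message seen, not processed */
    _Atomic uint64_t processed_generation;  /* message fully processed */
} DemoSlot;

static DemoSlot slots[NBACKENDS];
static pthread_barrier_t resize_barrier;
static _Atomic int resized_count;
static _Atomic int violations;

/* True once every slot has acknowledged the given generation. */
static bool
all_received(uint64_t generation)
{
    for (int i = 0; i < NBACKENDS; i++)
        if (atomic_load(&slots[i].received_generation) < generation)
            return false;
    return true;
}

static void *
backend_main(void *arg)
{
    DemoSlot   *slot = arg;
    const uint64_t generation = 1;

    /* Step 1: acknowledge that the resize message was received. */
    atomic_store(&slot->received_generation, generation);

    /* Step 2: do not start until everybody has received the message. */
    while (!all_received(generation))
        sched_yield();

    /* Step 3: the actual resize work would happen here. */
    atomic_fetch_add(&resized_count, 1);

    /* Step 4: wait until all participants have finished resizing. */
    pthread_barrier_wait(&resize_barrier);
    if (atomic_load(&resized_count) != NBACKENDS)
        atomic_fetch_add(&violations, 1);

    /* Step 5: only now report the message as processed. */
    atomic_store(&slot->processed_generation, generation);
    return NULL;
}

/* Runs one simulated resize; returns the number of ordering violations. */
int
run_resize_demo(void)
{
    pthread_t   threads[NBACKENDS];

    atomic_store(&resized_count, 0);
    atomic_store(&violations, 0);
    for (int i = 0; i < NBACKENDS; i++)
    {
        atomic_store(&slots[i].received_generation, 0);
        atomic_store(&slots[i].processed_generation, 0);
    }
    pthread_barrier_init(&resize_barrier, NULL, NBACKENDS);

    for (int i = 0; i < NBACKENDS; i++)
    {
        int rc = pthread_create(&threads[i], NULL, backend_main, &slots[i]);

        assert(rc == 0);
    }
    for (int i = 0; i < NBACKENDS; i++)
        pthread_join(threads[i], NULL);

    pthread_barrier_destroy(&resize_barrier);
    return atomic_load(&violations);
}
```

Calling run_resize_demo() should return 0, i.e. no thread gets past the
barrier before every thread has done its part of the resize, and no thread
starts before the message was received everywhere.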

This would also allow different solutions regarding error handling. E.g.
we could do an unbounded wait for all processes we expect to resize,
assuming that the user will be able to intervene and fix an issue if
there is any. Or we can do a timed wait, and abort the resize after
some timeout if not all processes are ready yet. In the new v4 version
of the patch the first option is implemented.
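
The two waiting strategies differ only in whether there is a deadline. A
rough sketch of that choice, with hypothetical names (WaitSlot,
wait_for_receipt) and simple polling instead of whatever wait primitive
the patch actually uses, could look like this. A negative timeout means
waiting without bound, which corresponds to the option implemented in v4.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical slot carrying only the "received" generation counter. */
typedef struct
{
    _Atomic uint64_t received_generation;
} WaitSlot;

/* Monotonic clock in milliseconds. */
static int64_t
now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Wait until every slot has acknowledged the given generation.
 *
 * timeout_ms < 0: wait without bound.
 * timeout_ms >= 0: give up after that many milliseconds and return false,
 * so that the caller can abort the resize.
 */
bool
wait_for_receipt(WaitSlot *slots, int nslots, uint64_t generation,
                 int64_t timeout_ms)
{
    const struct timespec pause = {0, 1000000};     /* 1 ms between polls */
    int64_t     deadline = (timeout_ms < 0) ? -1 : now_ms() + timeout_ms;

    assert(nslots > 0);

    for (;;)
    {
        bool        all = true;

        for (int i = 0; i < nslots; i++)
        {
            if (atomic_load(&slots[i].received_generation) < generation)
            {
                all = false;
                break;
            }
        }
        if (all)
            return true;
        if (deadline >= 0 && now_ms() >= deadline)
            return false;
        nanosleep(&pause, NULL);
    }
}
```

With a non-negative timeout, a false return value is the signal to abort
the resize instead of waiting for a rogue backend forever.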

On top of that there are the following changes:

* Shared memory address space is now reserved for future usage, making a
  clash between shared memory segments (e.g. due to memory allocation)
  impossible. There is a new GUC to control how much space to reserve,
  which is called max_available_memory -- on the assumption that most of
  the time it would make sense to set its value to the total amount of
  memory on the machine. I'm open to suggestions regarding the name.

* There is one more patch to address hugepages remap. As mentioned in
  this thread above, the Linux kernel has certain limitations when it
  comes to mremap for segments allocated with huge pages. To work around
  them, it's possible to replace mremap with a sequence of unmap and map
  again, relying on the anon file behind the segment to keep the memory
  content. I haven't found any downsides of this approach so far, but it
  makes the anonymous file patch 0007 mandatory.
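
To illustrate the address space reservation, here is a simplified
Linux-only sketch. It is an assumption about the general technique, not
the patch itself: it uses private anonymous memory and mprotect, whereas
the patch deals with shared segments, and ReservedRegion, reserve_region
and grow_region are hypothetical names. The point is that the range is
claimed up front with PROT_NONE, so growing never changes the base address
and nothing else can be mapped into the range in the meantime.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical bookkeeping for one reserved address range. */
typedef struct
{
    char       *base;       /* start of the reserved range */
    size_t      reserved;   /* total reserved address space, in bytes */
    size_t      used;       /* currently accessible prefix, in bytes */
} ReservedRegion;

/* Reserve address space without making it accessible or committing memory. */
bool
reserve_region(ReservedRegion *r, size_t reserved)
{
    void       *p;

    assert(r != NULL);
    p = mmap(NULL, reserved, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED)
        return false;

    r->base = p;
    r->reserved = reserved;
    r->used = 0;
    return true;
}

/* Make a larger prefix of the reservation usable; the base stays fixed. */
bool
grow_region(ReservedRegion *r, size_t new_used)
{
    assert(r != NULL);
    if (new_used > r->reserved || new_used < r->used)
        return false;
    if (mprotect(r->base, new_used, PROT_READ | PROT_WRITE) != 0)
        return false;

    r->used = new_used;
    return true;
}
```

A request beyond the reserved size fails in grow_region, which mirrors the
role of max_available_memory as an upper bound.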
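
The unmap and map again sequence can be sketched as follows, again
Linux-only and under stated assumptions: it uses memfd_create with regular
pages (no MFD_HUGETLB), lets the kernel pick the new address, and
create_segment and remap_via_file are hypothetical helpers rather than
functions from the patch. The content survives because it lives in the
anonymous file, not in the mapping.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an anonymous file of the given size and map it as shared memory. */
void *
create_segment(size_t size, int *fd_out)
{
    int         fd;
    void       *p;

    assert(fd_out != NULL);
    fd = memfd_create("demo_segment", 0);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t) size) != 0)
    {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
    {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return p;
}

/*
 * Resize a file-backed mapping without mremap: resize the file, drop the
 * old mapping, and map the file again with the new size. Returns the new
 * address, or NULL on failure.
 */
void *
remap_via_file(int fd, void *old_addr, size_t old_size, size_t new_size)
{
    void       *p;

    if (ftruncate(fd, (off_t) new_size) != 0)
        return NULL;
    if (munmap(old_addr, old_size) != 0)
        return NULL;
    p = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```

After growing a segment this way, data written through the old mapping is
still readable through the new one, and the newly added tail reads as
zeroes.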
