> On Mon, Apr 07, 2025 at 11:50:46AM GMT, Ashutosh Bapat wrote:
> This is because the BarrierArriveAndWait() only waits for all the
> attached backends. It doesn't wait for backends which are yet to
> attach. I think what we want is *all* the backends should execute all
> the phases synchronously and wait for others to finish. If we don't do
> that, there's a possibility that some of them would see inconsistent
> buffer states or even worse may not have necessary memory mapped and
> resized - thus causing segfaults. Am I correct?
>
> I think what needs to be done is that every backend should wait for other
> backends to attach themselves to the barrier before moving to the
> first phase. One way I can think of is we use two signal barriers -
> one to ensure that all the backends have attached themselves and
> second for the actual resizing. But then the postmaster needs to wait for
> all the processes to process the first signal barrier. A postmaster can
> not wait on anything. Maybe there's a way to poll, but I didn't find
> it. Does that mean that we have to make some other backend a coordinator?
Yes, you're right, a plain dynamic Barrier does not ensure that all
available processes will be synchronized. I was aware of the scenario
you describe; it's mentioned in the comments for the resize function. I
was under the impression this should be enough, but after some more
thinking I'm not so sure anymore. Let me try to structure it as a list
of possible corner cases that we need to worry about:
* A new backend is spawned while we're busy resizing shared memory. It
should wait until the resizing is complete and get the new size as well.
* An old backend receives a resize message, but exits before attempting
to resize. It should be excluded from coordination.
* A backend is blocked and not responding before or after the
ProcSignalBarrier message was sent. I'm thinking of a failure
situation where one rogue backend is doing something without checking
for interrupts. We need to wait for such backends to become responsive,
and potentially abort the shared memory resize after some timeout.
* Backends join the barrier in disjoint groups, separated by more time
than it takes to resize shared memory. That means that relying only on
the shared dynamic barrier is not enough -- it will only synchronize
the resize procedure within those groups.
Out of those I think the third one poses some problems, e.g. if we're
shrinking shared memory while one backend is accessing the buffer pool
without checking for interrupts. In the v3 implementation this isn't
handled correctly, other backends will simply ignore such a rogue
process. Independently of that, we could reason about the logic much
more easily if it's guaranteed that all the processes resizing shared
memory will wait for each other and start simultaneously.
It looks like to achieve that we need a slightly different combination
of a global Barrier and the ProcSignalBarrier mechanism. We can't use
ProcSignalBarrier as it is, because processes need to wait for each
other and at the same time finish processing to bump the generation. We
also can't use a simple dynamic Barrier because of the possibility of
disjoint groups of processes. A static Barrier doesn't make it easier
either, because we would somehow need to know the exact number of
processes, which might change over time.
I think a relatively elegant solution is to extend the ProcSignalBarrier
mechanism to track not only pss_barrierGeneration, as a sign that
everything has been processed, but also something like
pss_barrierReceivedGeneration, indicating that the message has been
received everywhere but not processed yet. That would be enough to allow
processes to wait until the resize message has been received everywhere,
and then use a global Barrier to wait until all processes have finished.
It's somewhat similar to your proposal to use two signals, but with less
implementation overhead.
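
To illustrate the idea, here is a rough sketch of the waiting side as it
could look in procsignal.c, modeled on the existing
WaitForProcSignalBarrier(). The function name, the new
pss_barrierReceivedGeneration field and the reuse of the per-slot
pss_barrierCV are assumptions for the sake of the example, not the final
patch:

    void
    WaitForProcSignalBarrierReceived(uint64 generation)
    {
        for (int i = NumProcSignalSlots - 1; i >= 0; i--)
        {
            ProcSignalSlot *slot = &ProcSignal->psh_slot[i];
            uint64      oldval;

            /*
             * Wait only for the message to be received, not processed;
             * ProcessProcSignalBarrier() would advance
             * pss_barrierReceivedGeneration and broadcast pss_barrierCV
             * before starting the actual work.
             */
            oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
            while (oldval < generation)
            {
                ConditionVariableSleep(&slot->pss_barrierCV,
                                       WAIT_EVENT_PROC_SIGNAL_BARRIER);
                oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
            }
            ConditionVariableCancelSleep();
        }
    }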
This would also allow different solutions regarding error handling. E.g.
we could wait unboundedly for all processes we expect to resize,
assuming that the user will be able to intervene and fix the issue if
there is one. Or we could do a timed wait and abort the resize after
some timeout if not all processes are ready yet. The new v4 version of
the patch implements the first option.
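
For reference, the timed variant could look roughly like this for a
single slot, using ConditionVariableTimedSleep(); the helper name and
the timeout handling are made up for illustration:

    static bool
    WaitForSlotBarrierReceived(ProcSignalSlot *slot, uint64 generation,
                               long timeout_ms)
    {
        uint64      oldval;

        oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
        while (oldval < generation)
        {
            /* returns true if the timeout expired without a wakeup */
            if (ConditionVariableTimedSleep(&slot->pss_barrierCV, timeout_ms,
                                            WAIT_EVENT_PROC_SIGNAL_BARRIER))
            {
                ConditionVariableCancelSleep();
                return false;       /* caller aborts the resize */
            }
            oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
        }
        ConditionVariableCancelSleep();
        return true;
    }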
On top of that there are the following changes:
* Shared memory address space is now reserved for future use, making a
clash of shared memory segments (e.g. due to memory allocation)
impossible. There is a new GUC to control how much space to reserve,
which is called max_available_memory -- on the assumption that most of
the time it would make sense to set its value to the total amount of
memory on the machine. I'm open to suggestions regarding the name. (See
the first sketch below this list.)
* There is one more patch to address hugepages remapping. As mentioned
earlier in this thread, the Linux kernel has certain limitations when
it comes to mremap for segments allocated with huge pages. To work
around that, it's possible to replace mremap with a sequence of unmap
and map again, relying on the anon file behind the segment to keep the
memory contents. I haven't found any downsides to this approach so far,
but it makes the anonymous file patch 0007 mandatory. (See the second
sketch below.)
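
To make the reservation idea concrete, here is a minimal standalone
illustration (not the patch itself) of reserving a large address range
up front and committing only a part of it, so that later growth cannot
collide with unrelated mappings. The patch does this for shared segments
backed by anon files; the sizes here are arbitrary:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int
    main(void)
    {
        size_t  reserved = (size_t) 8 * 1024 * 1024 * 1024; /* cf. max_available_memory */
        size_t  initial = 512 * 1024 * 1024;                /* currently used part */
        char   *base;

        /* Reserve address space only, no memory is committed yet. */
        base = mmap(NULL, reserved, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED)
        {
            perror("mmap reserve");
            return 1;
        }

        /* Commit the initial part; growing later just extends this range. */
        if (mprotect(base, initial, PROT_READ | PROT_WRITE) != 0)
        {
            perror("mprotect commit");
            return 1;
        }

        memset(base, 0, initial);
        printf("reserved %zu bytes at %p, committed %zu\n",
               reserved, (void *) base, initial);
        munmap(base, reserved);
        return 0;
    }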
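
And here is a standalone sketch of the mremap() workaround: since the
segment is backed by an anonymous file, it can be unmapped and mapped
again at a larger size without losing its contents. Huge pages and the
fixed address within the reserved range are omitted for brevity, and
memfd_create() stands in for whatever the anon file patch actually uses:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        size_t  old_size = 16 * 1024 * 1024;
        size_t  new_size = 32 * 1024 * 1024;
        int     fd;
        char   *addr;

        /* The anonymous file keeps the contents across map/unmap cycles. */
        fd = memfd_create("shm_segment", 0);
        if (fd < 0 || ftruncate(fd, old_size) != 0)
        {
            perror("memfd_create/ftruncate");
            return 1;
        }

        addr = mmap(NULL, old_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }
        strcpy(addr, "contents that must survive the resize");

        /* Instead of mremap(): unmap, grow the file, map it again. */
        munmap(addr, old_size);
        if (ftruncate(fd, new_size) != 0)
        {
            perror("ftruncate grow");
            return 1;
        }
        addr = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
        {
            perror("mmap grow");
            return 1;
        }

        printf("after remap: %s\n", addr);  /* contents preserved */
        munmap(addr, new_size);
        close(fd);
        return 0;
    }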