Thread: Re: crash with synchronized_standby_slots

Re: crash with synchronized_standby_slots

From

Amit Kapila

Date:

29 November 2024, 09:01:48

On Thu, Nov 28, 2024 at 5:54 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Gabriele just reported a crash when changing synchronized_standby_slots
> under SIGHUP and logging collector working.  The problem apparently is
> that validate_sync_standby_slots is run for the GUC check routine, and
> it requires acquiring an LWLock; but syslogger doesn't have a PGPROC so
> that doesn't work.
>
> Apparently we already have a hack in place against the same problem in
> postmaster (and elsewhere?), by testing that ReplicationSlotCtl is not
> null.  But that hack seems to be incomplete, as the crash here attests.
>

I tried it on my Windows machine and noticed that ReplicationSlotCtl
is NULL for syslogger, so the problem doesn't occur. The reason is
that we don't attach to shared memory in syslogger, so ideally
ReplicationSlotCtl should be NULL. Because we inherit everything
through the fork for Linux systems and then later for processes that
don't want to attach to shared memory, we call PGSharedMemoryDetach()
from postmaster_child_launch(). The PGSharedMemoryDetach won't
reinitialize the memory pointed to by ReplicationSlotCtl, so, it would
be an invalid memory.

> To reproduce, simply start with no synchronized_standby_slots setting
> and logging_collector=on, then change the value in postgresql.conf and
> reload.
>
> One option to fix this might be as attached.  The point here is that
> processes without a PGPROC don't need or care to have a correct setting,
> so we just skip verifying it on those.  AFAICS this is more general than
> the test for ReplicationSlotCtl, so I just removed that one.
>

Yeah, I can't think of a better fix for this.

--
With Regards,
Amit Kapila.

Re: crash with synchronized_standby_slots

From

Alvaro Herrera

Date:

03 December 2024, 20:04:49

On 2024-Nov-29, Amit Kapila wrote:

> I tried it on my Windows machine and noticed that ReplicationSlotCtl
> is NULL for syslogger, so the problem doesn't occur. The reason is
> that we don't attach to shared memory in syslogger, so ideally
> ReplicationSlotCtl should be NULL. Because we inherit everything
> through the fork for Linux systems and then later for processes that
> don't want to attach to shared memory, we call PGSharedMemoryDetach()
> from postmaster_child_launch(). The PGSharedMemoryDetach won't
> reinitialize the memory pointed to by ReplicationSlotCtl, so, it would
> be an invalid memory.

Heh, interesting.  I'm not sure if we should try to do something about
invalid pointers being left around after shmem initialization.  Also, is
this the first GUC check_hook that needs to take an LWLock?

Anyway, I have pushed this.

BTW it occurs to me that there might well be some sort of thundering
herd problem if every process needs to run the check_hook when a SIGHUP
is broadcast, and they'll all be waiting on that particular lwlock and
run the same validation locally again and again.  I bet if you have a
few thousand backends (hi Jakub! [1]) it's problematic.  Maybe we need a
different way to validate the GUC, but I don't know what that might be;
but doing the validation once and storing the result in shmem might be
better.

On 2024-Nov-29, Zhijie Hou (Fujitsu) wrote:

> I can also reproduce this bug and confirmed that the bug is fixed
> after applying the patch. In addition to the regression tests, I also
> manually tested the behavior of the postmaster, walsender, and user
> backend after reloading the configuration, and they all work as
> expected.

Many thanks for testing!

[1] https://postgr.es/m/CAKZiRmwrBjCbCJ433wV5zjvwt_OuY7BsVX12MBKiBu+eNZDm6g@mail.gmail.com

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I apologize for the confusion in my previous responses.
 There appears to be an error."                   (ChatGPT)

Re: crash with synchronized_standby_slots

From

Amit Kapila

Date:

04 December 2024, 08:32:18

On Tue, Dec 3, 2024 at 10:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Nov-29, Amit Kapila wrote:
>
> BTW it occurs to me that there might well be some sort of thundering
> herd problem if every process needs to run the check_hook when a SIGHUP
> is broadcast, and they'll all be waiting on that particular lwlock and
> run the same validation locally again and again.  I bet if you have a
> few thousand backends (hi Jakub! [1]) it's problematic.
>

The lock is taken in shared mode, so, ideally, it shouldn't create a
problem but if it ever creates a problem, we can even skip that check
during validation. The validation will anyway happen later during
replication in StandbySlotsHaveCaughtup(). This check is mostly to
detect the error in GUC early.

>  Maybe we need a
> different way to validate the GUC, but I don't know what that might be;
> but doing the validation once and storing the result in shmem might be
> better.
>

What if that particular GUC changes again? We may need some additional
invalidation mechanism.

--
With Regards,
Amit Kapila.