Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
Date
Msg-id CAEze2WhhTnSLpjGJWGupbxkTp_JdNP6v0mNgpqhi_YkXJa=m6A@mail.gmail.com
In response to Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Wed, 22 Apr 2026 at 21:05, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2026-04-22 13:21:02 +0200, Matthias van de Meent wrote:
> > If the PSB is emitted (and signaled to checkpointer) before the
> > checkpointer has registered its SIGUSR1 handler, then the checkpointer
> > won't receive the notice to check its procsignal slots, it won't
> > notice the updated procsignal flags, and it won't process the PSB; not
> > until it receives a new SIGUSR1.
> >
> > Signals are sent to all processes that have their procsignal pss_pid
> > set, which is true for every process which has called ProcSignalInit,
> > which for the checkpointer (like other aux processes) happens in
> > AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
> > processes) calls AuxiliaryProcessMainCommon before registering its
> > signal handlers, creating a small window in time where signals are
> > sent, but not handled.
>
> Hm. Have we confirmed this happens?
>
> CheckpointerMain() is called with all signals masked, so it should be ok for
> the signal handler to only be set up after AuxiliaryProcessMainCommon(), as
> long as it happens before [...]

Yeah, that was a misidentification of the exact race that caused the issue.
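
To illustrate the point about masked signals: a signal that arrives while it
is blocked is not lost. It stays pending and is delivered once it is
unblocked, to whichever handler is installed at that moment. The following is
a minimal standalone POSIX sketch of that behaviour, not PostgreSQL code; the
names handle_sigusr1 and deliver_after_unblock are made up for illustration,
and it assumes a single-threaded process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigusr1 = 0;

/* Illustrative handler: only records that SIGUSR1 was delivered. */
static void
handle_sigusr1(int signo)
{
    (void) signo;
    got_sigusr1 = 1;
}

/*
 * Send SIGUSR1 to ourselves while it is blocked and no handler exists yet,
 * then install a handler and unblock.  Returns 1 if the handler ran.
 */
static int
deliver_after_unblock(void)
{
    sigset_t    usr1, pending;
    struct sigaction sa;

    got_sigusr1 = 0;

    /* Step 1: block SIGUSR1, as if we had been started with it masked. */
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &usr1, NULL) != 0)
        return 0;

    /* Step 2: the signal arrives before any handler has been installed. */
    if (kill(getpid(), SIGUSR1) != 0)
        return 0;

    /* It is pending, and neither the default action nor a handler has run. */
    if (sigpending(&pending) != 0)
        return 0;
    assert(sigismember(&pending, SIGUSR1) == 1);
    assert(got_sigusr1 == 0);

    /* Step 3: install the handler while the signal is still blocked. */
    sa.sa_handler = handle_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return 0;

    /* Step 4: unblock; the pending signal is delivered to the new handler. */
    if (sigprocmask(SIG_UNBLOCK, &usr1, NULL) != 0)
        return 0;

    return got_sigusr1 == 1;
}
```

Calling deliver_after_unblock() from main() should return 1 on a POSIX
system. So the order "initialize, then install handlers" is fine by itself,
as long as the unblock comes last.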

On Tue, 28 Apr 2026 at 21:28, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> >
> > Hello Sawada-san,
> >
> > 24.04.2026 20:52, Masahiko Sawada wrote:
> >
> > Right. The postmaster blocks all signals before starting child process
> > as the following comment explains:
> >
> >      /*
> >       * We start postmaster children with signals blocked.  This allows them to
> >       * install their own handlers before unblocking, to avoid races where they
> >       * might run the postmaster's handler and miss an important control
> >       * signal. With more analysis this could potentially be relaxed.
> >       */
> >      sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
> >
> > Investigating the issue, I found there is a race condition between the
> > procsignal initialization and emitting signal barrier that could be
> > the cause of this issue. Imagine the following scenario:

Ah, that'd be it indeed. Thanks!

> I've attached a patch to address the issue. I haven't verified it
> across all versions yet, but I suspect it exists in the stable
> branches as well. Previously, the issue rarely occurred because
> EmitProcSignalBarrier() was only used for smgr invalidation. However,
> now that we use signal barriers for online wal_level changes and
> checksum status updates, this race condition is likely to be
> encountered more frequently.

Yes, I think the boot process with the xlog_logical_info barrier is
more likely to hit this issue, as indicated by the two known cases
detected in various CI jobs; though it could also be that a lockup on
the new barrier is just exceptionally bad for system stability.

As for the patches:
v1-0001 -- LGTM.

0001 (upthread): LGTM, but I'd also suggest adding some code to make
sure that we're actually receiving procsignals by the time we
initialize the Logical/Checksum subsystems, since those need to
process shared state changes by responding to procsignals; see the
attached patch. smgr's procsignal doesn't really depend on shared
memory state, so I've kept that out of my patch.
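
For readers without the attachment, here is a generic sketch of what a
"are we able to receive this signal yet?" check can look like in plain POSIX
C. It is an assumption-laden illustration, not the contents of the attached
patch and not a PostgreSQL API: the helper names are hypothetical, and it only
checks that SIGUSR1 has a real handler installed (not default, not ignored).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Illustrative handler that does nothing. */
static void
noop_sigusr1_handler(int signo)
{
    (void) signo;
}

/* Hypothetical helper: install the no-op handler for SIGUSR1. */
static int
install_noop_sigusr1_handler(void)
{
    struct sigaction sa;

    sa.sa_handler = noop_sigusr1_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGUSR1, &sa, NULL);
}

/*
 * Hypothetical helper: returns 1 if SIGUSR1 currently has a handler
 * installed, 0 if its disposition is still SIG_DFL or SIG_IGN.
 * Passing NULL as the new action only queries the current one.
 */
static int
sigusr1_handler_installed(void)
{
    struct sigaction cur;

    if (sigaction(SIGUSR1, NULL, &cur) != 0)
        return 0;
    return cur.sa_handler != SIG_DFL && cur.sa_handler != SIG_IGN;
}

/*
 * Hypothetical subsystem init: refuse to proceed (in assert-enabled
 * builds) if we could not yet react to SIGUSR1.
 */
static void
init_signal_dependent_subsystem(void)
{
    assert(sigusr1_handler_installed());
    /* ... subsystem setup that relies on SIGUSR1 delivery goes here ... */
}
```

A check like this only catches ordering mistakes inside one process; it says
nothing about the cross-process race discussed above.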


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

Attachment
