On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Sawada-san,
>
> 24.04.2026 20:52, Masahiko Sawada wrote:
>
> Right. The postmaster blocks all signals before starting child process
> as the following comment explains:
>
> /*
>  * We start postmaster children with signals blocked. This allows them to
>  * install their own handlers before unblocking, to avoid races where they
>  * might run the postmaster's handler and miss an important control
>  * signal. With more analysis this could potentially be relaxed.
>  */
> sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
>
> Investigating the issue, I found there is a race condition between the
> procsignal initialization and emitting signal barrier that could be
> the cause of this issue. Imagine the following scenario:
>
> 1. In ProcSignalInit(), the checkpointer initializes its
> slot->pss_barrierGeneration with the global generation.
> 2. In EmitProcSignalBarrier(), the startup checks the checkpointer's
> procsignal slot but it skips emitting the signal as slot->pss_pid is
> still 0. It can happen even though the checkpointer holds a spinlock
> on its slot during the initialization because the first pid check is
> done without a spinlock acquisition.
> 3. The checkpointer sets its pid to slot->pss_pid and releases the spin lock.
> 4. In WaitForProcSignalBarrier(), the startup checks the
> checkpointer's procsignal slot that has already initialized the
> pss_barrierGeneration, and waits for it to be updated. However, the
> checkpointer never updates its barrier generation as it doesn't get
> the signal.
>
>
> Thank you for the investigation and explanation of the issue!
>
> I've been puzzled by a buildfarm failure [1] with such symptoms for a while
> and even reproduced it locally once, but couldn't gather more information
> that time. But now that you have described the scenario, I can easily
> reproduce the same test failure with:
> --- a/src/backend/storage/ipc/procsignal.c
> +++ b/src/backend/storage/ipc/procsignal.c
> @@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
>      if (cancel_key_len > 0)
>          memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
>      slot->pss_cancel_key_len = cancel_key_len;
> +    pg_usleep(10000);
>      pg_atomic_write_u32(&slot->pss_pid, MyProcPid);

Thank you for testing this.

I've attached a patch to address the issue. I haven't verified it
across all versions yet, but I suspect it exists in the stable
branches as well. Previously, the issue rarely occurred because
EmitProcSignalBarrier() was only used for smgr invalidation. However,
now that we use signal barriers for online wal_level changes and
checksum status updates, this race condition is likely to be
encountered more frequently.
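
For anyone following along, the interleaving in steps 1-4 above can be
replayed with a small single-threaded model. This is only a sketch, not
the PostgreSQL code and not the attached patch: ModelSlot, the model_*
functions, the pid 4242, and the use of UINT64_MAX as the "unused slot"
generation are all hypothetical names and assumptions chosen for
illustration. It replays the steps in a fixed order instead of using
real processes, signals, or spinlocks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a procsignal slot. */
typedef struct ModelSlot
{
	_Atomic uint32_t pid;			/* 0 = pid not published yet */
	_Atomic uint64_t generation;	/* UINT64_MAX = unused (model assumption) */
	bool		signalled;			/* did the emitter signal this slot? */
} ModelSlot;

static _Atomic uint64_t model_global_generation;

/* Step 1: the new process copies the global generation; pid is still 0. */
static void
model_init_begin(ModelSlot *slot)
{
	atomic_store(&slot->generation, atomic_load(&model_global_generation));
}

/* Step 3: the new process publishes its pid. */
static void
model_init_finish(ModelSlot *slot, uint32_t pid)
{
	assert(pid != 0);
	atomic_store(&slot->pid, pid);
}

/* Step 2: bump the generation; signal the slot only if its pid is set. */
static uint64_t
model_emit(ModelSlot *slot)
{
	uint64_t	gen = atomic_fetch_add(&model_global_generation, 1) + 1;

	if (atomic_load(&slot->pid) != 0)
		slot->signalled = true;
	return gen;
}

/* Step 4: the waiter hangs on a slot that is behind and was never signalled. */
static bool
model_waiter_would_hang(ModelSlot *slot, uint64_t gen)
{
	return atomic_load(&slot->generation) < gen && !slot->signalled;
}

/* Replay the steps; returns true if the waiter would wait forever. */
bool
model_run(bool emit_during_init)
{
	ModelSlot	slot;
	uint64_t	gen;

	atomic_init(&slot.pid, 0);
	atomic_init(&slot.generation, UINT64_MAX);
	slot.signalled = false;
	atomic_store(&model_global_generation, 1);

	model_init_begin(&slot);				/* step 1 */
	if (emit_during_init)
	{
		gen = model_emit(&slot);			/* step 2: sees pid == 0, skips */
		model_init_finish(&slot, 4242);		/* step 3 */
	}
	else
	{
		model_init_finish(&slot, 4242);		/* init completes first */
		gen = model_emit(&slot);			/* emitter sees the pid, signals */
	}
	return model_waiter_would_hang(&slot, gen);	/* step 4 */
}
```

In this model, model_run(true) returns true (the slot is left behind
the new generation and was never signalled), while model_run(false)
returns false because the emitter sees the published pid.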
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com