Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
Msg-id CAD21AoAiPyHqCT2LrBJoMT4PYmr=QaKzzdLL9EM=B-4whV47xA@mail.gmail.com
In response to Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process  (Alexander Lakhin <exclusion@gmail.com>)
Responses Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
List pgsql-hackers
On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Dear Sawada-san,
>
> 28.04.2026 22:27, Masahiko Sawada wrote:
> > On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> >> I've been puzzled by a buildfarm failure [1] with such symptoms for a while
> >> and even reproduced it locally once, but couldn't gather more information
> >> that time. But now that you have described the scenario, I can easily
> >> reproduce the same test failure with:
> >> --- a/src/backend/storage/ipc/procsignal.c
> >> +++ b/src/backend/storage/ipc/procsignal.c
> >> @@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
> >>          if (cancel_key_len > 0)
> >>                  memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
> >>          slot->pss_cancel_key_len = cancel_key_len;
> >> +pg_usleep(10000);
> >>          pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
> > Thank you for testing this.
> >
> > I've attached a patch to address the issue. I haven't verified it
> > across all versions yet, but I suspect it exists in the stable
> > branches as well...
>
> Thank you for the fix! It works for me too.
>
> I was wondering why is that failure the only one of this kind on buildfarm
> (in last two years, at least), so I've tried to reproduce it on
> REL_18_STABLE... and failed.
>
> Then I've bisected it on the master branch and found (your) commit that
> introduced this behavior: 67c20979c from 2025-12-23.
>

I've confirmed that this race condition is present from v15 through
master. In v14, we have the procsignal barrier code but don't use
it anywhere. In v18 and older, it could happen when executing DROP
DATABASE, DROP TABLESPACE, etc., whereas in master it could happen
in more cases as we're using procsignal barriers in more places. In any
case, if a process emits a signal barrier while another process is
between initializing slot->pss_barrierGeneration and initializing
slot->pss_pid, the subsequent
WaitForProcSignalBarrier() ends up waiting for that process forever.
So I think the patch should be backpatched to v15. Please review these
patches.
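
To illustrate the window described above, here is a minimal
single-threaded sketch in C. It is a toy model, not the procsignal.c
code: every type and function name (ToySlot, toy_emit_barrier, and so
on) is hypothetical, and it assumes that the emitter only flags slots
whose pid is already set, while the waiter checks the generation of
every slot and treats an unused slot as already caught up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NUM_SLOTS 4

/* Hypothetical stand-in for a per-process signal slot. */
typedef struct ToySlot
{
    uint32_t pid;             /* 0 means "not advertised yet" */
    uint64_t generation;      /* last barrier generation absorbed */
    bool     barrier_pending; /* set by the emitter, cleared by the owner */
} ToySlot;

typedef struct ToyProcSignal
{
    uint64_t global_generation;
    ToySlot  slots[TOY_NUM_SLOTS];
} ToyProcSignal;

/* Unused slots carry the maximum generation so waiters skip them. */
static void
toy_init(ToyProcSignal *ps)
{
    ps->global_generation = 0;
    for (int i = 0; i < TOY_NUM_SLOTS; i++)
    {
        ps->slots[i].pid = 0;
        ps->slots[i].generation = UINT64_MAX;
        ps->slots[i].barrier_pending = false;
    }
}

/* Setup step 1: the new process copies the current global generation. */
static void
toy_slot_init_generation(ToyProcSignal *ps, int i)
{
    assert(i >= 0 && i < TOY_NUM_SLOTS);
    ps->slots[i].generation = ps->global_generation;
}

/* Setup step 2: the new process advertises its pid. */
static void
toy_slot_init_pid(ToyProcSignal *ps, int i, uint32_t pid)
{
    assert(i >= 0 && i < TOY_NUM_SLOTS);
    ps->slots[i].pid = pid;
}

/* Emitter: bump the generation and flag only slots that have a pid. */
static uint64_t
toy_emit_barrier(ToyProcSignal *ps)
{
    uint64_t gen = ++ps->global_generation;

    for (int i = 0; i < TOY_NUM_SLOTS; i++)
    {
        if (ps->slots[i].pid != 0)
            ps->slots[i].barrier_pending = true;
    }
    return gen;
}

/* Slot owner: absorb a barrier, but only if one was flagged. */
static void
toy_process_barrier(ToyProcSignal *ps, int i)
{
    assert(i >= 0 && i < TOY_NUM_SLOTS);
    if (ps->slots[i].barrier_pending)
    {
        ps->slots[i].barrier_pending = false;
        ps->slots[i].generation = ps->global_generation;
    }
}

/* Waiter: true if every slot has reached the requested generation. */
static bool
toy_wait_would_finish(const ToyProcSignal *ps, uint64_t gen)
{
    for (int i = 0; i < TOY_NUM_SLOTS; i++)
    {
        if (ps->slots[i].generation < gen)
            return false;
    }
    return true;
}
```

In this model, calling toy_emit_barrier() between
toy_slot_init_generation() and toy_slot_init_pid() leaves the slot with
a stale generation and no pending flag, so toy_wait_would_finish()
keeps returning false no matter how often the owner calls
toy_process_barrier(). Emitting after both setup steps lets the owner
absorb the barrier and the wait completes.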

FYI, I found that we had a similar report [1] last year, though I'm not
sure it hit exactly the same issue.

Regards,

[1] https://www.postgresql.org/message-id/CAGQGyDTaVkG3DbTEbtyxZLM48jMZR2BcvTeYBsWLV5HvwSb+2Q@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
