Re: Parallel worker hangs while handling errors. - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Parallel worker hangs while handling errors.
Date
Msg-id 2754818.1596814605@sss.pgh.pa.us
Whole thread Raw
In response to Re: Parallel worker hangs while handling errors.  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Parallel worker hangs while handling errors.  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 7, 2020 at 5:05 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
>> We intend to unblock SIGQUIT before sigsetjmp() in places like
>> bgwriter, checkpointer, walwriter and walreceiver, but we only call
>> sigdelset(&BlockSig, SIGQUIT);, Without  PG_SETMASK(&BlockSig);, seems
>> like we are not actually unblocking SIQUIT and quickdie() will never
>> get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
>> 0){....}

> Yeah, maybe so. This code has been around for a long time and I'm not
> sure what the thought process behind it was, but I don't see a flaw in
> your analysis here.

I think that code is the way it is intentionally: the idea is to not
accept any signals until we reach the explicit "PG_SETMASK(&UnBlockSig);"
call further down, between the sigsetjmp stanza and the main loop.
The sigdelset call, just like the adjacent pqsignal calls, is
preparatory setup; it does not intend to allow anything to happen
immediately.

In general, you don't want to accept signals in that area because the
process state may not be fully set up yet.  You could argue that the
SIGQUIT handler has no state dependencies, making it safe to accept
SIGQUIT earlier during startup of one of these processes, and likewise
for them to accept SIGQUIT during error recovery.  But barring actual
evidence of a problem with slow SIGQUIT response in these areas I'm more
inclined to leave well enough alone.  Changing this would add hazards,
e.g. if somebody ever changes the behavior of the SIGQUIT handler, so
I'd want some concrete evidence of a benefit.  It seems fairly
irrelevant to the problem at hand with bgworkers, anyway.

As for said problem, I concur with Robert that the v4 patch seems pretty
dubious; it's adding a lot of barely-thought-out mechanism for no
convincing gain in safety.  I think the v1 patch was more nearly the right
thing, except that the unblock needs to happen a tad further down, as
attached (this is untested but certainly it should pass any test that v1
passed).  I didn't do anything about rewriting the bogus comment just
above the sigsetjmp call, but I agree that that should happen too.

            regards, tom lane

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85434..dd22241600 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -753,7 +753,14 @@ StartBackgroundWorker(void)
         /* Prevent interrupts while cleaning up */
         HOLD_INTERRUPTS();

-        /* Report the error to the server log */
+        /*
+         * sigsetjmp will have blocked all signals, but we may need to accept
+         * signals while communicating with our parallel leader.  Once we've
+         * done HOLD_INTERRUPTS() it should be safe to unblock signals.
+         */
+        BackgroundWorkerUnblockSignals();
+
+        /* Report the error to the parallel leader and the server log */
         EmitErrorReport();

         /*

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Handing off SLRU fsyncs to the checkpointer
Next
From: Robert Haas
Date:
Subject: Re: recovering from "found xmin ... from before relfrozenxid ..."