Re: Parallel worker hangs while handling errors. - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Parallel worker hangs while handling errors.
Date
Msg-id CALj2ACXEVxRrmrpsZxLEMKtRQdqOWQWwm=zRjPyO6_uU1KZLnQ@mail.gmail.com
Whole thread Raw
In response to Re: Parallel worker hangs while handling errors.  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Parallel worker hangs while handling errors.
List pgsql-hackers
On Fri, Aug 7, 2020 at 1:34 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > The v4 patch looks good to me. Hang is not seen, make check and make
> > check-world passes. I moved this to the committer for further review
> > in https://commitfest.postgresql.org/29/2636/.
>
> I don't think I agree with this approach. In particular, I don't
> understand the rationale for unblocking only SIGUSR1. Above, Vignesh
> says that he feels that unblocking only that signal would be the right
> approach, but no reason is given. I have two reasons why I suspect
> it's not the right approach. One, it doesn't seem to be what we do
> elsewhere; the only existing cases where we have special handling for
> particular signals are SIGQUIT and SIGPIPE, and those places have
> comments explaining the reason why they are handled in a special way.
>

We intend to unblock SIGQUIT before sigsetjmp() in places like
bgwriter, checkpointer, walwriter and walreceiver, but we only call
sigdelset(&BlockSig, SIGQUIT);, Without  PG_SETMASK(&BlockSig);, seems
like we are not actually unblocking SIQUIT and quickdie() will never
get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
0){....}, if postmaster sends a SIGQUIT while these processes are
doing clean up tasks in sigsetjmp(), it will not be received, and the
postmaster later sends SIGKLL to kill from below places.

        /*
         * If we already sent SIGQUIT to children and they are slow to shut
         * down, it's time to send them SIGKILL.  This doesn't happen
         * normally, but under certain conditions backends can get stuck while
         * shutting down.  This is a last measure to get them unwedged.
         *
         * Note we also do this during recovery from a process crash.
         */
        if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
            AbortStartTime != 0 &&
            (now - AbortStartTime) >= SIGKILL_CHILDREN_AFTER_SECS)
        {
            /* We were gentle with them before. Not anymore */
            TerminateChildren(SIGKILL);
            /* reset flag so we don't SIGKILL again */
            AbortStartTime = 0;
        }

Shouldn't we call PG_SETMASK(&BlockSig); to make it effective?

Am I missing anything here?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: "k.jamison@fujitsu.com"
Date:
Subject: RE: [Patch] Optimize dropping of relation buffers using dlist
Next
From: Amit Langote
Date:
Subject: Re: [POC] Fast COPY FROM command for the table with foreign partitions