Re: We shouldn't signal process groups with SIGQUIT - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: We shouldn't signal process groups with SIGQUIT
Date
Msg-id CA+hUKGJvK0Py8BJar+HVfPUUcERLCJpnYhztpRz6cKhq0svp+w@mail.gmail.com
In response to Re: We shouldn't signal process groups with SIGQUIT  (Michael Paquier <michael@paquier.xyz>)
Responses Re: We shouldn't signal process groups with SIGQUIT  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Tue, Feb 28, 2023 at 5:45 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Feb 14, 2023 at 12:47:12PM -0800, Andres Freund wrote:
> > Just naively hacking this behaviour change into the current code, would yield
> > sending SIGQUIT to postgres, and then SIGTERM to the whole process
> > group. Which seems like a reasonable order?  quickdie() should _exit()
> > immediately in the signal handler, so we shouldn't get to processing the
> > SIGTERM.  Even if both signals are "reacted to" at the same time, possibly
> > with SIGTERM being processed first, the SIGQUIT handler should be executed
> > long before the next CFI().
>
> I have been poking a bit at that, and did a change as simple as this
> one in signal_child():
>  #ifdef HAVE_SETSID
> +   if (signal == SIGQUIT)
> +       signal = SIGTERM;
>
> From what I can see, SIGTERM is actually received by the backends
> before SIGQUIT, and I can also see that the backends have enough room
> to process CFIs in some cases, especially short queries, even before
> reaching quickdie() and its exit().  So the window between SIGTERM and
> SIGQUIT is not as long as one would think.

Pop quiz: in what order do signal handlers run, if SIGQUIT and SIGTERM
are both pending when a process wakes up or unblocks?  I *think* the
answer on all typical implementations that follow conventions going
back to ancient Unix (but not standardised, so you can't count on
it!*), is that pending signals are delivered in order of the bits in
the pending signals bitmap from lowest to highest, and SIGQUIT <
SIGTERM (again: tradition, not standard), and then:

1.  If the handlers block each other via their sa_mask so that they
are serialised (note: ours don't) then you'll see the SIGQUIT handler
run and then the SIGTERM handler, for example if you do kill(self,
SIGTERM), kill(self, SIGQUIT), sigprocmask(SIG_SETMASK, &unblock_all,
NULL).

2.  If the handlers don't block each other (our case), then their
stack frames will be set up in that order (you might say they start in
that order but are immediately interrupted by the next one before they
can do anything), so they then run in the reverse order, SIGTERM
first.  I guess that is what you saw?

In theory we could straighten this out by checking what else is pending
and imposing our own priority, if that were a problem, but there
is something I don't understand: you said we could handle SIGTERM and
then make it all the way to CFI() (= non-signal handler code) before
handling a SIGQUIT that was sent first.  Huh... what am I missing?  I
thought the only risk was handlers running in the opposite of send
order because they 'overlapped', not non-handler code being allowed to
run in between.

*The standard explicitly says that delivery order is unspecified,
except for realtime signals, which we aren't using.


