Re: SIGQUIT handling, redux - Mailing list pgsql-hackers

From Andres Freund
Subject Re: SIGQUIT handling, redux
Date
Msg-id 20200909192620.dylomoxknmt6rdb3@alap3.anarazel.de
Whole thread Raw
In response to SIGQUIT handling, redux  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: SIGQUIT handling, redux
List pgsql-hackers
Hi,

On 2020-09-08 17:39:24 -0400, Tom Lane wrote:
> Of course, this is only safe if the SIGQUIT handler is safe to be invoked
> anywhere, so I did a quick survey of backend signal handlers to see if
> that is true.  I immediately found pgarch_exit() which surely is not.  It
> turns out that Horiguchi-san already noticed that and proposed to fix it,
> within the seemingly entirely unrelated patch series for a shared-memory
> based stats collector (see patch 0001 in [2]).  I think we should go ahead
> and commit that, and maybe even back-patch it.  There seems no good reason
> for the archiver to treat SIGQUIT as nonfatal when other postmaster
> children do not.

+1


> Another thing I found is startup_die() in postmaster.c, which catches
> SIGQUIT during the authentication phase of a new backend.  This calls
> proc_exit(1), which (a) is unsafe on its face, and (b) potentially
> demotes a SIGQUIT into something that will not cause the parent postmaster
> to perform a DB-wide restart.  There seems no good reason for that
> behavior, either.  I propose that we use SignalHandlerForCrashExit
> for SIGQUIT here too.

Yikes, yea, that seems like an important change.


I wish startup_die() weren't named startup_ - every single time I see
the name I think it's about the startup process...


I think StartupPacketTimeoutHandler is another case?



> In passing, it's worth noting that startup_die() isn't really much safer
> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
> those is that code that applies BlockSig will at least manage to block the
> former.

Which is pretty unconvincing...

I wonder if we could at least *try* to rely on CFR() for a while. It'd
not be pretty, but probably doable, to set up a timer inside the signal
handler and try to exit using normal mechanisms for a while (some
overlap with authentication timeout).

That'd leave the question of what we do once that timer expires. There's
quite a few possible ways that could reproducibly happen, e.g. if DNS
lookups are required during auth.

The long term correct way to handle this would obviously be to
restructure everything that happens covered by startup_die() in a
non-blocking manner and just rely on CFR(). But that's a tall order to
get done anytime soon, particularly things like DNS are IIRC pretty hard
without relying on custom libraries.

Shorter term I don't really see an alternative to escalating to SIGQUIT
which is obviously terrible.


> So it's slightly tempting to drop startup_die() altogether in favor of
> using SignalHandlerForCrashExit for the SIGTERM-during-auth case too.
> However, that'd have the effect of causing a "fast shutdown" to get
> converted to a panic if it happens while any sessions are in
> authentication phase, so I think this cure is likely worse than the
> disease.

Indeed.


> I don't want to give up trying to send a message to the client.

That still doesn't make much sense to me. The potential for hanging
(e.g. inside malloc) is so much worse than not sending a message... And
we already infer things about the server dying in libpq when the
connection just closes (which I am admittedly also not happy about). But
I also think we can reduce the risk a bit, see below.

I only had one coffee so far (and it looks like the sun has died
outside), so maybe I'm just slow: But, uh, we don't currently send a
message startup_die(), right?

So that part is about quickdie()?


> Worth reading in connection with this is the thread that led up to
> commits 8e19a8264 et al [3].  We had a long discussion about how
> quickdie() and startup_die() might be made safe(r), without any
> real consensus emerging about any alternatives being better than the
> status quo.  I still don't have an improvement idea for quickdie();

I still think that we should at least mitigate the risk to hang inside
quickdie() by at least disabling translations (since the rest happens
inside the pre-allocated error memory context, which should lower the
risk).  And perhaps do similarly for the memory required for encryption,
if enabled, and where possible.


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch
Next
From: Tomas Vondra
Date:
Subject: Re: WIP: BRIN multi-range indexes