Re: SIGQUIT handling, redux - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: SIGQUIT handling, redux |
Date | |
Msg-id | 20200909192620.dylomoxknmt6rdb3@alap3.anarazel.de Whole thread Raw |
In response to | SIGQUIT handling, redux (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: SIGQUIT handling, redux
|
List | pgsql-hackers |
Hi, On 2020-09-08 17:39:24 -0400, Tom Lane wrote: > Of course, this is only safe if the SIGQUIT handler is safe to be invoked > anywhere, so I did a quick survey of backend signal handlers to see if > that is true. I immediately found pgarch_exit() which surely is not. It > turns out that Horiguchi-san already noticed that and proposed to fix it, > within the seemingly entirely unrelated patch series for a shared-memory > based stats collector (see patch 0001 in [2]). I think we should go ahead > and commit that, and maybe even back-patch it. There seems no good reason > for the archiver to treat SIGQUIT as nonfatal when other postmaster > children do not. +1 > Another thing I found is startup_die() in postmaster.c, which catches > SIGQUIT during the authentication phase of a new backend. This calls > proc_exit(1), which (a) is unsafe on its face, and (b) potentially > demotes a SIGQUIT into something that will not cause the parent postmaster > to perform a DB-wide restart. There seems no good reason for that > behavior, either. I propose that we use SignalHandlerForCrashExit > for SIGQUIT here too. Yikes, yea, that seems like an important change. I wish startup_die() weren't named startup_ - every single time I see the name I think it's about the startup process... I think StartupPacketTimeoutHandler is another case? > In passing, it's worth noting that startup_die() isn't really much safer > for SIGTERM than it is for SIGQUIT; the only argument for distinguishing > those is that code that applies BlockSig will at least manage to block the > former. Which is pretty unconvincing... I wonder if we could at least *try* to rely on CFR() for a while. It'd not be pretty, but probably doable, to set up a timer inside the signal handler and try to exit using normal mechanisms for a while (some overlap with authentication timeout). That'd leave the question of what we do once that timer expires. There's quite a few possible ways that could reproducibly happen, e.g. if DNS lookups are required during auth. The long term correct way to handle this would obviously be to restructure everything that happens covered by startup_die() in a non-blocking manner and just rely on CFR(). But that's a tall order to get done anytime soon, particularly things like DNS are IIRC pretty hard without relying on custom libraries. Shorter term I don't really see an alternative to escalating to SIGQUIT which is obviously terrible. > So it's slightly tempting to drop startup_die() altogether in favor of > using SignalHandlerForCrashExit for the SIGTERM-during-auth case too. > However, that'd have the effect of causing a "fast shutdown" to get > converted to a panic if it happens while any sessions are in > authentication phase, so I think this cure is likely worse than the > disease. Indeed. > I don't want to give up trying to send a message to the client. That still doesn't make much sense to me. The potential for hanging (e.g. inside malloc) is so much worse than not sending a message... And we already infer things about the server dying in libpq when the connection just closes (which I am admittedly also not happy about). But I also think we can reduce the risk a bit, see below. I only had one coffee so far (and it looks like the sun has died outside), so maybe I'm just slow: But, uh, we don't currently send a message startup_die(), right? So that part is about quickdie()? > Worth reading in connection with this is the thread that led up to > commits 8e19a8264 et al [3]. We had a long discussion about how > quickdie() and startup_die() might be made safe(r), without any > real consensus emerging about any alternatives being better than the > status quo. I still don't have an improvement idea for quickdie(); I still think that we should at least mitigate the risk to hang inside quickdie() by at least disabling translations (since the rest happens inside the pre-allocated error memory context, which should lower the risk). And perhaps do similarly for the memory required for encryption, if enabled, and where possible. Greetings, Andres Freund
pgsql-hackers by date: