Re: SIGQUIT handling, redux - Mailing list pgsql-hackers

From Andres Freund
Subject Re: SIGQUIT handling, redux
Date
Msg-id 20200909202201.unpjmshshu7sge6i@alap3.anarazel.de
In response to Re: SIGQUIT handling, redux  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: SIGQUIT handling, redux  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2020-09-09 16:09:00 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I wish startup_die() weren't named startup_ - every single time I see
> > the name I think it's about the startup process...
> 
> We could call it startup_packet_die or something?

Yea, I think that'd be good.


> > I think StartupPacketTimeoutHandler is another case?
> 
> Yeah.  Although it's a lot less risky, since if the timeout is reached
> we're almost certainly waiting for client input.

An adversary could control that to a significant degree - but ordinarily
I agree...


> >> In passing, it's worth noting that startup_die() isn't really much safer
> >> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
> >> those is that code that applies BlockSig will at least manage to block the
> >> former.
> 
> > Which is pretty unconvincing...
> 
> Agreed, it'd be nice if this were less shaky.  On the other hand,
> we've seen darn few complaints traceable to this AFAIR.  I'm not
> really sure it's worth putting a lot of effort into.

Not sure how many would plausibly reach us though.  A common reaction
will likely just be to force-restart the server, not to fully
investigate the issue. Particularly because it'll often happen once
something has already gone wrong...
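
For readers less familiar with the mechanism, the SIGTERM/SIGQUIT
distinction quoted above can be sketched roughly as follows. This is a
minimal illustration, not PostgreSQL's code: build_block_mask() and
run_with_signals_blocked() are hypothetical names, and the only thing
taken from the thread is the assumption that the blocking mask covers
SIGTERM but leaves SIGQUIT deliverable.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * Hypothetical helper: build a mask that blocks every signal except
 * SIGQUIT, mirroring the behaviour described in the quoted text.
 */
int
build_block_mask(sigset_t *mask)
{
    if (sigfillset(mask) != 0)
        return -1;
    return sigdelset(mask, SIGQUIT);
}

/*
 * Hypothetical helper: run a critical section with the mask applied,
 * then restore the previous mask. SIGTERM stays pending while the
 * section runs; SIGQUIT can still interrupt it.
 */
int
run_with_signals_blocked(void (*critical_section)(void))
{
    sigset_t block;
    sigset_t saved;

    if (build_block_mask(&block) != 0)
        return -1;
    if (sigprocmask(SIG_BLOCK, &block, &saved) != 0)
        return -1;

    critical_section();

    return sigprocmask(SIG_SETMASK, &saved, NULL);
}
```

With such a mask in place, a SIGTERM arriving during the critical
section is held until the old mask is restored, whereas a SIGQUIT
handler can still run at any point inside it.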



> >> I don't want to give up trying to send a message to the client.
> 
> > That still doesn't make much sense to me. The potential for hanging
> > (e.g. inside malloc) is so much worse than not sending a message...
> 
> We see backends going through this code on a very regular basis in the
> buildfarm, but complete hangs are rare as can be.  I think you
> overestimate the severity of the problem.

I don't think the BF exercises the problematic paths to a significant
degree. It's mostly local socket connections, and where not it's
localhost. There's no slow DNS, no more complicated authentication
methods, no packet loss. How often do we ever actually end up even
getting close to any of the paths but immediate shutdowns? And in the
SIGQUIT path, how often do we end up in the SIGKILL path, masking
potential deadlocks?
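
To make the malloc concern concrete: a handler that interrupts the main
code inside malloc() and then calls anything that allocates can block
on a lock the interrupted code already holds. A minimal sketch of two
handler shapes that avoid this, using only async-signal-safe
operations, might look like the following. All names here are
hypothetical, and none of this is the actual startup_die() or
quickdie() code.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical flag for illustration; not a PostgreSQL variable. */
static volatile sig_atomic_t shutdown_requested = 0;

/*
 * Shape 1: only record that the signal arrived. The main code checks
 * the flag later, at a point where it is safe to clean up.
 */
void
flag_only_handler(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/*
 * Shape 2: leave immediately. _exit() is async-signal-safe; exit()
 * is not, because it runs atexit callbacks and flushes stdio buffers.
 */
void
exit_now_handler(int signo)
{
    (void) signo;
    _exit(2);
}

/* Install a handler with sigaction(); returns 0 on success. */
int
install_handler(int signo, void (*handler)(int))
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}
```

Neither shape can tell the client anything from inside the handler,
which is exactly the trade-off being argued about here.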


> > I only had one coffee so far (and it looks like the sun has died
> > outside), so maybe I'm just slow: But, uh, we don't currently send a
> > message startup_die(), right?
> > So that part is about quickdie()?
> 
> Right.  Note that startup_die() is pre-authentication, so I'm doubtful
> that we should tell the would-be client anything about the state of
> the server at that point, even ignoring these risk factors.  (I'm a
> bit inclined to remove the comment suggesting that'd be desirable.)

Yea, I think just putting in an editorialized version of your paragraph
would make sense.

Greetings,

Andres Freund


