Re: Better client reporting for "immediate stop" shutdowns - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Better client reporting for "immediate stop" shutdowns
Date
Msg-id CALj2ACV4+N5iOUFSv7SZzGMFBUo8Xoqb9QJmz+2WoHAESUSPUg@mail.gmail.com
Whole thread Raw
In response to Re: Better client reporting for "immediate stop" shutdowns  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Better client reporting for "immediate stop" shutdowns
List pgsql-hackers
On Tue, Dec 22, 2020 at 11:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > On Tue, Dec 22, 2020 at 2:29 AM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> >> If I'm correct, quickdie() doesn't access any shared memory because
> >> one of the reason we can be in quickdie() is when the shared memory
> >> itself is corrupted(the comment down below on why we don't call
> >> roc_exit() or atexit() says), in such case, will GetQuitSignalReason()
> >> have some problem in accessing the shared memory i.e. +    return
> >> PMSignalState->sigquit_reason;?
>
> It couldn't really have any problem in physically accessing the field;
> we never detach from the main shared memory block.  The risk that needs
> to be thought about is that shared memory contains garbage --- for
> example, imagine that a failing process scribbled in the wrong part of
> shared memory before crashing.  So the hazard here is that there's a
> small chance that sigquit_reason will contain the wrong value, which
> would cause the patch to print a misleading message, or more likely
> not print anything (since I didn't put a default case in that switch).
> That seems fine to me.  Also, because the sequence of events would be
> (1) failing process scribbles and crashes, (2) postmaster updates
> sigquit_reason, (3) other child processes examine sigquit_reason,
> it's fairly likely that we'd get the right answer even if the field
> got clobbered during (1).

Hmm.

> There might be an argument for emitting the "unexpected SIGQUIT"
> text if we find garbage in sigquit_reason.  Any thoughts about that?

Although I can't think of any case now, IMHO we can still have a
default case(we may or may not hit it) in the switch with a message
something like  "terminating connection due to unexpected SIGQUIT".

> >> AFAIK, errmsg(terminating connection due to administrator command") is
> >> emitted when there's no specific reason. But we know exactly why we
> >> are terminating in this case, how about having an error message like
> >> errmsg("terminating connection due to immediate shutdown request")));
> >> ? There are other places where errmsg("terminating connection due to
> >> XXXX reasons"); is used. We also log messages when an immediate
> >> shutdown request is received errmsg("received immediate shutdown
> >> request").
>
> > +1. I definitely think having this message be different can be useful.
>
> OK, will use "terminating connection due to immediate shutdown
> request".

Thanks.

I don't have any further comments on the patch.

> > See also the thread about tracking shutdown reasons (connection
> > statistics) -- not the same thing, but the same concepts apply.
>
> Hm.  I wondered for a bit if that patch could make use of this one
> to improve its results.  For the specific case of SIGQUIT it seems
> moot because we aren't going to let backends send any shutdown
> statistics during an emergency stop.

Yeah.

> But maybe the idea could be extended to let more-accurate termination reasons be provided in
> some other cases.

Yeah. For instance, the idea can be extended to the following scenario
- currently for smart and fast shutdown postmaster sends single signal
SIGTERM, so the backend can not know what was the exact reason for the
shutdown. If the postmaster updates the sigterm reason, (the way this
patch does, just before signalling children with SIGTERM), then the
backend would know that information and can report better.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Single transaction in the tablesync worker?
Next
From: Craig Ringer
Date:
Subject: Logical decoding without slots: decoding in lockstep with recovery