Re: Add guc to enable send SIGSTOP to peers when backend exits abnormally - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: Add guc to enable send SIGSTOP to peers when backend exits abnormally
Msg-id: 2601098.1630705483@sss.pgh.pa.us
In response to: Re: Add guc to enable send SIGSTOP to peers when backend exits abnormally  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List: pgsql-hackers

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-Sep-03, Tom Lane wrote:
>> TBH, I'd sooner rip out SendStop, and simplify the related postmaster
>> logic.

> I wrote a patch to do that in 2012, after this exchange:
> https://postgr.es/m/1333124720-sup-6193@alvh.no-ip.org
> It obviously doesn't apply at all anymore, but the thing that prevented
> me from sending it was I couldn't find what the mentioned feature was
> that would cause all backends to dump core at the time of a crash.

Oh, I think you misunderstood what I wrote.  I was thinking of the
ancient habit of most kernels to dump cores to a file just named
"core"; so that even if you went around and manually SIGABRT'd
each stopped process, the cores would all overwrite each other,
leaving you with little to show for the exercise.  Nowadays you're
more likely to get "core.NNN" for each PID, so that it could in
principle be useful to force all the backends to dump core for later
analysis.  But I know of no mechanism that would do that for you.

However, thinking about this afresh, it seems like that Berkeley-era
comment about "the wily post_hacker" was never very apropos.  If what
you wanted was a few GB of core files for later analysis, it'd make
more sense to have the postmaster send SIGABRT or the like.  That
saves a bunch of tedious manual steps, plus the cluster isn't left
in a funny state that requires yet more manual cleanup steps.
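
To make the contrast concrete, here is a rough sketch in Python (not
PostgreSQL code, and not how the postmaster is written).  The helper
names spawn_peers, signal_and_observe and cleanup_stopped are made up
for illustration, and the children are plain sleeping processes standing
in for backends.  Core files are disabled in the children so the demo
leaves nothing on disk.  With SIGABRT each child simply dies and can be
reaped; with SIGSTOP each child sits stopped until somebody cleans it up
by hand.

```python
import os
import resource
import signal
import time


def spawn_peers(n):
    """Fork n idle children (stand-ins for backends); return their PIDs."""
    ready_r, ready_w = os.pipe()
    pids = []
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            try:
                os.close(ready_r)
                # Keep the hard limit, but set the soft core limit to 0
                # so an abort in this demo does not write a core file.
                _, hard = resource.getrlimit(resource.RLIMIT_CORE)
                resource.setrlimit(resource.RLIMIT_CORE, (0, hard))
                os.write(ready_w, b"x")  # tell the parent we are set up
                while True:
                    time.sleep(60)
            finally:
                os._exit(1)
        pids.append(pid)
    os.close(ready_w)
    # Wait until every child has reported in before returning.
    seen = 0
    while seen < n:
        seen += len(os.read(ready_r, n - seen))
    os.close(ready_r)
    return pids


def signal_and_observe(pids, sig):
    """Send sig to every PID, then report what happened to each one."""
    for pid in pids:
        os.kill(pid, sig)
    outcome = {}
    for pid in pids:
        # WUNTRACED makes waitpid also return for stopped children.
        _, status = os.waitpid(pid, os.WUNTRACED)
        if os.WIFSTOPPED(status):
            outcome[pid] = "stopped"
        elif os.WIFSIGNALED(status):
            outcome[pid] = "killed by " + signal.Signals(os.WTERMSIG(status)).name
        else:
            outcome[pid] = "exited"
    return outcome


def cleanup_stopped(pids):
    """The manual step SIGSTOP leaves behind: kill and reap each child."""
    for pid in pids:
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)
```

Calling signal_and_observe(spawn_peers(3), signal.SIGABRT) leaves nothing
behind, whereas the same call with signal.SIGSTOP must be followed by
cleanup_stopped() on the same PIDs.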

So I'm thinking that the *real* use-case for this is for developers
to attach with gdb and do on-the-fly investigation of the state of
other backends, rather than forcing core-dumps.  However, it's still
a pretty half-baked feature because there's no easy way to clean up
afterwards.

The other elephant in the room is that by the time the postmaster
has reacted to the initial backend crash, it's dubious whether the
state of other processes is still able to tell you much.  (IME,
at least, the postmaster doesn't hear about it until the kernel
has finished writing out the dying process's core image, which
takes approximately forever compared to modern CPU speeds.)

> So it seemed to me that we would be ripping out a feature I had used,
> with no replacement.

If we had a really useful feature here I'd be all over it.
But it looks more like somebody's ten-minute hack, so the
fact that it's undocumented and obscure-to-invoke seems
appropriate to me.

(BTW, I think we had exactly this discussion way back when
Peter cleaned up the postmaster/postgres command line switches.
Just about all the other old switches have equivalent GUCs,
and IIRC it is not an oversight that SendStop was left out.)

            regards, tom lane
