Thread: SIGQUIT handling, redux

SIGQUIT handling, redux

From
Tom Lane
Date:
This is to pull together a couple of recent threads that are seemingly
unrelated to $SUBJECT.

Over in the long thread at [1] we've agreed to try to make the backend
code actually do what assorted comments claim it does, namely allow
SIGQUIT to be accepted at any point after a given process has established
its signal handlers.  (Right now, it mostly is not accepted during error
recovery, but there's no clear reason to preserve that exception.)

Of course, this is only safe if the SIGQUIT handler is safe to be invoked
anywhere, so I did a quick survey of backend signal handlers to see if
that is true.  I immediately found pgarch_exit() which surely is not.  It
turns out that Horiguchi-san already noticed that and proposed to fix it,
within the seemingly entirely unrelated patch series for a shared-memory
based stats collector (see patch 0001 in [2]).  I think we should go ahead
and commit that, and maybe even back-patch it.  There seems no good reason
for the archiver to treat SIGQUIT as nonfatal when other postmaster
children do not.

Another thing I found is startup_die() in postmaster.c, which catches
SIGQUIT during the authentication phase of a new backend.  This calls
proc_exit(1), which (a) is unsafe on its face, and (b) potentially
demotes a SIGQUIT into something that will not cause the parent postmaster
to perform a DB-wide restart.  There seems no good reason for that
behavior, either.  I propose that we use SignalHandlerForCrashExit
for SIGQUIT here too.
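
(For reference, SignalHandlerForCrashExit in postmaster/interrupt.c amounts
to little more than the following; paraphrased rather than quoted verbatim,
but the point is that nothing in it is unsafe to run from a signal handler:)

void
SignalHandlerForCrashExit(SIGNAL_ARGS)
{
    /*
     * Deliberately skip proc_exit()/atexit callbacks: they aren't
     * signal-safe, and shared memory may be corrupt anyway.  Exiting
     * with _exit(2) rather than 0 or 1 is what makes the postmaster
     * treat this as a crash and force a cluster-wide restart.
     */
    _exit(2);
}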

In passing, it's worth noting that startup_die() isn't really much safer
for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
those is that code that applies BlockSig will at least manage to block the
former.  So it's slightly tempting to drop startup_die() altogether in
favor of using SignalHandlerForCrashExit for the SIGTERM-during-auth case
too.  However, that'd have the effect of causing a "fast shutdown" to get
converted to a panic if it happens while any sessions are in
authentication phase, so I think this cure is likely worse than the
disease.

Worth reading in connection with this is the thread that led up to
commits 8e19a8264 et al [3].  We had a long discussion about how
quickdie() and startup_die() might be made safe(r), without any
real consensus emerging about any alternatives being better than the
status quo.  I still don't have an improvement idea for quickdie();
I don't want to give up trying to send a message to the client.
Note, however, that quickdie() does end with _exit(2) so that at
least it's not trying to execute arbitrary process-cleanup code.
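
(Heavily condensed, and paraphrasing rather than quoting the real code,
quickdie() in tcop/postgres.c has roughly this shape; the ereport() in the
middle is the part that is not async-signal-safe:)

static void
quickdie(SIGNAL_ARGS)
{
    sigaddset(&BlockSig, SIGQUIT);  /* prevent nested invocation */
    PG_SETMASK(&BlockSig);

    /*
     * The risky part: ereport() from a signal handler can land in
     * gettext(), malloc(), or SSL writes while those are in inconsistent
     * states.
     */
    ereport(WARNING,
            (errcode(ERRCODE_CRASH_SHUTDOWN),
             errmsg("terminating connection because of crash of another server process")));

    /* ... but at least no proc_exit()/atexit cleanup is attempted. */
    _exit(2);
}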

In short then, I want to ensure that postmaster children's SIGQUIT
handlers always end with _exit(2) rather than some other exit method.
We have two exceptions now and they need to get fixed.

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/CALDaNm1d1hHPZUg3xU4XjtWBOLCrA%2B-2cJcLpw-cePZ%3DGgDVfA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20200908.175557.617150409868541587.horikyota.ntt%40gmail.com
[3] https://www.postgresql.org/message-id/flat/CAOMx_OAuRUHiAuCg2YgicZLzPVv5d9_H4KrL_OFsFP%3DVPekigA%40mail.gmail.com



Re: SIGQUIT handling, redux

From
Stephen Frost
Date:
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> This is to pull together a couple of recent threads that are seemingly
> unrelated to $SUBJECT.
>
> Over in the long thread at [1] we've agreed to try to make the backend
> code actually do what assorted comments claim it does, namely allow
> SIGQUIT to be accepted at any point after a given process has established
> its signal handlers.  (Right now, it mostly is not accepted during error
> recovery, but there's no clear reason to preserve that exception.)
>
> Of course, this is only safe if the SIGQUIT handler is safe to be invoked
> anywhere, so I did a quick survey of backend signal handlers to see if
> that is true.  I immediately found pgarch_exit() which surely is not.  It
> turns out that Horiguchi-san already noticed that and proposed to fix it,
> within the seemingly entirely unrelated patch series for a shared-memory
> based stats collector (see patch 0001 in [2]).  I think we should go ahead
> and commit that, and maybe even back-patch it.  There seems no good reason
> for the archiver to treat SIGQUIT as nonfatal when other postmaster
> children do not.

As I mentioned over there, I agree that we should do this and we should
further have the statistics collector also do so, which currently sets
up SIGQUIT with SignalHandlerForShutdownRequest() and in its loop decides it's
fine to write out the stats file (which we're going to remove during
recovery anyway...) and then call exit(0).  I also think we should
back-patch these changes, as the commit mentioned in Horiguchi-san's
patch for pgarch_exit() was.
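
(For context, the SIGQUIT handler the collector hooks up is roughly the
stock shutdown-request handler from postmaster/interrupt.c, paraphrased
below.  The handler itself only sets a flag; the questionable part is what
the collector's main loop then chooses to do before exiting:)

void
SignalHandlerForShutdownRequest(SIGNAL_ARGS)
{
    int         save_errno = errno;

    ShutdownRequestPending = true;
    SetLatch(MyLatch);

    errno = save_errno;
}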

> Another thing I found is startup_die() in postmaster.c, which catches
> SIGQUIT during the authentication phase of a new backend.  This calls
> proc_exit(1), which (a) is unsafe on its face, and (b) potentially
> demotes a SIGQUIT into something that will not cause the parent postmaster
> to perform a DB-wide restart.  There seems no good reason for that
> behavior, either.  I propose that we use SignalHandlerForCrashExit
> for SIGQUIT here too.

Yikes, agreed.

> In passing, it's worth noting that startup_die() isn't really much safer
> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
> those is that code that applies BlockSig will at least manage to block the
> former.  So it's slightly tempting to drop startup_die() altogether in
> favor of using SignalHandlerForCrashExit for the SIGTERM-during-auth case
> too.  However, that'd have the effect of causing a "fast shutdown" to get
> converted to a panic if it happens while any sessions are in
> authentication phase, so I think this cure is likely worse than the
> disease.

Agreed, that's definitely no good.

> Worth reading in connection with this is the thread that led up to
> commits 8e19a8264 et al [3].  We had a long discussion about how
> quickdie() and startup_die() might be made safe(r), without any
> real consensus emerging about any alternatives being better than the
> status quo.  I still don't have an improvement idea for quickdie();
> I don't want to give up trying to send a message to the client.
> Note, however, that quickdie() does end with _exit(2) so that at
> least it's not trying to execute arbitrary process-cleanup code.
>
> In short then, I want to ensure that postmaster children's SIGQUIT
> handlers always end with _exit(2) rather than some other exit method.
> We have two exceptions now and they need to get fixed.

I agree we should fix these.

Thanks,

Stephen


Re: SIGQUIT handling, redux

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Of course, this is only safe if the SIGQUIT handler is safe to be invoked
>> anywhere, so I did a quick survey of backend signal handlers to see if
>> that is true.  I immediately found pgarch_exit() which surely is not.  It
>> turns out that Horiguchi-san already noticed that and proposed to fix it,
>> within the seemingly entirely unrelated patch series for a shared-memory
>> based stats collector (see patch 0001 in [2]).  I think we should go ahead
>> and commit that, and maybe even back-patch it.  There seems no good reason
>> for the archiver to treat SIGQUIT as nonfatal when other postmaster
>> children do not.

> As I mentioned over there, I agree that we should do this and we should
> further have the statistics collector also do so, which currently sets
> up SIGQUIT with SignalHandlerForShutdownRequest() and in its loop decides it's
> fine to write out the stats file (which we're going to remove during
> recovery anyway...) and then call exit(0).

I noticed that that was different from everything else, but it's not
actually signal-unsafe, so it seems like a different topic from what
I'm on about at the moment.  I don't mind if you or somebody else
wants to change it, but I don't see it as a back-patchable bug fix.

> I also think we should
> back-patch these changes, as the commit mentioned in Horiguchi-san's
> patch for pgarch_exit() was.

Agreed; I'll go make it happen.

            regards, tom lane



Re: SIGQUIT handling, redux

From
Tom Lane
Date:
I wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> As I mentioned over there, I agree that we should do this and we should
>> further have the statistics collector also do so, which currently sets
>> up SIGQUIT with SignalHandlerForShutdownRequest() and in its loop decides it's
>> fine to write out the stats file (which we're going to remove during
>> recovery anyway...) and then call exit(0).

> I noticed that that was different from everything else, but it's not
> actually signal-unsafe, so it seems like a different topic from what
> I'm on about at the moment.  I don't mind if you or somebody else
> wants to change it, but I don't see it as a back-patchable bug fix.

Note also that the postmaster actually uses SIGQUIT to command normal
shutdown of the stats collector (cf reaper(), around line 3125 in HEAD).
So this needs a change in signaling conventions, not just internal
tweaks in the collector.  Not a big deal, but it reinforces my feeling
that it should be a HEAD-only change.

            regards, tom lane



Re: SIGQUIT handling, redux

From
Andres Freund
Date:
Hi,

On 2020-09-08 17:39:24 -0400, Tom Lane wrote:
> Of course, this is only safe if the SIGQUIT handler is safe to be invoked
> anywhere, so I did a quick survey of backend signal handlers to see if
> that is true.  I immediately found pgarch_exit() which surely is not.  It
> turns out that Horiguchi-san already noticed that and proposed to fix it,
> within the seemingly entirely unrelated patch series for a shared-memory
> based stats collector (see patch 0001 in [2]).  I think we should go ahead
> and commit that, and maybe even back-patch it.  There seems no good reason
> for the archiver to treat SIGQUIT as nonfatal when other postmaster
> children do not.

+1


> Another thing I found is startup_die() in postmaster.c, which catches
> SIGQUIT during the authentication phase of a new backend.  This calls
> proc_exit(1), which (a) is unsafe on its face, and (b) potentially
> demotes a SIGQUIT into something that will not cause the parent postmaster
> to perform a DB-wide restart.  There seems no good reason for that
> behavior, either.  I propose that we use SignalHandlerForCrashExit
> for SIGQUIT here too.

Yikes, yea, that seems like an important change.


I wish startup_die() weren't named startup_ - every single time I see
the name I think it's about the startup process...


I think StartupPacketTimeoutHandler is another case?



> In passing, it's worth noting that startup_die() isn't really much safer
> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
> those is that code that applies BlockSig will at least manage to block the
> former.

Which is pretty unconvincing...

I wonder if we could at least *try* to rely on CFR() for a while. It'd
not be pretty, but probably doable, to set up a timer inside the signal
handler and try to exit using normal mechanisms for a while (some
overlap with authentication timeout).

That'd leave the question of what we do once that timer expires. There are
quite a few possible ways that could reproducibly happen, e.g. if DNS
lookups are required during auth.

The long term correct way to handle this would obviously be to
restructure everything that happens covered by startup_die() in a
non-blocking manner and just rely on CFR(). But that's a tall order to
get done anytime soon, particularly things like DNS are IIRC pretty hard
without relying on custom libraries.

Shorter term I don't really see an alternative to escalating to SIGQUIT
which is obviously terrible.


> So it's slightly tempting to drop startup_die() altogether in favor of
> using SignalHandlerForCrashExit for the SIGTERM-during-auth case too.
> However, that'd have the effect of causing a "fast shutdown" to get
> converted to a panic if it happens while any sessions are in
> authentication phase, so I think this cure is likely worse than the
> disease.

Indeed.


> I don't want to give up trying to send a message to the client.

That still doesn't make much sense to me. The potential for hanging
(e.g. inside malloc) is so much worse than not sending a message... And
we already infer things about the server dying in libpq when the
connection just closes (which I am admittedly also not happy about). But
I also think we can reduce the risk a bit, see below.

I only had one coffee so far (and it looks like the sun has died
outside), so maybe I'm just slow: But, uh, we don't currently send a
message in startup_die(), right?

So that part is about quickdie()?


> Worth reading in connection with this is the thread that led up to
> commits 8e19a8264 et al [3].  We had a long discussion about how
> quickdie() and startup_die() might be made safe(r), without any
> real consensus emerging about any alternatives being better than the
> status quo.  I still don't have an improvement idea for quickdie();

I still think that we should at least mitigate the risk of hanging inside
quickdie() by disabling translations (since the rest happens inside the
pre-allocated error memory context, which should lower the risk).  And
perhaps do similarly for the memory required for encryption, if enabled,
and where possible.


Greetings,

Andres Freund



Re: SIGQUIT handling, redux

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I wish startup_die() weren't named startup_ - every single time I see
> the name I think it's about the startup process...

We could call it startup_packet_die or something?

> I think StartupPacketTimeoutHandler is another case?

Yeah.  Although it's a lot less risky, since if the timeout is reached
we're almost certainly waiting for client input.

>> In passing, it's worth noting that startup_die() isn't really much safer
>> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
>> those is that code that applies BlockSig will at least manage to block the
>> former.

> Which is pretty unconvincing...

Agreed, it'd be nice if this were less shaky.  On the other hand,
we've seen darn few complaints traceable to this AFAIR.  I'm not
really sure it's worth putting a lot of effort into.

> The long term correct way to handle this would obviously be to
> restructure everything that happens covered by startup_die() in a
> non-blocking manner and just rely on CFR(). But that's a tall order to
> get done anytime soon, particularly things like DNS are IIRC pretty hard
> without relying on custom libraries.

Not only DNS, but all the various auth libraries would have to be
contended with.  Lots of work there compared to the likely rewards.

>> I don't want to give up trying to send a message to the client.

> That still doesn't make much sense to me. The potential for hanging
> (e.g. inside malloc) is so much worse than not sending a message...

We see backends going through this code on a very regular basis in the
buildfarm, but complete hangs are rare as can be.  I think you
overestimate the severity of the problem.

> I only had one coffee so far (and it looks like the sun has died
> outside), so maybe I'm just slow: But, uh, we don't currently send a
> message in startup_die(), right?
> So that part is about quickdie()?

Right.  Note that startup_die() is pre-authentication, so I'm doubtful
that we should tell the would-be client anything about the state of
the server at that point, even ignoring these risk factors.  (I'm a
bit inclined to remove the comment suggesting that'd be desirable.)

            regards, tom lane



Re: SIGQUIT handling, redux

From
Andres Freund
Date:
Hi,

On 2020-09-09 16:09:00 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I wish startup_die() weren't named startup_ - every single time I see
> > the name I think it's about the startup process...
> 
> We could call it startup_packet_die or something?

Yea, I think that'd be good.


> > I think StartupPacketTimeoutHandler is another case?
> 
> Yeah.  Although it's a lot less risky, since if the timeout is reached
> we're almost certainly waiting for client input.

An adversary could control that to a significant degree - but ordinarily
I agree...


> >> In passing, it's worth noting that startup_die() isn't really much safer
> >> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
> >> those is that code that applies BlockSig will at least manage to block the
> >> former.
> 
> > Which is pretty unconvincing...
> 
> Agreed, it'd be nice if this were less shaky.  On the other hand,
> we've seen darn few complaints traceable to this AFAIR.  I'm not
> really sure it's worth putting a lot of effort into.

Not sure how many would plausibly reach us though.  A common reaction
will likely just be to force-restart the server, not to fully
investigate the issue, particularly because it'll often be once
something has already gone wrong...



> >> I don't want to give up trying to send a message to the client.
> 
> > That still doesn't make much sense to me. The potential for hanging
> > (e.g. inside malloc) is so much worse than not sending a message...
> 
> We see backends going through this code on a very regular basis in the
> buildfarm, but complete hangs are rare as can be.  I think you
> overestimate the severity of the problem.

I don't think the BF exercises the problematic paths to a significant
degree. It's mostly local socket connections, and where not it's
localhost. There's no slow DNS, no more complicated authentication
methods, no packet loss. How often do we ever actually end up even
getting close to any of the paths but immediate shutdowns? And in the
SIGQUIT path, how often do we end up in the SIGKILL path, masking
potential deadlocks?


> > I only had one coffee so far (and it looks like the sun has died
> > outside), so maybe I'm just slow: But, uh, we don't currently send a
> > message in startup_die(), right?
> > So that part is about quickdie()?
> 
> Right.  Note that startup_die() is pre-authentication, so I'm doubtful
> that we should tell the would-be client anything about the state of
> the server at that point, even ignoring these risk factors.  (I'm a
> bit inclined to remove the comment suggesting that'd be desirable.)

Yea, I think just putting in an editorialized version of your paragraph
would make sense.

Greetings,

Andres Freund



Re: SIGQUIT handling, redux

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2020-09-09 16:09:00 -0400, Tom Lane wrote:
>> We could call it startup_packet_die or something?

> Yea, I think that'd be good.

I'll make it so.

>> We see backends going through this code on a very regular basis in the
>> buildfarm, but complete hangs are rare as can be.  I think you
>> overestimate the severity of the problem.

> I don't think the BF exercises the problematic paths to a significant
> degree. It's mostly local socket connections, and where not it's
> localhost. There's no slow DNS, no more complicated authentication
> methods, no packet loss. How often do we ever actually end up even
> getting close to any of the paths but immediate shutdowns?

Since we're talking about quickdie(), immediate shutdown/crash restart
is exactly the case of concern, and the buildfarm exercises it all the
time.

> And in the
> SIGQUIT path, how often do we end up in the SIGKILL path, masking
> potential deadlocks?

True, we can't really tell that.  I wonder if we should make the
postmaster emit a log message when it times out and goes to SIGKILL.
After a few months we could scrape the buildfarm logs and get a
pretty good handle on it.
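
(A sketch of what that could look like, next to the existing escalation
logic in ServerLoop(); the LOG line and its wording are just illustrative,
not committed text:)

        if (AbortStartTime != 0 &&
            (now - AbortStartTime) >= SIGKILL_CHILDREN_AFTER_SECS)
        {
            /* We were gentle with them before.  Not anymore. */
            ereport(LOG,
                    (errmsg("issuing SIGKILL to recalcitrant children")));
            TerminateChildren(SIGKILL);
            /* reset flag so we don't SIGKILL again */
            AbortStartTime = 0;
        }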

            regards, tom lane



Re: SIGQUIT handling, redux

From
Tom Lane
Date:
I wrote:
> Not only DNS, but all the various auth libraries would have to be
> contended with.  Lots of work there compared to the likely rewards.

Wait a minute.  The entire authentication cycle happens inside
InitPostgres, using the backend's normal signal handlers.  So
maybe we are overthinking the problem.  What if we simply postpone
ProcessStartupPacket into that same place, and run it under the same
rules as we do for authentication?  We would waste more cycles than
we do now for the case where the client closes the connection without
sending a startup packet, but not enormously so, I think --- and
optimizing that case doesn't seem like a high-priority goal anyway.
And cases like DNS lookup taking forever don't seem like any more of
an issue than auth lookup taking forever.

            regards, tom lane



Re: SIGQUIT handling, redux

From
Andres Freund
Date:
Hi,

On 2020-09-09 16:30:37 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2020-09-09 16:09:00 -0400, Tom Lane wrote:
> >> We could call it startup_packet_die or something?
> 
> > Yea, I think that'd be good.
> 
> I'll make it so.

Thanks!


> >> We see backends going through this code on a very regular basis in the
> >> buildfarm, but complete hangs are rare as can be.  I think you
> >> overestimate the severity of the problem.
> 
> > I don't think the BF exercises the problematic paths to a significant
> > degree. It's mostly local socket connections, and where not it's
> > localhost. There's no slow DNS, no more complicated authentication
> > methods, no packet loss. How often do we ever actually end up even
> > getting close to any of the paths but immediate shutdowns?
> 
> Since we're talking about quickdie(), immediate shutdown/crash restart
> is exactly the case of concern, and the buildfarm exercises it all the
> time.

Yea, but only in simple cases.  Largely no SSL / Kerberos.  Largely
untranslated.  Mostly the immediate shutdowns don't happen while we're
inside plpython or such.


> > And in the
> > SIGQUIT path, how often do we end up in the SIGKILL path, masking
> > potential deadlocks?
> 
> True, we can't really tell that.  I wonder if we should make the
> postmaster emit a log message when it times out and goes to SIGKILL.
> After a few months we could scrape the buildfarm logs and get a
> pretty good handle on it.

I think that'd be a good idea.

Greetings,

Andres Freund



Re: SIGQUIT handling, redux

From
Tom Lane
Date:
I wrote:
>> Not only DNS, but all the various auth libraries would have to be
>> contended with.  Lots of work there compared to the likely rewards.

> Wait a minute.  The entire authentication cycle happens inside
> InitPostgres, using the backend's normal signal handlers.  So
> maybe we are overthinking the problem.  What if we simply postpone
> ProcessStartupPacket into that same place, and run it under the same
> rules as we do for authentication?

Or, continuing to look for other answers ...

During BackendInitialize we have not yet touched shared memory.
This is not incidental but must be so, per its header comment.
Therefore it seems like we could have these signal handlers (for
SIGTERM or timeout) do _exit(1), thereby resolving the signal
unsafety while not provoking a database restart.  We don't
care whether inside-the-process state is sane, and we shouldn't
have changed anything yet in shared memory.

This is obviously not 100.00% safe, since maybe something could
violate these assumptions, but it seems a lot safer than using
proc_exit() from a signal handler.

One way to help catch any such assumption-violations is to add
an assertion that no on_shmem_exit handlers have been set up yet.
If there are none, there should be no expectation of having to
clean up shared state.

            regards, tom lane



Re: SIGQUIT handling, redux

From
Tom Lane
Date:
Here's a draft patch that I think would be reasonable to back-patch.
(Before v13, we'd need a bespoke SIGQUIT handler to substitute for
SignalHandlerForCrashExit, but that's easy enough.)

Aside from comment updates, this

* uses SignalHandlerForCrashExit for SIGQUIT

* renames startup_die per your request

* moves BackendInitialize's interrupt-re-disabling code up a bit to
reduce the scope where these interrupts are active.  This doesn't
make things a whole lot safer, but it can't hurt.

I'll take a closer look at the idea of using _exit(1) tomorrow,
but I'd be pretty hesitant to back-patch that.

            regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 42223c0f61..e65086f7b4 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -112,6 +112,7 @@
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
+#include "postmaster/interrupt.h"
 #include "postmaster/pgarch.h"
 #include "postmaster/postmaster.h"
 #include "postmaster/syslogger.h"
@@ -405,7 +406,7 @@ static void SIGHUP_handler(SIGNAL_ARGS);
 static void pmdie(SIGNAL_ARGS);
 static void reaper(SIGNAL_ARGS);
 static void sigusr1_handler(SIGNAL_ARGS);
-static void startup_die(SIGNAL_ARGS);
+static void startup_packet_die(SIGNAL_ARGS);
 static void dummy_handler(SIGNAL_ARGS);
 static void StartupPacketTimeoutHandler(void);
 static void CleanupBackend(int pid, int exitstatus);
@@ -4340,22 +4341,30 @@ BackendInitialize(Port *port)
     whereToSendOutput = DestRemote; /* now safe to ereport to client */

     /*
-     * We arrange for a simple exit(1) if we receive SIGTERM or SIGQUIT or
-     * timeout while trying to collect the startup packet.  Otherwise the
-     * postmaster cannot shutdown the database FAST or IMMED cleanly if a
-     * buggy client fails to send the packet promptly.  XXX it follows that
-     * the remainder of this function must tolerate losing control at any
-     * instant.  Likewise, any pg_on_exit_callback registered before or during
-     * this function must be prepared to execute at any instant between here
-     * and the end of this function.  Furthermore, affected callbacks execute
-     * partially or not at all when a second exit-inducing signal arrives
-     * after proc_exit_prepare() decrements on_proc_exit_index.  (Thanks to
-     * that mechanic, callbacks need not anticipate more than one call.)  This
-     * is fragile; it ought to instead follow the norm of handling interrupts
-     * at selected, safe opportunities.
-     */
-    pqsignal(SIGTERM, startup_die);
-    pqsignal(SIGQUIT, startup_die);
+     * We arrange to do proc_exit(1) if we receive SIGTERM or timeout while
+     * trying to collect the startup packet; while SIGQUIT results in
+     * _exit(2).  Otherwise the postmaster cannot shutdown the database FAST
+     * or IMMED cleanly if a buggy client fails to send the packet promptly.
+     *
+     * XXX this is pretty dangerous; signal handlers should not call anything
+     * as complex as proc_exit() directly.  We minimize the hazard by not
+     * keeping these handlers active for longer than we must.  However, it
+     * seems necessary to be able to escape out of DNS lookups as well as the
+     * startup packet reception proper, so we can't narrow the scope further
+     * than is done here.
+     *
+     * XXX it follows that the remainder of this function must tolerate losing
+     * control at any instant.  Likewise, any pg_on_exit_callback registered
+     * before or during this function must be prepared to execute at any
+     * instant between here and the end of this function.  Furthermore,
+     * affected callbacks execute partially or not at all when a second
+     * exit-inducing signal arrives after proc_exit_prepare() decrements
+     * on_proc_exit_index.  (Thanks to that mechanic, callbacks need not
+     * anticipate more than one call.)  This is fragile; it ought to instead
+     * follow the norm of handling interrupts at selected, safe opportunities.
+     */
+    pqsignal(SIGTERM, startup_packet_die);
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     InitializeTimeouts();        /* establishes SIGALRM handler */
     PG_SETMASK(&StartupBlockSig);

@@ -4411,8 +4420,8 @@ BackendInitialize(Port *port)
         port->remote_hostname = strdup(remote_host);

     /*
-     * Ready to begin client interaction.  We will give up and exit(1) after a
-     * time delay, so that a broken client can't hog a connection
+     * Ready to begin client interaction.  We will give up and proc_exit(1)
+     * after a time delay, so that a broken client can't hog a connection
      * indefinitely.  PreAuthDelay and any DNS interactions above don't count
      * against the time limit.
      *
@@ -4434,6 +4443,12 @@ BackendInitialize(Port *port)
      */
     status = ProcessStartupPacket(port, false, false);

+    /*
+     * Disable the timeout, and prevent SIGTERM again.
+     */
+    disable_timeout(STARTUP_PACKET_TIMEOUT, false);
+    PG_SETMASK(&BlockSig);
+
     /*
      * Stop here if it was bad or a cancel packet.  ProcessStartupPacket
      * already did any appropriate error reporting.
@@ -4459,12 +4474,6 @@ BackendInitialize(Port *port)
     pfree(ps_data.data);

     set_ps_display("initializing");
-
-    /*
-     * Disable the timeout, and prevent SIGTERM/SIGQUIT again.
-     */
-    disable_timeout(STARTUP_PACKET_TIMEOUT, false);
-    PG_SETMASK(&BlockSig);
 }


@@ -5359,16 +5368,22 @@ sigusr1_handler(SIGNAL_ARGS)
 }

 /*
- * SIGTERM or SIGQUIT while processing startup packet.
+ * SIGTERM while processing startup packet.
  * Clean up and exit(1).
  *
- * XXX: possible future improvement: try to send a message indicating
- * why we are disconnecting.  Problem is to be sure we don't block while
- * doing so, nor mess up SSL initialization.  In practice, if the client
- * has wedged here, it probably couldn't do anything with the message anyway.
+ * Running proc_exit() from a signal handler is pretty unsafe, since we
+ * can't know what code we've interrupted.  But the alternative of using
+ * _exit(2) is also unpalatable, since it'd mean that a "fast shutdown"
+ * would cause a database crash cycle (forcing WAL replay at restart)
+ * if any sessions are in authentication.  So we live with it for now.
+ *
+ * One might be tempted to try to send a message indicating why we are
+ * disconnecting.  However, that would make this even more unsafe.  Also,
+ * it seems undesirable to provide clues about the database's state to
+ * a client that has not yet completed authentication.
  */
 static void
-startup_die(SIGNAL_ARGS)
+startup_packet_die(SIGNAL_ARGS)
 {
     proc_exit(1);
 }
@@ -5389,7 +5404,11 @@ dummy_handler(SIGNAL_ARGS)

 /*
  * Timeout while processing startup packet.
- * As for startup_die(), we clean up and exit(1).
+ * As for startup_packet_die(), we clean up and exit(1).
+ *
+ * This is theoretically just as hazardous as it is in startup_packet_die(),
+ * although in practice we're almost certainly waiting for client input,
+ * which greatly reduces the risk.
  */
 static void
 StartupPacketTimeoutHandler(void)

Re: SIGQUIT handling, redux

From
Tom Lane
Date:
I wrote:
> Of course, this is only safe if the SIGQUIT handler is safe to be invoked
> anywhere, so I did a quick survey of backend signal handlers to see if
> that is true.

This is straying a bit from the stated topic of this thread, but ...
I did some further looking around to see whether there were any
unsafe signal handlers besides SIGQUIT ones.  The situation is not
too awful, but I did find several issues not already mentioned
in this thread:

StartupProcShutdownHandler (SIGTERM)

This conditionally calls proc_exit(1).  The conditions boil down
to are-we-interrupting-a-system(3)-call, so who knows how safe
that is?  I wouldn't care to bet that system() doesn't use malloc,
for instance.  Still, the odds are very good that if a signal did
arrive, it'd be interrupting system()'s waitpid() or equivalent
kernel call, which is likely safe enough.
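
(Paraphrasing the handler from startup.c of that vintage, roughly:)

static void
StartupProcShutdownHandler(SIGNAL_ARGS)
{
    int         save_errno = errno;

    if (in_restore_command)
        proc_exit(1);           /* presumably interrupting system() */
    else
        shutdown_requested = true;
    WakeupRecovery();

    errno = save_errno;
}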

bgworker_die (SIGTERM)

Calls ereport(FATAL).  This is surely not any safer than, say,
quickdie().  No, it's worse, because at least that won't try
to go out via proc_exit().

FloatExceptionHandler (SIGFPE)

Calls ereport(ERROR).  This might be okay, though, since the
trap should be synchronous with the offending calculation.
Besides, if you're risking divide by zero or the like in
critical code, You're Doing It Wrong.

RecoveryConflictInterrupt (called from SIGUSR1)

Calls a whole boatload of state tests that were never designed
to be interrupt-safe, such as transaction-state-related inquiries
in xact.c.  The lack of any explicit awareness in this code that
it's in a signal handler doesn't discourage people from inserting
even more dubious stuff.  I think this needs to be burned to the
ground and rewritten.

StandbyDeadLockHandler (from SIGALRM)
StandbyTimeoutHandler (ditto)

Calls CancelDBBackends, which just for starters tries to acquire
an LWLock.  I think the only reason we've gotten away with this
for this long is the high probability that by the time either
timeout fires, we're going to be blocked on a semaphore.

I don't have any ideas about how to fix any of these things,
but I thought getting the problems on the record would be good.

            regards, tom lane



RE: SIGQUIT handling, redux

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tom Lane <tgl@sss.pgh.pa.us>
> This is straying a bit from the stated topic of this thread, but ...
> I did some further looking around to see whether there were any
> unsafe signal handlers besides SIGQUIT ones.  The situation is not
> too awful, but I did find several issues not already mentioned
> in this thread:

Wow, your eyes caught this many issues.  (I was just wondering about one or two of them.)

I'm not sure if this is related, but I had been wondering whether the following behaviors should stay the way they are now.


(1)
When logical replication is used, pg_ctl stop with the default fast mode emits a message about termination of the
logical replication launcher.  Although it's not FATAL or ERROR, I was a bit startled when I saw this message for the
first time.  Why should this message be emitted while nothing about other postmaster children is reported?  The
logical replication launcher ignores SIGINT (SIG_IGN).

LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  background worker "logical replication launcher" (PID 10056) exited with exit code 1
LOG:  shutting down
LOG:  database system is shut down


(2)
When the physical standby stops, a FATAL message is output.  It may be annoying to a DBA who monitors messages with
high severity.

LOG:  received fast shutdown request
LOG:  aborting any active transactions
FATAL:  terminating walreceiver process due to administrator command
LOG:  shutting down
LOG:  database system is shut down


(3)
When WaitLatch(WL_EXIT_ON_PM_DEATH) detects postmaster death, it calls proc_exit(1).  Can we likewise call _exit(1)
here?


Regards
Takayuki Tsunakawa





Re: SIGQUIT handling, redux

From
Kyotaro Horiguchi
Date:
At Wed, 09 Sep 2020 10:46:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> > I also think we should
> > back-patch these changes, as the commit mentioned in Horiguchi-san's
> > patch for pgarch_exit() was.
> 
> Agreed; I'll go make it happen.

Thank you for committing this!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: SIGQUIT handling, redux

From
Robert Haas
Date:
On Wed, Sep 9, 2020 at 10:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> bgworker_die (SIGTERM)
>
> Calls ereport(FATAL).  This is surely not any safer than, say,
> quickdie().  No, it's worse, because at least that won't try
> to go out via proc_exit().

I think bgworker_die() is pretty much a terrible idea. Every
background worker I've written has actually needed to use
CHECK_FOR_INTERRUPTS(). I think that the only way this could actually
be safe is if you have a background worker that never uses ereport()
itself, so that the ereport() in the signal handler can't be
interrupting one that's already happening. This seems unlikely to be
the normal case, or anything close to it. Most background workers
probably are shared-memory connected and use a lot of PostgreSQL
infrastructure and thus ereport() all over the place.

Now what to do about it I don't know exactly, but it would be nice to
do something.

> StandbyDeadLockHandler (from SIGALRM)
> StandbyTimeoutHandler (ditto)
>
> Calls CancelDBBackends, which just for starters tries to acquire
> an LWLock.  I think the only reason we've gotten away with this
> for this long is the high probability that by the time either
> timeout fires, we're going to be blocked on a semaphore.

Yeah, I'm not sure these are so bad. In fact, in the deadlock case, I
believe the old coding was designed to make sure we *had to* be
blocked on a semaphore, but I'm not sure whether that's still true.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: SIGQUIT handling, redux

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Sep 9, 2020 at 10:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> bgworker_die (SIGTERM)
>> Calls ereport(FATAL).  This is surely not any safer than, say,
>> quickdie().  No, it's worse, because at least that won't try
>> to go out via proc_exit().

> I think bgworker_die() is pretty much a terrible idea.

Yeah.  It'd likely be better to insist that bgworkers handle SIGTERM
the same way backends do, via CHECK_FOR_INTERRUPTS.  Not sure how big
a change that would be.
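
(To illustrate the pattern being suggested, here's a sketch of a
shmem-connected worker whose main loop relies on CHECK_FOR_INTERRUPTS(),
using the regular backend die() handler for SIGTERM.  my_worker_main is a
hypothetical entry point, not code from the tree:)

void
my_worker_main(Datum main_arg)
{
    /* die() just sets ProcDiePending; CHECK_FOR_INTERRUPTS() does the rest */
    pqsignal(SIGTERM, die);
    pqsignal(SIGHUP, SignalHandlerForConfigReload);
    BackgroundWorkerUnblockSignals();

    for (;;)
    {
        CHECK_FOR_INTERRUPTS();

        /* ... perform one unit of work ... */

        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         1000L, PG_WAIT_EXTENSION);
        ResetLatch(MyLatch);
    }
}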

> I think that the only way this could actually
> be safe is if you have a background worker that never uses ereport()
> itself, so that the ereport() in the signal handler can't be
> interrupting one that's already happening.

That doesn't begin to make it safe, because quite aside from anything
that happens in elog.c, we're then going to go out via proc_exit()
which will invoke who knows what.

>> StandbyDeadLockHandler (from SIGALRM)
>> StandbyTimeoutHandler (ditto)
>> Calls CancelDBBackends, which just for starters tries to acquire
>> an LWLock.  I think the only reason we've gotten away with this
>> for this long is the high probability that by the time either
>> timeout fires, we're going to be blocked on a semaphore.

> Yeah, I'm not sure these are so bad. In fact, in the deadlock case, I
> believe the old coding was designed to make sure we *had to* be
> blocked on a semaphore, but I'm not sure whether that's still true.

Looking at this closer, the only code that could get interrupted
by these timeouts is a call to ProcWaitForSignal, which is

void
ProcWaitForSignal(uint32 wait_event_info)
{
    (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
                     wait_event_info);
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
}

Interrupting the latch operations might be safe enough,
although now I'm casting a side-eye at Munro's recent
changes to share WaitEvent state all over the place.
I wonder whether blocking on an LWLock inside the
signal handler will break an interrupted WaitLatch.

Also, man that CHECK_FOR_INTERRUPTS() looks like trouble.
Could we take that out?

            regards, tom lane



Re: SIGQUIT handling, redux

From
Robert Haas
Date:
On Thu, Sep 10, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Also, man that CHECK_FOR_INTERRUPTS() looks like trouble.
> Could we take that out?

Maybe I'm missing something, but why wouldn't that be a horrible idea?
We do not want to have long waits where we refuse to respond to
interrupts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: SIGQUIT handling, redux

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Sep 10, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Also, man that CHECK_FOR_INTERRUPTS() looks like trouble.
>> Could we take that out?

> Maybe I'm missing something, but why wouldn't that be a horrible idea?
> We do not want to have long waits where we refuse to respond to
> interrupts.

It might be appropriate for some of the callers to do it.  But I don't
see any great argument why ProcWaitForSignal itself has to do it.

            regards, tom lane



Re: SIGQUIT handling, redux

From
Tom Lane
Date:
I wrote:
> I'll take a closer look at the idea of using _exit(1) tomorrow,
> but I'd be pretty hesitant to back-patch that.

Here's a patch for that; it passes light testing, including forcing
the backend through the SIGTERM and timeout code paths.  There's
not much to it except for changing the signal handlers, but I did
add a cross-check that no on-exit handlers have been registered
before we get done with using these signal handlers.

I looked briefly at the idea of postponing ProcessStartupPacket
until InitPostgres has set up a fairly normal environment.  It
seems like it'd take a fair amount of refactoring.  I really
doubt it's worth the effort, even though the result would be
arguably cleaner logic than what we have now.

            regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 15ad675bc1..081022a206 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4298,6 +4298,8 @@ report_fork_failure_to_client(Port *port, int errnum)
  * returns: nothing.  Will not return at all if there's any failure.
  *
  * Note: this code does not depend on having any access to shared memory.
+ * Indeed, our approach to SIGTERM/timeout handling *requires* that
+ * shared memory not have been touched yet; see comments within.
  * In the EXEC_BACKEND case, we are physically attached to shared memory
  * but have not yet set up most of our local pointers to shmem structures.
  */
@@ -4341,27 +4343,14 @@ BackendInitialize(Port *port)
     whereToSendOutput = DestRemote; /* now safe to ereport to client */

     /*
-     * We arrange to do proc_exit(1) if we receive SIGTERM or timeout while
-     * trying to collect the startup packet; while SIGQUIT results in
-     * _exit(2).  Otherwise the postmaster cannot shutdown the database FAST
-     * or IMMED cleanly if a buggy client fails to send the packet promptly.
+     * We arrange to do _exit(1) if we receive SIGTERM or timeout while trying
+     * to collect the startup packet; while SIGQUIT results in _exit(2).
+     * Otherwise the postmaster cannot shutdown the database FAST or IMMED
+     * cleanly if a buggy client fails to send the packet promptly.
      *
-     * XXX this is pretty dangerous; signal handlers should not call anything
-     * as complex as proc_exit() directly.  We minimize the hazard by not
-     * keeping these handlers active for longer than we must.  However, it
-     * seems necessary to be able to escape out of DNS lookups as well as the
-     * startup packet reception proper, so we can't narrow the scope further
-     * than is done here.
-     *
-     * XXX it follows that the remainder of this function must tolerate losing
-     * control at any instant.  Likewise, any pg_on_exit_callback registered
-     * before or during this function must be prepared to execute at any
-     * instant between here and the end of this function.  Furthermore,
-     * affected callbacks execute partially or not at all when a second
-     * exit-inducing signal arrives after proc_exit_prepare() decrements
-     * on_proc_exit_index.  (Thanks to that mechanic, callbacks need not
-     * anticipate more than one call.)  This is fragile; it ought to instead
-     * follow the norm of handling interrupts at selected, safe opportunities.
+     * Exiting with _exit(1) is only possible because we have not yet touched
+     * shared memory; therefore no outside-the-process state needs to get
+     * cleaned up.
      */
     pqsignal(SIGTERM, process_startup_packet_die);
     pqsignal(SIGQUIT, SignalHandlerForCrashExit);
@@ -4420,8 +4409,8 @@ BackendInitialize(Port *port)
         port->remote_hostname = strdup(remote_host);

     /*
-     * Ready to begin client interaction.  We will give up and proc_exit(1)
-     * after a time delay, so that a broken client can't hog a connection
+     * Ready to begin client interaction.  We will give up and _exit(1) after
+     * a time delay, so that a broken client can't hog a connection
      * indefinitely.  PreAuthDelay and any DNS interactions above don't count
      * against the time limit.
      *
@@ -4449,6 +4438,17 @@ BackendInitialize(Port *port)
     disable_timeout(STARTUP_PACKET_TIMEOUT, false);
     PG_SETMASK(&BlockSig);

+    /*
+     * As a safety check that nothing in startup has yet performed
+     * shared-memory modifications that would need to be undone if we had
+     * exited through SIGTERM or timeout above, check that no on_shmem_exit
+     * handlers have been registered yet.  (This isn't terribly bulletproof,
+     * since someone might misuse an on_proc_exit handler for shmem cleanup,
+     * but it's a cheap and helpful check.  We cannot disallow on_proc_exit
+     * handlers unfortunately, since pq_init() already registered one.)
+     */
+    check_on_shmem_exit_lists_are_empty();
+
     /*
      * Stop here if it was bad or a cancel packet.  ProcessStartupPacket
      * already did any appropriate error reporting.
@@ -5369,23 +5369,21 @@ sigusr1_handler(SIGNAL_ARGS)

 /*
  * SIGTERM while processing startup packet.
- * Clean up and exit(1).
  *
- * Running proc_exit() from a signal handler is pretty unsafe, since we
- * can't know what code we've interrupted.  But the alternative of using
- * _exit(2) is also unpalatable, since it'd mean that a "fast shutdown"
- * would cause a database crash cycle (forcing WAL replay at restart)
- * if any sessions are in authentication.  So we live with it for now.
+ * Running proc_exit() from a signal handler would be quite unsafe.
+ * However, since we have not yet touched shared memory, we can just
+ * pull the plug and exit without running any atexit handlers.
  *
- * One might be tempted to try to send a message indicating why we are
- * disconnecting.  However, that would make this even more unsafe.  Also,
- * it seems undesirable to provide clues about the database's state to
- * a client that has not yet completed authentication.
+ * One might be tempted to try to send a message, or log one, indicating
+ * why we are disconnecting.  However, that would be quite unsafe in itself.
+ * Also, it seems undesirable to provide clues about the database's state
+ * to a client that has not yet completed authentication, or even sent us
+ * a startup packet.
  */
 static void
 process_startup_packet_die(SIGNAL_ARGS)
 {
-    proc_exit(1);
+    _exit(1);
 }

 /*
@@ -5404,16 +5402,12 @@ dummy_handler(SIGNAL_ARGS)

 /*
  * Timeout while processing startup packet.
- * As for process_startup_packet_die(), we clean up and exit(1).
- *
- * This is theoretically just as hazardous as in process_startup_packet_die(),
- * although in practice we're almost certainly waiting for client input,
- * which greatly reduces the risk.
+ * As for process_startup_packet_die(), we exit via _exit(1).
  */
 static void
 StartupPacketTimeoutHandler(void)
 {
-    proc_exit(1);
+    _exit(1);
 }


diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 11c3f132a1..36a067c924 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -416,3 +416,20 @@ on_exit_reset(void)
     on_proc_exit_index = 0;
     reset_on_dsm_detach();
 }
+
+/* ----------------------------------------------------------------
+ *        check_on_shmem_exit_lists_are_empty
+ *
+ *        Debugging check that no shmem cleanup handlers have been registered
+ *        prematurely in the current process.
+ * ----------------------------------------------------------------
+ */
+void
+check_on_shmem_exit_lists_are_empty(void)
+{
+    if (before_shmem_exit_index)
+        elog(FATAL, "before_shmem_exit has been called prematurely");
+    if (on_shmem_exit_index)
+        elog(FATAL, "on_shmem_exit has been called prematurely");
+    /* Checking DSM detach state seems unnecessary given the above */
+}
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 462fe46341..88994fdc26 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -72,6 +72,7 @@ extern void on_shmem_exit(pg_on_exit_callback function, Datum arg);
 extern void before_shmem_exit(pg_on_exit_callback function, Datum arg);
 extern void cancel_before_shmem_exit(pg_on_exit_callback function, Datum arg);
 extern void on_exit_reset(void);
+extern void check_on_shmem_exit_lists_are_empty(void);

 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;

Re: SIGQUIT handling, redux

From
Tom Lane
Date:
I wrote:
> I looked briefly at the idea of postponing ProcessStartupPacket
> until InitPostgres has set up a fairly normal environment.  It
> seems like it'd take a fair amount of refactoring.  I really
> doubt it's worth the effort, even though the result would be
> arguably cleaner logic than what we have now.

I felt more ambitious this morning and decided to take a harder look
at this idea.  But I soon realized that there would be a concrete
disadvantage to delaying ProcessStartupPacket: until we have done that,
we don't have the correct value for FrontendProtocol so there is a
problem with reporting startup-time failures to the client.  At least
some such failures, such as "too many clients already", are pretty
routine so we don't want to downgrade their user-friendliness.

If memory serves, libpq has some ability to cope with a v2 error message
even when it's expecting v3.  But I wouldn't bet on that being true of
all client libraries, and anyway it's a very under-tested code path.

So I think we'd better go with the simple fix I showed before.
It's simple enough that maybe we could back-patch it, once it's
aged awhile in HEAD.  OTOH, given the lack of field reports of
trouble here, I'm not sure back-patching is worth the risk.

            regards, tom lane



Re: SIGQUIT handling, redux

From
Andres Freund
Date:
Hi,

On 2020-09-11 11:52:55 -0400, Tom Lane wrote:
> It's simple enough that maybe we could back-patch it, once it's
> aged awhile in HEAD.  OTOH, given the lack of field reports of
> trouble here, I'm not sure back-patching is worth the risk.

FWIW, looking at collected stack traces in azure, there's a slow but steady
stream of crashes below StartupPacketTimeoutHandler.  Most seem to be things
like

libcrypto->malloc->StartupPacketTimeoutHandler->proc_exit->socket_close->free->crash

and there are a few other variants, some where the stack apparently was not
decipherable for the relevant tooling.

Note that this wouldn't even include cases where this caused hangs - which is
quite common IME.


Unsurprisingly just in versions before 14, the release where this change went in.

I think that might be enough evidence for backpatching the commit? I've not
heard of issues due to the checks in check_on_shmem_exit_lists_are_empty().

Greetings,

Andres Freund



Re: SIGQUIT handling, redux

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2020-09-11 11:52:55 -0400, Tom Lane wrote:
>> It's simple enough that maybe we could back-patch it, once it's
>> aged awhile in HEAD.  OTOH, given the lack of field reports of
>> trouble here, I'm not sure back-patching is worth the risk.

> FWIW, looking at collected stack traces in azure, there's a slow but steady
> stream of crashes below StartupPacketTimeoutHandler. ...
> Unsurprisingly just in versions before 14, the release where this change went in.
> I think that might be enough evidence for backpatching the commit? I've not
> heard of issues due to the checks in check_on_shmem_exit_lists_are_empty().

I'd be willing to take a look at this in a few weeks when $real_life
is a bit less demanding.  Right before minor releases would likely be
a bad idea anyway.

            regards, tom lane



Re: SIGQUIT handling, redux

From
Andres Freund
Date:
Hi,

On 2023-08-02 12:35:19 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2020-09-11 11:52:55 -0400, Tom Lane wrote:
> >> It's simple enough that maybe we could back-patch it, once it's
> >> aged awhile in HEAD.  OTOH, given the lack of field reports of
> >> trouble here, I'm not sure back-patching is worth the risk.
> 
> > FWIW, looking at collected stack traces in azure, there's a slow but steady
> > stream of crashes below StartupPacketTimeoutHandler. ...
> > Unsurprisingly just in versions before 14, the release where this change went in.
> > I think that might be enough evidence for backpatching the commit? I've not
> > heard of issues due to the checks in check_on_shmem_exit_lists_are_empty().
> 
> I'd be willing to take a look at this in a few weeks when $real_life
> is a bit less demanding.

Cool.


> Right before minor releases would likely be a bad idea anyway.

Agreed. I had just waded through the stacks, so I thought it'd be worth
bringing up, didn't intend to imply we should backpatch immediately.


Aside: Analyzing this kind of thing at scale is made considerably more painful
by "expected" ereport(PANIC)s (say, out of disk space during WAL writes)
calling abort() and dumping core, while there are other PANICs you really
do want cores of.

Greetings,

Andres Freund