Re: Weird failure with latches in curculio on v15 - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: Weird failure with latches in curculio on v15
Date
Msg-id 20230202220113.GA3945808@nathanxps13
Whole thread Raw
In response to Re: Weird failure with latches in curculio on v15  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Weird failure with latches in curculio on v15
List pgsql-hackers
On Thu, Feb 02, 2023 at 04:14:54PM -0500, Robert Haas wrote:
> +       /*
> +        * When exitOnSigterm is set and we are in the startup process, use the
> +        * special wrapper for system() that enables exiting immediately upon
> +        * receiving SIGTERM.  This ensures we can break out of system() if
> +        * required.
> +        */
> 
> This comment, for me, raises more questions than it answers. Why do we
> only do this in the startup process?

Currently, this functionality only exists in the startup process because it
is only used for restore_command.  More below...

> Also, and this part is not the fault of this patch but a defect of the
> pre-existing comments, under what circumstances do we not want to exit
> when we get a SIGTERM? It's standard behavior for PostgreSQL backends
> to exit when they receive SIGTERM, so the question isn't why we
> sometimes exit immediately but why we ever don't. The existing code
> calls ExecuteRecoveryCommand with exitOnSigterm true in some cases and
> false in other cases, and AFAICS there are zero words of comments
> explaining the reasoning.

I've been digging into the history here.  This e-mail seems to have the
most context [0].  IIUC this was intended to prevent "fast" shutdowns from
escalating to "immediate" shutdowns because the restore command died
unexpectedly.  This doesn't apply to archive_cleanup_command because we
don't FATAL if it dies unexpectedly.  It seems like this idea should apply
to recovery_end_command, too, but AFAICT it doesn't use the same approach.
My guess is that this hasn't come up because it's less likely that both 1)
recovery_end_command is used and 2) someone initiates shutdown while it is
running.

BTW the relevant commits are cdd46c7 (added SIGTERM handling for
restore_command), 9e403c2 (added recovery_end_command), and c21ac0b (added
what is today called archive_cleanup_command).

> +       if (exitOnSigterm && MyBackendType == B_STARTUP)
> +               rc = RunInterruptibleShellCommand(command);
> +       else
> +               rc = system(command);
> 
> And this looks like pure magic. I'm all in favor of not relying on
> system(), but using it under some opaque set of conditions and
> otherwise doing something else is not the way. At the very least this
> needs to be explained a whole lot better.

If we applied this exit-on-SIGTERM behavior to recovery_end_command, I
think we could combine failOnSignal and exitOnSigterm into one flag, and
then it might be a little easier to explain what is going on.  In any case,
I agree that this deserves a lengthy explanation, which I'll continue to
work on.

[0] https://postgr.es/m/499047FE.9090407%40enterprisedb.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Remove unused code related to iso-8859-1 type
Next
From: Thomas Munro
Date:
Subject: Re: Move defaults toward ICU in 16?