Re: Weird failure with latches in curculio on v15 - Mailing list pgsql-hackers
From | Nathan Bossart |
---|---|
Subject | Re: Weird failure with latches in curculio on v15 |
Date | |
Msg-id | 20230202220113.GA3945808@nathanxps13 Whole thread Raw |
In response to | Re: Weird failure with latches in curculio on v15 (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Weird failure with latches in curculio on v15
|
List | pgsql-hackers |
On Thu, Feb 02, 2023 at 04:14:54PM -0500, Robert Haas wrote: > + /* > + * When exitOnSigterm is set and we are in the startup process, use the > + * special wrapper for system() that enables exiting immediately upon > + * receiving SIGTERM. This ensures we can break out of system() if > + * required. > + */ > > This comment, for me, raises more questions than it answers. Why do we > only do this in the startup process? Currently, this functionality only exists in the startup process because it is only used for restore_command. More below... > Also, and this part is not the fault of this patch but a defect of the > pre-existing comments, under what circumstances do we not want to exit > when we get a SIGTERM? It's standard behavior for PostgreSQL backends > to exit when they receive SIGTERM, so the question isn't why we > sometimes exit immediately but why we ever don't. The existing code > calls ExecuteRecoveryCommand with exitOnSigterm true in some cases and > false in other cases, and AFAICS there are zero words of comments > explaining the reasoning. I've been digging into the history here. This e-mail seems to have the most context [0]. IIUC this was intended to prevent "fast" shutdowns from escalating to "immediate" shutdowns because the restore command died unexpectedly. This doesn't apply to archive_cleanup_command because we don't FATAL if it dies unexpectedly. It seems like this idea should apply to recovery_end_command, too, but AFAICT it doesn't use the same approach. My guess is that this hasn't come up because it's less likely that both 1) recovery_end_command is used and 2) someone initiates shutdown while it is running. BTW the relevant commits are cdd46c7 (added SIGTERM handling for restore_command), 9e403c2 (added recovery_end_command), and c21ac0b (added what is today called archive_cleanup_command). > + if (exitOnSigterm && MyBackendType == B_STARTUP) > + rc = RunInterruptibleShellCommand(command); > + else > + rc = system(command); > > And this looks like pure magic. I'm all in favor of not relying on > system(), but using it under some opaque set of conditions and > otherwise doing something else is not the way. At the very least this > needs to be explained a whole lot better. If we applied this exit-on-SIGTERM behavior to recovery_end_command, I think we could combine failOnSignal and exitOnSigterm into one flag, and then it might be a little easier to explain what is going on. In any case, I agree that this deserves a lengthy explanation, which I'll continue to work on. [0] https://postgr.es/m/499047FE.9090407%40enterprisedb.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: