Re: Weird failure with latches in curculio on v15 - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: Weird failure with latches in curculio on v15 |
Date | |
Msg-id | Y9sam108o4mxZFiS@paquier.xyz Whole thread Raw |
In response to | Re: Weird failure with latches in curculio on v15 (Nathan Bossart <nathandbossart@gmail.com>) |
Responses |
Re: Weird failure with latches in curculio on v15
Re: Weird failure with latches in curculio on v15 |
List | pgsql-hackers |
On Wed, Feb 01, 2023 at 02:35:55PM -0800, Nathan Bossart wrote: > Here is a first draft for the proposed stopgap fix. If we want to proceed > with this, I can provide patches for the back branches. > + /* > + * PreRestoreCommand() is used to tell the SIGTERM handler for the startup > + * process that it is okay to proc_exit() right away on SIGTERM. This is > + * done for the duration of the system() call because there isn't a good > + * way to break out while it is executing. Since we might call proc_exit() > + * in a signal handler here, it is extremely important that nothing but the > + * system() call happens between the calls to PreRestoreCommand() and > + * PostRestoreCommand(). Any additional code must go before or after this > + * section. > + */ > + if (exitOnSigterm) > + PreRestoreCommand(); > + > rc = system(command); > + > + if (exitOnSigterm) > + PostRestoreCommand(); > + > pgstat_report_wait_end(); Hmm. Isn't that something that we should also document in startup.c where both routines are defined? If we begin to use PreRestoreCommand() and PostRestoreCommand() in more code paths in the future, that could be again an issue. That looks enough to me to reduce the window back to what it was before 9a740f8, as exitOnSigterm is only used for restore_command. There is a different approach possible here: rely more on wait_event_info rather than failOnSignal and exitOnSigterm to decide which code path should do what. Andres Freund wrote: > - the error message for a failed restore command seems to have gotten > worse: > could not restore file \"%s\" from archive: %s" > -> > "%s \"%s\": %s", commandName, command IMO, we don't lose any context with this method: the command type and the command string itself are the bits actually relevant. Perhaps something like that would be more intuitive? One idea: "could not execute command for %s: %s", commandName, command > - shell_* imo is not a good namespace for something called from xlog.c, > xlogarchive.c. I realize the intention is that shell_archive.c is > going to be its own "restore module", but for now it imo looks odd shell_restore.c does not sound that bad to me, FWIW. The parallel with the archive counterparts is here. My recent history is not that good when it comes to naming, based on the feedback I received, though. > And there's just plenty other stuff in the 14bdb3f13de 9a740f81eb0 that > doesn't look right: > - We now have two places open-coding what BuildRestoreCommand did Yeah, BuildRestoreCommand() was just a small wrapper on top of the new percentrepl.c, making it rather irrelevant at this stage, IMO. For the two code paths where it was called. > - The comment moved out of RestoreArchivedFile() doesn't seems less > useful at its new location We are talking about that: - /* - * Remember, we rollforward UNTIL the restore fails so failure here is - * just part of the process... that makes it difficult to determine - * whether the restore failed because there isn't an archive to restore, - * or because the administrator has specified the restore program - * incorrectly. We have to assume the former. - * - * However, if the failure was due to any sort of signal, it's best to - * punt and abort recovery. (If we "return false" here, upper levels will - * assume that recovery is complete and start up the database!) It's - * essential to abort on child SIGINT and SIGQUIT, because per spec - * system() ignores SIGINT and SIGQUIT while waiting; if we see one of - * those it's a good bet we should have gotten it too. - * - * On SIGTERM, assume we have received a fast shutdown request, and exit - * cleanly. It's pure chance whether we receive the SIGTERM first, or the - * child process. If we receive it first, the signal handler will call - * proc_exit, otherwise we do it here. If we or the child process received - * SIGTERM for any other reason than a fast shutdown request, postmaster - * will perform an immediate shutdown when it sees us exiting - * unexpectedly. - * - * We treat hard shell errors such as "command not found" as fatal, too. - */ The failure processing is stuck within the way we build and handle the command given down to system(), so keeping this in shell_restore.c (or whatever name you think would be a better fit) makes sense to me. Now, thinking a bit more of this, we could just push the description down to ExecuteRecoveryCommand(), that actually does the work, adaptinh the comment based on the refactored internals of the routine. -- Michael
Attachment
pgsql-hackers by date: