Thread: Weird failure with latches in curculio on v15
Hi all,

While browsing the buildfarm, I have noticed this failure on curculio:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-02-01%2001%3A05%3A17

The test that has reported a failure is the check on the archive module callback:

# Failed test 'check shutdown callback of shell archive module'
# at t/020_archive_status.pl line 248.
# Looks like you failed 1 test of 17.
[02:28:06] t/020_archive_status.pl ..............
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/17 subtests

Looking closer, this is a result of an assertion failure in the latch code:

2023-02-01 02:28:05.615 CET [6961:8] LOG: received fast shutdown request
2023-02-01 02:28:05.615 CET [6961:9] LOG: aborting any active transactions
2023-02-01 02:28:05.616 CET [30681:9] LOG: process 30681 releasing ProcSignal slot 33, but it contains 0
TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File: "latch.c", Line: 451, PID: 30681)

The information available in standby2.log shows that 30681 is the startup process. I am not sure what all that means, yet.

Thoughts or comments welcome.
--
Michael
Hi,

On 2023-02-01 10:53:17 +0900, Michael Paquier wrote:
> While browsing the buildfarm, I have noticed this failure on curculio:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-02-01%2001%3A05%3A17
>
> The test that has reported a failure is the check on the archive
> module callback:
> # Failed test 'check shutdown callback of shell archive module'
> # at t/020_archive_status.pl line 248.
> # Looks like you failed 1 test of 17.
> [02:28:06] t/020_archive_status.pl ..............
> Dubious, test returned 1 (wstat 256, 0x100)
> Failed 1/17 subtests
>
> Looking closer, this is a result of an assertion failure in the latch
> code:
> 2023-02-01 02:28:05.615 CET [6961:8] LOG: received fast shutdown request
> 2023-02-01 02:28:05.615 CET [6961:9] LOG: aborting any active transactions
> 2023-02-01 02:28:05.616 CET [30681:9] LOG: process 30681 releasing ProcSignal slot 33, but it contains 0
> TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File: "latch.c", Line: 451, PID: 30681)

Given the ProcSignal LOG message before it, I don't think this is about latches.

> The information available in standby2.log shows that 30681 is the
> startup process. I am not sure what all that means, yet.
>
> Thoughts or comments welcome.

Perhaps a wild write overwriting shared memory state?

Greetings,

Andres Freund
My database of assertion failures has four like that, all 15 and master:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-02-01%2001:05:17
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-01-11%2011:16:40
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2022-11-22%2012:19:21
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-11-17%2021:47:02

It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM handler which is allowed to call that while in_restore_command is true.

Here's a different one, some kind of latch corruption in the WAL writer under 017_shm:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2022-01-20%2016:26:54
Hi,

On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
> My database of assertion failures has four like that, all 15 and master:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-02-01%2001:05:17
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-01-11%2011:16:40
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2022-11-22%2012:19:21
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-11-17%2021:47:02
>
> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
> handler which is allowed to call that while in_restore_command is
> true.

Ugh, no wonder we're getting crashes. This whole business seems bogus as hell.

RestoreArchivedFile():
	...
	/*
	 * Check signals before restore command and reset afterwards.
	 */
	PreRestoreCommand();

	/*
	 * Copy xlog from archival storage to XLOGDIR
	 */
	ret = shell_restore(xlogfname, xlogpath, lastRestartPointFname);

	PostRestoreCommand();

/* SIGTERM: set flag to abort redo and exit */
static void
StartupProcShutdownHandler(SIGNAL_ARGS)
{
	int			save_errno = errno;

	if (in_restore_command)
		proc_exit(1);
	else
		shutdown_requested = true;
	WakeupRecovery();

	errno = save_errno;
}

Where PreRestoreCommand()/PostRestoreCommand() set in_restore_command.

There's *a lot* of stuff happening inside shell_restore() that's not compatible with doing proc_exit() inside a signal handler. We're allocating memory! We interact with stdout. There's also the fact that system() isn't signal safe, but that's much less likely to be a problem.

This appears to have gotten worse over a sequence of commits. The following commits each added something between PreRestoreCommand() and PostRestoreCommand().

commit 1b06d7bac901e5fd20bba597188bae2882bf954b
Author: Fujii Masao <fujii@postgresql.org>
Date:   2021-11-22 10:28:21 +0900

    Report wait events for local shell commands like archive_command.

added pgstat_report_wait_start/end. Unlikely to cause big issues, but not good.
commit 7fed801135bae14d63b11ee4a10f6083767046d8
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2022-08-29 13:55:38 -0400

    Clean up inconsistent use of fflush().

Made it a bit worse by adding an fflush(). That certainly seems like it could cause hangs.

commit 9a740f81eb02e04179d78f3df2ce671276c27b07
Author: Michael Paquier <michael@paquier.xyz>
Date:   2023-01-16 16:31:43 +0900

    Refactor code in charge of running shell-based recovery commands

which completely broke the mechanism. We suddenly run the entirety of shell_restore(), which does pallocs etc. to build the string passed to system(), and raises errors, all within a section in which a signal handler can invoke proc_exit(). That's just completely broken.

Sorry, but particularly in this area, you've got to be a heck of a lot more careful.

I don't see a choice but to revert the recent changes. They need a fairly large rewrite.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
>> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
>> handler which is allowed to call that while in_restore_command is
>> true.

> Ugh, no wonder we're getting crashes. This whole business seems bogus as
> hell.

Indeed :-(

> I don't see a choice but to revert the recent changes. They need a
> fairly large rewrite.

9a740f81e clearly made things a lot worse, but it wasn't great before. Can we see a way forward to removing the problem entirely?

The fundamental issue is that we have no good way to break out of system(), and I think the original idea was that in_restore_command would be set *only* for the duration of the system() call. That's clearly been lost sight of completely, but maybe as a stopgap we could try to get back to that.

			regards, tom lane
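[Editorial note: for illustration only, the stopgap Tom describes could look roughly like the sketch below. The names echo the thread but this is not the actual startup.c code; _exit(1) stands in for proc_exit(1). The point is that the flag brackets nothing but the system() call, so the handler can only short-circuit inside that narrow window.]

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sketch of the stopgap; not the actual PostgreSQL code. */
static volatile sig_atomic_t in_restore_command = 0;
static volatile sig_atomic_t shutdown_requested = 0;

/* SIGTERM handler: may only short-circuit while the flag is set. */
void
startup_sigterm_handler(int signo)
{
	(void) signo;
	if (in_restore_command)
		_exit(1);				/* stands in for proc_exit(1) */
	shutdown_requested = 1;
}

/*
 * Run the restore command with the exit-on-SIGTERM window reduced to the
 * system() call itself: no palloc, no fflush, no error reporting may
 * happen while in_restore_command is set.
 */
int
run_restore_command(const char *command)
{
	int			rc;

	in_restore_command = 1;
	rc = system(command);		/* nothing else inside the window */
	in_restore_command = 0;

	return rc;
}
```

With no SIGTERM pending, the wrapper simply reports the command's wait status, which callers can decode with the usual WIFEXITED/WEXITSTATUS macros.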
On Wed, Feb 01, 2023 at 10:12:26AM -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: >> On 2023-02-01 16:21:16 +1300, Thomas Munro wrote: >>> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM >>> handler which is allowed to call that while in_restore_command is >>> true. > >> Ugh, no wonder we're getting crashes. This whole business seems bogus as >> hell. > > Indeed :-( Ugh. My bad. > The fundamental issue is that we have no good way to break out > of system(), and I think the original idea was that > in_restore_command would be set *only* for the duration of the > system() call. That's clearly been lost sight of completely, > but maybe as a stopgap we could try to get back to that. +1. I'll produce some patches. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi,

On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
>>> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
>>> handler which is allowed to call that while in_restore_command is
>>> true.
>
>> Ugh, no wonder we're getting crashes. This whole business seems bogus as
>> hell.
>
> Indeed :-(
>
>> I don't see a choice but to revert the recent changes. They need a
>> fairly large rewrite.
>
> 9a740f81e clearly made things a lot worse, but it wasn't great
> before. Can we see a way forward to removing the problem entirely?

Yea, I think we can - we should stop relying on system(). If we instead run the command properly as a subprocess, we don't need to do bad things in the signal handler anymore.

> The fundamental issue is that we have no good way to break out
> of system(), and I think the original idea was that
> in_restore_command would be set *only* for the duration of the
> system() call. That's clearly been lost sight of completely,
> but maybe as a stopgap we could try to get back to that.

We could push the functions setting in_restore_command down into ExecuteRecoveryCommand(). But I don't think that'd end up necessarily being right either - we'd now use the mechanism in places we previously didn't (cleanup/end commands).

And there's just plenty of other stuff in 14bdb3f13de and 9a740f81eb0 that doesn't look right:

- We now have two places open-coding what BuildRestoreCommand did

- I'm doubtful that the new shell_* functions are the base for a good
  API to abstract restoring files

- the error message for a failed restore command seems to have gotten
  worse:
    could not restore file \"%s\" from archive: %s"
  ->
    "%s \"%s\": %s", commandName, command

- shell_* imo is not a good namespace for something called from xlog.c,
  xlogarchive.c. I realize the intention is that shell_archive.c is
  going to be its own "restore module", but for now it imo looks odd

- The comment moved out of RestoreArchivedFile() seems less useful at
  its new location

- The explanation of why we use GetOldestRestartPoint() is halfway lost

My name is listed as the first Reviewed-by, but I certainly haven't done any meaningful review of these patches. I just replied to the top-level email proposing "recovery modules".

Greetings,

Andres Freund
On Wed, Feb 1, 2023 at 11:58 AM Andres Freund <andres@anarazel.de> wrote: > > 9a740f81e clearly made things a lot worse, but it wasn't great > > before. Can we see a way forward to removing the problem entirely? > > Yea, I think we can - we should stop relying on system(). If we instead > run the command properly as a subprocess, we don't need to do bad things > in the signal handler anymore. I like the idea of not relying on system(). In most respects, doing fork() + exec() ourselves seems superior. We can control where the output goes, what we do while waiting, etc. But system() runs the command through the shell, so that for example you don't have to invent your own way of splitting a string into words to be passed to exec[whatever](). I've never understood how you're supposed to get that behavior other than by calling system(). -- Robert Haas EDB: http://www.enterprisedb.com
Hi,

On 2023-02-01 12:08:24 -0500, Robert Haas wrote:
> On Wed, Feb 1, 2023 at 11:58 AM Andres Freund <andres@anarazel.de> wrote:
>>> 9a740f81e clearly made things a lot worse, but it wasn't great
>>> before. Can we see a way forward to removing the problem entirely?
>>
>> Yea, I think we can - we should stop relying on system(). If we instead
>> run the command properly as a subprocess, we don't need to do bad things
>> in the signal handler anymore.
>
> I like the idea of not relying on system(). In most respects, doing
> fork() + exec() ourselves seems superior. We can control where the
> output goes, what we do while waiting, etc. But system() runs the
> command through the shell, so that for example you don't have to
> invent your own way of splitting a string into words to be passed to
> exec[whatever](). I've never understood how you're supposed to get
> that behavior other than by calling system().

We could just exec the shell in the forked process, using -c to invoke the command. That should give us pretty much the same efficiency as system(), with a lot more control.

I think we already do that somewhere. <dig>. Ah, yes, spawn_process() in pg_regress.c.

I suspect we couldn't use exec for restore_command etc., as I think it's not uncommon to use && in the command.

Perhaps we should abstract the relevant pieces of spawn_process() into something more general? The OS specifics are sufficiently complicated that I don't think it'd be good to have multiple copies.

It's too bad that we have the history of passing things to shell, otherwise we could define a common argument handling of the GUC and just execve ourselves, but that ship has sailed.

Greetings,

Andres Freund
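[Editorial note: a minimal sketch of the fork-plus-exec-the-shell approach, with a hypothetical helper name and trimmed error handling. The shell's -c keeps syntax such as && working, while the parent holds the child pid and could poll with WNOHANG, service interrupts, or kill the child while waiting. "/bin/sh" is hardcoded purely for illustration; also note that per spec system() ignores SIGINT/SIGQUIT in the parent while waiting, which a real replacement would have to decide how to handle.]

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sketch; "/bin/sh" is hardcoded purely for illustration. */
int
run_shell_command(const char *command)
{
	pid_t		pid = fork();
	int			status;

	if (pid < 0)
		return -1;

	if (pid == 0)
	{
		/* Child: let the shell parse the command line, as system() does. */
		execl("/bin/sh", "sh", "-c", command, (char *) NULL);
		_exit(127);				/* exec failed; mimic the shell's convention */
	}

	/*
	 * Parent: a real implementation could loop with WNOHANG here and check
	 * for shutdown requests between iterations instead of being stuck until
	 * the child exits.
	 */
	while (waitpid(pid, &status, 0) < 0)
	{
		if (errno != EINTR)
			return -1;
	}

	return status;				/* same wait-status encoding as system() */
}
```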
Andres Freund <andres@anarazel.de> writes: > On 2023-02-01 12:08:24 -0500, Robert Haas wrote: >> I like the idea of not relying on system(). In most respects, doing >> fork() + exec() ourselves seems superior. We can control where the >> output goes, what we do while waiting, etc. But system() runs the >> command through the shell, so that for example you don't have to >> invent your own way of splitting a string into words to be passed to >> exec[whatever](). I've never understood how you're supposed to get >> that behavior other than by calling system(). > We could just exec the shell in the forked process, using -c to invoke > the command. That should give us pretty much the same efficiency as > system(), with a lot more control. The main thing that system() brings to the table is platform-specific knowledge of where the shell is. I'm not very sure that we want to wire in "/bin/sh". regards, tom lane
On Wed, Feb 01, 2023 at 08:58:01AM -0800, Andres Freund wrote:
> On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
>> The fundamental issue is that we have no good way to break out
>> of system(), and I think the original idea was that
>> in_restore_command would be set *only* for the duration of the
>> system() call. That's clearly been lost sight of completely,
>> but maybe as a stopgap we could try to get back to that.
>
> We could push the functions setting in_restore_command down into
> ExecuteRecoveryCommand(). But I don't think that'd end up necessarily
> being right either - we'd now use the mechanism in places we previously
> didn't (cleanup/end commands).

Right, we'd only want to set it for restore_command. I think that's doable.

> And there's just plenty of other stuff in 14bdb3f13de and 9a740f81eb0 that
> doesn't look right:
> - We now have two places open-coding what BuildRestoreCommand did

This was done because BuildRestoreCommand() had become a thin wrapper around replace_percent_placeholders(). I can add it back if you don't think this was the right decision.

> - I'm doubtful that the new shell_* functions are the base for a good
>   API to abstract restoring files

Why?

> - the error message for a failed restore command seems to have gotten
>   worse:
>     could not restore file \"%s\" from archive: %s"
>   ->
>     "%s \"%s\": %s", commandName, command

Okay, I'll work on improving this message.

> - shell_* imo is not a good namespace for something called from xlog.c,
>   xlogarchive.c. I realize the intention is that shell_archive.c is
>   going to be its own "restore module", but for now it imo looks odd

What do you propose instead? FWIW this should go away with recovery modules. This is just an intermediate state to simplify those patches.

> - The comment moved out of RestoreArchivedFile() seems less useful at
>   its new location

Where do you think it should go?

> - The explanation of why we use GetOldestRestartPoint() is halfway lost

Okay, I'll work on adding more context here.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi, On 2023-02-01 12:27:19 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2023-02-01 12:08:24 -0500, Robert Haas wrote: > >> I like the idea of not relying on system(). In most respects, doing > >> fork() + exec() ourselves seems superior. We can control where the > >> output goes, what we do while waiting, etc. But system() runs the > >> command through the shell, so that for example you don't have to > >> invent your own way of splitting a string into words to be passed to > >> exec[whatever](). I've never understood how you're supposed to get > >> that behavior other than by calling system(). > > > We could just exec the shell in the forked process, using -c to invoke > > the command. That should give us pretty much the same efficiency as > > system(), with a lot more control. > > The main thing that system() brings to the table is platform-specific > knowledge of where the shell is. I'm not very sure that we want to > wire in "/bin/sh". We seem to be doing OK with using SHELLPROG in pg_regress, which just seems to be using $SHELL from the build environment. Greetings, Andres Freund
On Wed, Feb 01, 2023 at 09:58:06AM -0800, Nathan Bossart wrote: > On Wed, Feb 01, 2023 at 08:58:01AM -0800, Andres Freund wrote: >> On 2023-02-01 10:12:26 -0500, Tom Lane wrote: >>> The fundamental issue is that we have no good way to break out >>> of system(), and I think the original idea was that >>> in_restore_command would be set *only* for the duration of the >>> system() call. That's clearly been lost sight of completely, >>> but maybe as a stopgap we could try to get back to that. >> >> We could push the functions setting in_restore_command down into >> ExecuteRecoveryCommand(). But I don't think that'd end up necessarily >> being right either - we'd now use the mechanism in places we previously >> didn't (cleanup/end commands). > > Right, we'd only want to set it for restore_command. I think that's > doable. Here is a first draft for the proposed stopgap fix. If we want to proceed with this, I can provide patches for the back branches. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Feb 01, 2023 at 10:18:27AM -0800, Andres Freund wrote:
> On 2023-02-01 12:27:19 -0500, Tom Lane wrote:
>> The main thing that system() brings to the table is platform-specific
>> knowledge of where the shell is. I'm not very sure that we want to
>> wire in "/bin/sh".
>
> We seem to be doing OK with using SHELLPROG in pg_regress, which just
> seems to be using $SHELL from the build environment.

It looks like this had better centralize a bit more of the logic from pg_regress.c if that were to happen, to avoid more fuzzy logic with WIN32. That becomes invasive for a back-patch.

By the way, there is something that's itching me a bit here. 9a740f8 has enlarged the window between PreRestoreCommand() and PostRestoreCommand() by a lot, yet curculio has reported a failure on REL_15_STABLE, where we only manipulate my_wait_event_info while the flag is on. Or am I getting that right that there is no way out of it unless we remove the dependency on system() even in the back-branches? Could there be an extra missing piece here?

--
Michael
On Wed, Feb 01, 2023 at 02:35:55PM -0800, Nathan Bossart wrote:
> Here is a first draft for the proposed stopgap fix. If we want to proceed
> with this, I can provide patches for the back branches.

> +	/*
> +	 * PreRestoreCommand() is used to tell the SIGTERM handler for the startup
> +	 * process that it is okay to proc_exit() right away on SIGTERM.  This is
> +	 * done for the duration of the system() call because there isn't a good
> +	 * way to break out while it is executing.  Since we might call proc_exit()
> +	 * in a signal handler here, it is extremely important that nothing but the
> +	 * system() call happens between the calls to PreRestoreCommand() and
> +	 * PostRestoreCommand().  Any additional code must go before or after this
> +	 * section.
> +	 */
> +	if (exitOnSigterm)
> +		PreRestoreCommand();
> +
> 	rc = system(command);
> +
> +	if (exitOnSigterm)
> +		PostRestoreCommand();
> +
> 	pgstat_report_wait_end();

Hmm. Isn't that something that we should also document in startup.c where both routines are defined? If we begin to use PreRestoreCommand() and PostRestoreCommand() in more code paths in the future, that could be again an issue.

That looks enough to me to reduce the window back to what it was before 9a740f8, as exitOnSigterm is only used for restore_command. There is a different approach possible here: rely more on wait_event_info rather than failOnSignal and exitOnSigterm to decide which code path should do what.

Andres Freund wrote:
> - the error message for a failed restore command seems to have gotten
>   worse:
>     could not restore file \"%s\" from archive: %s"
>   ->
>     "%s \"%s\": %s", commandName, command

IMO, we don't lose any context with this method: the command type and the command string itself are the bits actually relevant. Perhaps something like this would be more intuitive? One idea:
"could not execute command for %s: %s", commandName, command

> - shell_* imo is not a good namespace for something called from xlog.c,
>   xlogarchive.c. I realize the intention is that shell_archive.c is
>   going to be its own "restore module", but for now it imo looks odd

shell_restore.c does not sound that bad to me, FWIW. The parallel with the archive counterparts is here. My recent history is not that good when it comes to naming, based on the feedback I received, though.

> And there's just plenty of other stuff in 14bdb3f13de and 9a740f81eb0 that
> doesn't look right:
> - We now have two places open-coding what BuildRestoreCommand did

Yeah, BuildRestoreCommand() was just a small wrapper on top of the new percentrepl.c, making it rather irrelevant at this stage, IMO. For the two code paths where it was called.

> - The comment moved out of RestoreArchivedFile() seems less useful at
>   its new location

We are talking about that:

-	/*
-	 * Remember, we rollforward UNTIL the restore fails so failure here is
-	 * just part of the process... that makes it difficult to determine
-	 * whether the restore failed because there isn't an archive to restore,
-	 * or because the administrator has specified the restore program
-	 * incorrectly.  We have to assume the former.
-	 *
-	 * However, if the failure was due to any sort of signal, it's best to
-	 * punt and abort recovery.  (If we "return false" here, upper levels will
-	 * assume that recovery is complete and start up the database!) It's
-	 * essential to abort on child SIGINT and SIGQUIT, because per spec
-	 * system() ignores SIGINT and SIGQUIT while waiting; if we see one of
-	 * those it's a good bet we should have gotten it too.
-	 *
-	 * On SIGTERM, assume we have received a fast shutdown request, and exit
-	 * cleanly.  It's pure chance whether we receive the SIGTERM first, or the
-	 * child process.  If we receive it first, the signal handler will call
-	 * proc_exit, otherwise we do it here.  If we or the child process received
-	 * SIGTERM for any other reason than a fast shutdown request, postmaster
-	 * will perform an immediate shutdown when it sees us exiting
-	 * unexpectedly.
-	 *
-	 * We treat hard shell errors such as "command not found" as fatal, too.
-	 */

The failure processing is stuck within the way we build and handle the command given down to system(), so keeping this in shell_restore.c (or whatever name you think would be a better fit) makes sense to me. Now, thinking a bit more of this, we could just push the description down to ExecuteRecoveryCommand(), that actually does the work, adapting the comment based on the refactored internals of the routine.

--
Michael
Michael Paquier <michael@paquier.xyz> writes: > Hmm. Isn't that something that we should also document in startup.c > where both routines are defined? If we begin to use > PreRestoreCommand() and PostRestoreCommand() in more code paths in the > future, that could be again an issue. I was vaguely wondering about removing both of those functions in favor of an integrated function that does a system() call with those things before and after it. regards, tom lane
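[Editorial note: a hypothetical shape for Tom's integrated function, with invented names. By fusing the set/clear of the flag to the system() call inside one wrapper, the flag becomes private to a single file and callers can no longer widen the unsafe window; _exit(1) again stands in for proc_exit(1).]

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Private flag: no caller can leave it set by accident. */
static volatile sig_atomic_t in_system_call = 0;

/* What the startup process's SIGTERM handler would consult. */
void
StartupExitIfInSystem(void)
{
	if (in_system_call)
		_exit(1);				/* stands in for proc_exit(1) */
}

/*
 * Hypothetical replacement for the PreRestoreCommand(); system();
 * PostRestoreCommand() sequence: one call, with the flag guaranteed to
 * bracket exactly the system() call and nothing else.
 */
int
SystemWithSigtermExit(const char *command)
{
	int			rc;

	in_system_call = 1;
	rc = system(command);
	in_system_call = 0;

	return rc;
}
```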
On Wed, Feb 01, 2023 at 09:34:44PM -0500, Tom Lane wrote: > Michael Paquier <michael@paquier.xyz> writes: >> Hmm. Isn't that something that we should also document in startup.c >> where both routines are defined? If we begin to use >> PreRestoreCommand() and PostRestoreCommand() in more code paths in the >> future, that could be again an issue. > > I was vaguely wondering about removing both of those functions > in favor of an integrated function that does a system() call > with those things before and after it. It seems to me that this is pretty much the same as storing in_restore_command in shell_restore.c, and that for recovery modules this comes down to the addition of an extra callback called in startup.c to check if the flag is up or not. Now the patch is doing things the opposite way: like on HEAD, store the flag in startup.c but switch it at will with the routines in startup.c. I find the approach of the patch a bit more intuitive, TBH, as that makes the interface simpler for other recovery modules that may want to switch the flag back-and-forth, and I suspect that there may be cases in recovery modules where we'd still want to switch the flag, but not necessarily link it to system(). -- Michael
On Thu, Feb 02, 2023 at 01:24:15PM +0900, Michael Paquier wrote:
> On Wed, Feb 01, 2023 at 09:34:44PM -0500, Tom Lane wrote:
>> I was vaguely wondering about removing both of those functions
>> in favor of an integrated function that does a system() call
>> with those things before and after it.
>
> It seems to me that this is pretty much the same as storing
> in_restore_command in shell_restore.c, and that for recovery modules
> this comes down to the addition of an extra callback called in
> startup.c to check if the flag is up or not. Now the patch is doing
> things the opposite way: like on HEAD, store the flag in startup.c but
> switch it at will with the routines in startup.c. I find the approach
> of the patch a bit more intuitive, TBH, as that makes the interface
> simpler for other recovery modules that may want to switch the flag
> back-and-forth, and I suspect that there may be cases in recovery
> modules where we'd still want to switch the flag, but not necessarily
> link it to system().

Hm. I don't know if we want to encourage further use of in_restore_command since it seems to be prone to misuse. Here's a v2 that demonstrates Tom's idea (bikeshedding on names and comments is welcome). I personally like this approach a bit more.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Feb 2, 2023 at 3:10 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Hm. I don't know if we want to encourage further use of
> in_restore_command since it seems to be prone to misuse. Here's a v2 that
> demonstrates Tom's idea (bikeshedding on names and comments is welcome). I
> personally like this approach a bit more.

+	/*
+	 * When exitOnSigterm is set and we are in the startup process, use the
+	 * special wrapper for system() that enables exiting immediately upon
+	 * receiving SIGTERM.  This ensures we can break out of system() if
+	 * required.
+	 */

This comment, for me, raises more questions than it answers. Why do we only do this in the startup process?

Also, and this part is not the fault of this patch but a defect of the pre-existing comments, under what circumstances do we not want to exit when we get a SIGTERM? It's standard behavior for PostgreSQL backends to exit when they receive SIGTERM, so the question isn't why we sometimes exit immediately but why we ever don't. The existing code calls ExecuteRecoveryCommand with exitOnSigterm true in some cases and false in other cases, and AFAICS there are zero words of comments explaining the reasoning.

+	if (exitOnSigterm && MyBackendType == B_STARTUP)
+		rc = RunInterruptibleShellCommand(command);
+	else
+		rc = system(command);

And this looks like pure magic. I'm all in favor of not relying on system(), but using it under some opaque set of conditions and otherwise doing something else is not the way. At the very least this needs to be explained a whole lot better.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Feb 02, 2023 at 04:14:54PM -0500, Robert Haas wrote:
> +	/*
> +	 * When exitOnSigterm is set and we are in the startup process, use the
> +	 * special wrapper for system() that enables exiting immediately upon
> +	 * receiving SIGTERM.  This ensures we can break out of system() if
> +	 * required.
> +	 */
>
> This comment, for me, raises more questions than it answers. Why do we
> only do this in the startup process?

Currently, this functionality only exists in the startup process because it is only used for restore_command. More below...

> Also, and this part is not the fault of this patch but a defect of the
> pre-existing comments, under what circumstances do we not want to exit
> when we get a SIGTERM? It's standard behavior for PostgreSQL backends
> to exit when they receive SIGTERM, so the question isn't why we
> sometimes exit immediately but why we ever don't. The existing code
> calls ExecuteRecoveryCommand with exitOnSigterm true in some cases and
> false in other cases, and AFAICS there are zero words of comments
> explaining the reasoning.

I've been digging into the history here. This e-mail seems to have the most context [0]. IIUC this was intended to prevent "fast" shutdowns from escalating to "immediate" shutdowns because the restore command died unexpectedly. This doesn't apply to archive_cleanup_command because we don't FATAL if it dies unexpectedly. It seems like this idea should apply to recovery_end_command, too, but AFAICT it doesn't use the same approach. My guess is that this hasn't come up because it's less likely that both 1) recovery_end_command is used and 2) someone initiates shutdown while it is running.

BTW the relevant commits are cdd46c7 (added SIGTERM handling for restore_command), 9e403c2 (added recovery_end_command), and c21ac0b (added what is today called archive_cleanup_command).

> +	if (exitOnSigterm && MyBackendType == B_STARTUP)
> +		rc = RunInterruptibleShellCommand(command);
> +	else
> +		rc = system(command);
>
> And this looks like pure magic. I'm all in favor of not relying on
> system(), but using it under some opaque set of conditions and
> otherwise doing something else is not the way. At the very least this
> needs to be explained a whole lot better.

If we applied this exit-on-SIGTERM behavior to recovery_end_command, I think we could combine failOnSignal and exitOnSigterm into one flag, and then it might be a little easier to explain what is going on. In any case, I agree that this deserves a lengthy explanation, which I'll continue to work on.

[0] https://postgr.es/m/499047FE.9090407%40enterprisedb.com

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Feb 02, 2023 at 02:01:13PM -0800, Nathan Bossart wrote: > I've been digging into the history here. This e-mail seems to have the > most context [0]. IIUC this was intended to prevent "fast" shutdowns from > escalating to "immediate" shutdowns because the restore command died > unexpectedly. This doesn't apply to archive_cleanup_command because we > don't FATAL if it dies unexpectedly. It seems like this idea should apply > to recovery_end_command, too, but AFAICT it doesn't use the same approach. > My guess is that this hasn't come up because it's less likely that both 1) > recovery_end_command is used and 2) someone initiates shutdown while it is > running. Actually, this still doesn't really explain why we need to exit immediately in the SIGTERM handler for restore_command. We already have handling for when the command indicates it exited due to SIGTERM, so it should be no problem if the command receives it before the startup process. And HandleStartupProcInterrupts() should exit at an appropriate time after the startup process receives SIGTERM. My guess was that this is meant to allow breaking out of the system() call, but I don't understand why that's important here. Maybe we could just remove this exit-in-SIGTERM-handler business... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Feb 02, 2023 at 02:39:19PM -0800, Nathan Bossart wrote: > Maybe we could just > remove this exit-in-SIGTERM-handler business... I've spent some time testing this. It seems to work pretty well, but only if I keep the exit-on-SIGTERM logic in shell_restore(). Without that, I'm seeing delayed shutdowns, which I assume means HandleStartupProcInterrupts() isn't getting called (I'm still investigating this). In any case, the fact that shell_restore() exits if the command fails due to SIGTERM seems like an implementation detail that we won't necessarily want to rely on once recovery modules are available. In short, we seem to depend on the SIGTERM handling in RestoreArchivedFile() in order to be responsive to shutdown requests. One idea I have is to approximate the current behavior by simply checking for the shutdown_requested flag before and after executing restore_command. This seems to work as desired even if the exit-on-SIGTERM logic is removed from shell_restore(). Unless there is some reason to break out of system() (versus just waiting for the command to fail after it receives SIGTERM), I think this approach should suffice. I've attached a draft patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
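Nathan's before-and-after idea can be modeled in a few lines (a self-contained sketch, not the attached patch: the enum and all names here are made up, and proc_exit() is represented by a return value so the control flow can be shown in isolation):

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* set by a SIGTERM handler in the real startup process */
static volatile sig_atomic_t shutdown_requested;

typedef enum
{
    RESTORE_OK,
    RESTORE_FAIL,
    RESTORE_SHUTDOWN            /* the startup process would proc_exit(1) */
} RestoreResult;

/*
 * Bracket the (potentially slow) restore callback with explicit shutdown
 * checks, instead of relying on the SIGTERM handler breaking out of
 * system() while the command runs.
 */
static RestoreResult
restore_with_shutdown_checks(bool (*restore_cb) (const char *file),
                             const char *file)
{
    if (shutdown_requested)
        return RESTORE_SHUTDOWN;

    bool        ok = restore_cb(file);

    if (shutdown_requested)
        return RESTORE_SHUTDOWN;

    return ok ? RESTORE_OK : RESTORE_FAIL;
}

/* trivial stand-in for a recovery module's restore callback */
static bool
always_succeeds(const char *file)
{
    (void) file;
    return true;
}
```

The point is that the checks apply regardless of how the callback is implemented, so they would keep working for recovery modules that never call system() at all.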
Hi, On 2023-02-02 10:23:21 +0900, Michael Paquier wrote: > On Wed, Feb 01, 2023 at 10:18:27AM -0800, Andres Freund wrote: > > On 2023-02-01 12:27:19 -0500, Tom Lane wrote: > >> Andres Freund <andres@anarazel.de> writes: > >> The main thing that system() brings to the table is platform-specific > >> knowledge of where the shell is. I'm not very sure that we want to > >> wire in "/bin/sh". > > > > We seem to be doing OK with using SHELLPROG in pg_regress, which just > > seems to be using $SHELL from the build environment. > > It looks like this had better centralize a bit more of the logic from > pg_regress.c if that were to happen, to avoid more fuzzy logic with > WIN32. That becomes invasive for a back-patch. I don't think we should consider backpatching such a change. There's enough subtlety that I'd want to see it bake for some time. > By the way, there is something that's itching me a bit here. 9a740f8 > has enlarged by a lot the window between PreRestoreCommand() and > PostRestoreCommand(), however curculio has reported a failure on > REL_15_STABLE, where we only manipulate my_wait_event_info while the > flag is on. Or I am getting that right that there is no way out of it > unless we remove the dependency to system() even in the back-branches? > Could there be an extra missing piece here? Yea, that's indeed odd. Ugh, I think I might understand what's happening: The signal arrives just after the fork() (within system()). Because we have all our processes configure themselves as process group leaders, and we signal the entire process group (c.f. signal_child()), both the child process and the parent will process the signal. So we'll end up doing a proc_exit() in both. As both are trying to remove themselves from the same PGPROC etc entry, that doesn't end well. I don't see how we can solve that properly as long as we use system(). 
A workaround for the back branches could be to have a test in StartupProcShutdownHandler() that tests if MyProcPid == getpid(), and not do the proc_exit() if they don't match. We probably should just do an _exit() in that case. I doubt the idea to signal the entire process group in signal_child() is good. I regularly see core dumps of archive commands because we sent SIGQUIT during an immediate shutdown, and of course cp etc don't have a SIGQUIT handler, and the default action is to core dump. But a replacement for it is not a small amount of work. While a subprocess is running, we can't just handle SIGQUIT with _exit(); we need to first signal the child with something appropriate (SIGKILL?). OTOH, the current approach only works on systems with setsid(2) support, so we probably shouldn't rely so hard on it anyway. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Ugh, I think I might understand what's happening: > The signal arrives just after the fork() (within system()). Because we > have all our processes configure themselves as process group leaders, > and we signal the entire process group (c.f. signal_child()), both the > child process and the parent will process the signal. So we'll end up > doing a proc_exit() in both. As both are trying to remove themselves > from the same PGPROC etc entry, that doesn't end well. Ugh ... > I don't see how we can solve that properly as long as we use system(). ... but I don't see how that's system()'s fault? Doing the fork() ourselves wouldn't change anything about that. > A workaround for the back branches could be to have a test in > StartupProcShutdownHandler() that tests if MyProcPid == getpid(), and > not do the proc_exit() if they don't match. We probably should just do > an _exit() in that case. Might work. > OTOH, the current approach only works on systems with setsid(2) support, > so we probably shouldn't rely so hard on it anyway. setsid(2) is required since SUSv2, so I'm not sure which systems are of concern here ... other than Redmond's of course. regards, tom lane
On Fri, Feb 3, 2023 at 8:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@anarazel.de> writes: > > Ugh, I think I might understand what's happening: > > > The signal arrives just after the fork() (within system()). Because we > > have all our processes configure themselves as process group leaders, > > and we signal the entire process group (c.f. signal_child()), both the > > child process and the parent will process the signal. So we'll end up > > doing a proc_exit() in both. As both are trying to remove themselves > > from the same PGPROC etc entry, that doesn't end well. > > Ugh ... Yuck, but yeah that makes sense. > > I don't see how we can solve that properly as long as we use system(). > > ... but I don't see how that's system()'s fault? Doing the fork() > ourselves wouldn't change anything about that. What if we block signals, fork, then in the child, install the default SIGTERM handler, then unblock, and then exec the shell? If SIGTERM is delivered either before or after exec (but before whatever is loaded installs a new handler) then the child is terminated, but without running the handler. Isn't that what we want here?
Hi, On 2023-02-02 14:39:19 -0800, Nathan Bossart wrote: > Actually, this still doesn't really explain why we need to exit immediately > in the SIGTERM handler for restore_command. We already have handling for > when the command indicates it exited due to SIGTERM, so it should be no > problem if the command receives it before the startup process. And > HandleStartupProcInterrupts() should exit at an appropriate time after the > startup process receives SIGTERM. > My guess was that this is meant to allow breaking out of the system() call, > but I don't understand why that's important here. Maybe we could just > remove this exit-in-SIGTERM-handler business... I don't think you can, at least not easily. For one, we have no guarantee that the child process got a signal at all - we don't have a hard dependency on setsid(). And even if we have setsid(), there's no guarantee that the executed process reacts to SIGTERM and that the child didn't create its own process group (and thus isn't reached by the signal to the process group, sent in signal_child()). Greetings, Andres Freund
On Fri, Feb 3, 2023 at 8:35 PM Andres Freund <andres@anarazel.de> wrote: > we don't have a > hard dependency on setsid() FTR There are no Unixes without setsid()... HAVE_SETSID was only left in the tree because we were discussing whether to replace it with !defined(WIN32) or whether that somehow made things more confusing, but then while trying to figure out what to do about that, I noticed that Windows *does* have a near-equivalent thing, or IIRC several things like that, and that kinda stopped me in my tracks.
Hi, On 2023-02-03 02:24:03 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Ugh, I think I might understand what's happening: > > > The signal arrives just after the fork() (within system()). Because we > > have all our processes configure themselves as process group leaders, > > and we signal the entire process group (c.f. signal_child()), both the > > child process and the parent will process the signal. So we'll end up > > doing a proc_exit() in both. As both are trying to remove themselves > > from the same PGPROC etc entry, that doesn't end well. > > Ugh ... > > > I don't see how we can solve that properly as long as we use system(). > > ... but I don't see how that's system()'s fault? Doing the fork() > ourselves wouldn't change anything about that. If we did the fork ourselves, we'd temporarily change the signal mask before the fork() and reset it immediately in the parent, but not in the child. We can't do that with system(), because we don't get control back early enough - we'd just block signals for the entire duration of system(). I wonder if this shows a problem with the change in 14 to make pgarch.c be attached to shared memory. Before that it didn't have to worry about problems like the above in the archiver, but now we do. It's less severe than the startup process issue, because we don't have a comparable signal handler in pgarch, but still. I'm e.g. not sure that there aren't issues with procsignal_sigusr1_handler() or such executing in a forked process. > > A workaround for the back branches could be to have a test in > > StartupProcShutdownHandler() that tests if MyProcPid == getpid(), and > > not do the proc_exit() if they don't match. We probably should just do > > an _exit() in that case. > > Might work. I wonder if we should add code complaining loudly about such a mismatch to proc_exit(), in addition to handling it more silently in StartupProcShutdownHandler(). 
Also, an assertion in [Auxiliary]ProcKill that proc->xid == MyProcPid == getpid() seems like a good idea. > > OTOH, the current approach only works on systems with setsid(2) support, > > so we probably shouldn't rely so hard on it anyway. > > setsid(2) is required since SUSv2, so I'm not sure which systems > are of concern here ... other than Redmond's of course. I was thinking of windows, yes. Greetings, Andres Freund
Thomas Munro <thomas.munro@gmail.com> writes: > On Fri, Feb 3, 2023 at 8:35 PM Andres Freund <andres@anarazel.de> wrote: >> we don't have a hard dependency on setsid() > FTR There are no Unixes without setsid()... Yeah. What I just got done reading in SUSv2 (1997) is "Derived from the POSIX.1-1988 standard". We need not concern ourselves with any systems not having it. regards, tom lane
Andres Freund <andres@anarazel.de> writes: > On 2023-02-03 02:24:03 -0500, Tom Lane wrote: >> setsid(2) is required since SUSv2, so I'm not sure which systems >> are of concern here ... other than Redmond's of course. > I was thinking of windows, yes. But given the lack of fork(2), Windows requires a completely different solution anyway, no? regards, tom lane
Hi, On 2023-02-03 20:34:36 +1300, Thomas Munro wrote: > What if we block signals, fork, then in the child, install the default > SIGTERM handler, then unblock, and then exec the shell? Yep. I was momentarily wondering why we'd even need to unblock signals, but while exec (et al) reset the signal handler, they don't reset the mask... We could, for good measure, do PGSharedMemoryDetach() etc. But I don't think it's quite worth it if we're careful with signals. However ClosePostmasterPorts() might be a good idea? I think not doing it might cause issues like keeping the listen sockets alive after we shut down postmaster, preventing us from starting up again? Looks like PR_SET_PDEATHSIG isn't reset across an execve(). But that actually seems good? > If SIGTERM is delivered either before or after exec (but before > whatever is loaded installs a new handler) then the child is > terminated, but without running the handler. Isn't that what we want > here? Yep, I think so. Greetings, Andres Freund
Hi, On 2023-02-03 02:50:38 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2023-02-03 02:24:03 -0500, Tom Lane wrote: > >> setsid(2) is required since SUSv2, so I'm not sure which systems > >> are of concern here ... other than Redmond's of course. > > > I was thinking of windows, yes. > > But given the lack of fork(2), Windows requires a completely > different solution anyway, no? Not sure it needs to be that different. I think what we basically want is: 1) Something vaguely popen() shaped that starts a subprocess, while being careful about signal handlers, returning the pid of the child process. Not sure if we want to redirect stdout/stderr or not. Probably not? 2) A blocking wrapper around 1) that takes care to forward fatal signals to the subprocess, including in the SIGQUIT case and probably being interruptible with query cancels etc in the relevant process types. Thinking about popen() suggests that we have a similar problem with COPY FROM PROGRAM as we have in pgarch (i.e. not as bad as the startup process issue, but still not great, due to procsignal_sigusr1_handler()). What's worse, the problem exists for untrusted PLs as well, and obviously we can't ensure that signals are correctly masked there. This seems to suggest that we ought to install a MyProcPid != getpid() like defense in all our signal handlers... Greetings, Andres Freund
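The blocking wrapper in 2) might look roughly like this (hypothetical names, and only the signal-forwarding core; a real version would also handle query cancels and eventually escalate to SIGKILL, as noted):

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* a flag-setting handler, installed by the caller, records fatal signals */
static volatile sig_atomic_t fatal_signal_pending;

static void
record_fatal_signal(int signo)
{
    fatal_signal_pending = signo;
}

/*
 * Wait for the child started by 1), forwarding any recorded fatal signal
 * to it instead of exiting out from under it.  Returns the wait status,
 * or -1 on an unexpected waitpid() failure.
 */
static int
wait_forwarding_signals(pid_t child)
{
    int         status;

    for (;;)
    {
        if (fatal_signal_pending)
        {
            kill(child, fatal_signal_pending);
            fatal_signal_pending = 0;
        }
        if (waitpid(child, &status, 0) == child)
            return status;
        if (errno != EINTR)
            return -1;
    }
}
```

Because the parent never exits while the child runs, the PGPROC cleanup only ever happens in the process that owns it, sidestepping the double-proc_exit() problem.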
On Fri, Feb 3, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote: > Thinking about popen() suggests that we have a similar problem with COPY > FROM PROGRAM as we have in pgarch (i.e. not as bad as the startup > process issue, but still not great, due to > procsignal_sigusr1_handler()). A small mercy: while we promote some kinds of fatal-ish signals to group level with kill(-PID, ...), we don't do that for SIGUSR1 for latches or procsignals.
Hi, On February 3, 2023 9:19:23 AM GMT+01:00, Thomas Munro <thomas.munro@gmail.com> wrote: >On Fri, Feb 3, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote: >> Thinking about popen() suggests that we have a similar problem with COPY >> FROM PROGRAM as we have in pgarch (i.e. not as bad as the startup >> process issue, but still not great, due to >> procsignal_sigusr1_handler()). > >A small mercy: while we promote some kinds of fatal-ish signals to >group level with kill(-PID, ...), we don't do that for SIGUSR1 for >latches or procsignals. Not as bad, but we still do SetLatch() from a bunch of places that would be reached... Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Feb 02, 2023 at 11:44:22PM -0800, Andres Freund wrote: > On 2023-02-03 02:24:03 -0500, Tom Lane wrote: >> Andres Freund <andres@anarazel.de> writes: >> > A workaround for the back branches could be to have a test in >> > StartupProcShutdownHandler() that tests if MyProcPid == getpid(), and >> > not do the proc_exit() if they don't match. We probably should just do >> > an _exit() in that case. >> >> Might work. > > I wonder if we should add code complaining loudly about such a mismatch > to proc_exit(), in addition to handling it more silently in > StartupProcShutdownHandler(). Also, an assertion in > [Auxiliary]ProcKill that proc->xid == MyProcPid == getpid() seems like a > good idea. From the discussion, it sounds like we don't want to depend on the child process receiving/handling the signal, so we can't get rid of the break-out-of-system() behavior (at least not in back-branches). I've put together some work-in-progress patches for the stopgap/back-branch fix. 0001 is just v1-0001 from upthread. This moves Pre/PostRestoreCommand to surround only the call to system(). I think this should get us closer to pre-v15 behavior. 0002 adds the getpid() check mentioned above to StartupProcShutdownHandler(), and it adds assertions to proc_exit() and [Auxiliary]ProcKill(). 0003 adds checks for shutdown requests before and after the call to shell_restore(). IMO the Pre/PostRestoreCommand stuff is an implementation detail for restore_command, so I think it behooves us to have some additional shutdown checks that apply even for recovery modules. This patch could probably be moved to the recovery modules thread. Is this somewhat close to what folks had in mind? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Feb 03, 2023 at 10:54:17AM -0800, Nathan Bossart wrote: > 0001 is just v1-0001 from upthread. This moves Pre/PostRestoreCommand to > surround only the call to system(). I think this should get us closer to > pre-v15 behavior. + if (exitOnSigterm) + PreRestoreCommand(); + rc = system(command); + + if (exitOnSigterm) + PostRestoreCommand(); I don't really want to let that hang around on HEAD much longer, so I'm OK to do that for HEAD, then figure out what needs to be done for the older issue at hand. + /* + * PreRestoreCommand() is used to tell the SIGTERM handler for the startup + * process that it is okay to proc_exit() right away on SIGTERM. This is + * done for the duration of the system() call because there isn't a good + * way to break out while it is executing. Since we might call proc_exit() + * in a signal handler here, it is extremely important that nothing but the + * system() call happens between the calls to PreRestoreCommand() and + * PostRestoreCommand(). Any additional code must go before or after this + * section. + */ Still, it seems to me that the large comment block in shell_restore() ought to be moved to ExecuteRecoveryCommand(), no? The assumptions under which one can use exitOnSigterm and failOnSignal could be completed in the header of the function based on that. -- Michael
Hi, On 2023-02-03 10:54:17 -0800, Nathan Bossart wrote: > @@ -146,7 +146,25 @@ ExecuteRecoveryCommand(const char *command, const char *commandName, > */ > fflush(NULL); > pgstat_report_wait_start(wait_event_info); > + > + /* > + * PreRestoreCommand() is used to tell the SIGTERM handler for the startup > + * process that it is okay to proc_exit() right away on SIGTERM. This is > + * done for the duration of the system() call because there isn't a good > + * way to break out while it is executing. Since we might call proc_exit() > + * in a signal handler here, it is extremely important that nothing but the > + * system() call happens between the calls to PreRestoreCommand() and > + * PostRestoreCommand(). Any additional code must go before or after this > + * section. > + */ > + if (exitOnSigterm) > + PreRestoreCommand(); > + > rc = system(command); > + > + if (exitOnSigterm) > + PostRestoreCommand(); > + > pgstat_report_wait_end(); > > if (rc != 0) It's somewhat weird that we now call the startup-process specific PreRestoreCommand/PostRestoreCommand() in other processes than the startup process. Gated on a variable that's not immediately obviously tied to being in the startup process. I think at least we ought to add comments to PreRestoreCommand / PostRestoreCommand that they need to be robust against being called outside of the startup process, and similarly a comment in ExecuteRecoveryCommand(), explaining that all this stuff just works in the startup process. 
> diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c > index bcd23542f1..503eb1a5a6 100644 > --- a/src/backend/postmaster/startup.c > +++ b/src/backend/postmaster/startup.c > @@ -19,6 +19,8 @@ > */ > #include "postgres.h" > > +#include <unistd.h> > + > #include "access/xlog.h" > #include "access/xlogrecovery.h" > #include "access/xlogutils.h" > @@ -121,7 +123,17 @@ StartupProcShutdownHandler(SIGNAL_ARGS) > int save_errno = errno; > > if (in_restore_command) > - proc_exit(1); > + { > + /* > + * If we are in a child process (e.g., forked by system() in > + * shell_restore()), we don't want to call any exit callbacks. The > + * parent will take care of that. > + */ > + if (MyProcPid == (int) getpid()) > + proc_exit(1); > + else > + _exit(1); I think it might be worth adding something like const char msg[] = "StartupProcShutdownHandler() called in child process"; write(STDERR_FILENO, msg, sizeof(msg)); to this path. Otherwise it might end up being a very hard to debug path. Greetings, Andres Freund
Hi, On 2023-02-02 11:06:19 +0900, Michael Paquier wrote: > > - the error message for a failed restore command seems to have gotten > > worse: > > could not restore file \"%s\" from archive: %s" > > -> > > "%s \"%s\": %s", commandName, command > > IMO, we don't lose any context with this method: the command type and > the command string itself are the bits actually relevant. Perhaps > something like that would be more intuitive? One idea: > "could not execute command for %s: %s", commandName, command We do - you now can't identify the filename that failed without parsing the command, which obviously isn't easily possible generically, because, well, it's user-configurable. I like that the command is logged now, but I think we need the filename to be at a predictable position in addition. > > - shell_* imo is not a good namespace for something called from xlog.c, > > xlogarchive.c. I realize the intention is that shell_archive.c is > > going to be its own "restore module", but for now it imo looks odd > > shell_restore.c does not sound that bad to me, FWIW. The parallel > with the archive counterparts is here. My recent history is not that > good when it comes to naming, based on the feedback I received, > though. I don't mind shell_restore.c much - after all, the filename is namespaced by the directory. However, I do mind function names like shell_restore(), that could also be about restoring what SHELL is set to, or whatever. And functions aren't namespaced in C. > > And there's just plenty of other stuff in 14bdb3f13de and 9a740f81eb0 that > > doesn't look right: > > - We now have two places open-coding what BuildRestoreCommand did > > Yeah, BuildRestoreCommand() was just a small wrapper on top of the new > percentrepl.c, making it rather irrelevant at this stage, IMO. For > the two code paths where it was called. I don't at all agree. Particularly because you didn't even leave a pointer in each of the places noting that if you update one, you also need to update the other.
I don't mind the amount of code it adds, I do mind that it, without any recognizable reason, implements policy in multiple places. > > - The comment moved out of RestoreArchivedFile() doesn't seems less > > useful at its new location > > We are talking about that: > - /* > - * Remember, we rollforward UNTIL the restore fails so failure here is > - * just part of the process... that makes it difficult to determine > - * whether the restore failed because there isn't an archive to restore, > - * or because the administrator has specified the restore program > - * incorrectly. We have to assume the former. > - * > - * However, if the failure was due to any sort of signal, it's best to > - * punt and abort recovery. (If we "return false" here, upper levels will > - * assume that recovery is complete and start up the database!) It's > - * essential to abort on child SIGINT and SIGQUIT, because per spec > - * system() ignores SIGINT and SIGQUIT while waiting; if we see one of > - * those it's a good bet we should have gotten it too. > - * > - * On SIGTERM, assume we have received a fast shutdown request, and exit > - * cleanly. It's pure chance whether we receive the SIGTERM first, or the > - * child process. If we receive it first, the signal handler will call > - * proc_exit, otherwise we do it here. If we or the child process received > - * SIGTERM for any other reason than a fast shutdown request, postmaster > - * will perform an immediate shutdown when it sees us exiting > - * unexpectedly. > - * > - * We treat hard shell errors such as "command not found" as fatal, too. > - */ > > The failure processing is stuck within the way we build and handle the > command given down to system(), so keeping this in shell_restore.c (or > whatever name you think would be a better fit) makes sense to me. 
> Now, thinking a bit more of this, we could just push the description > down to ExecuteRecoveryCommand(), that actually does the work, > adapting the comment based on the refactored internals of the > routine. Nothing in the shell_restore() comments explains / asserts that it is only ever to be called in the startup process. And outside of the startup process none of the above actually makes sense. That's kind of my problem with these changes. They try to introduce new abstraction layers, but don't provide real abstraction, because they're very tightly bound to the way the functions were called before the refactoring. And none of these restrictions are actually documented. Greetings, Andres Freund
On Sat, Feb 04, 2023 at 03:30:29AM -0800, Andres Freund wrote: > That's kind of my problem with these changes. They try to introduce new > abstraction layers, but don't provide real abstraction, because they're > very tightly bound to the way the functions were called before the > refactoring. And none of these restrictions are actually documented. Okay. Michael, why don't we revert the shell_restore stuff for now? Once the archive modules interface changes and the fix for this SIGTERM-during-system() problem are in, I will work through this feedback and give recovery modules another try. I'm still hoping to have recovery modules ready in time for the v16 feature freeze. My intent was to improve this code by refactoring and reducing code duplication, but I seem to have missed the mark. I am sorry. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Sat, Feb 04, 2023 at 10:03:54AM -0800, Nathan Bossart wrote: > Okay. Michael, why don't we revert the shell_restore stuff for now? Once > the archive modules interface changes and the fix for this > SIGTERM-during-system() problem are in, I will work through this feedback > and give recovery modules another try. I'm still hoping to have recovery > modules ready in time for the v16 feature freeze. Yes, at this stage a revert of the refactoring with shell_restore.c is the best path forward. From the discussion, I got the following things on top of my mind, for reference: - Should we include archive_cleanup_command into the recovery modules at all? We've discussed offloading that from the checkpointer, and it makes the failure handling trickier when it comes to unexpected GUC configurations, for one. The same may actually apply to recovery_end_command. Though it is done in the startup process now, there may be an argument to offload that somewhere else based on the timing of the end-of-recovery checkpoint. My opinion on this stuff is that only including restore_command in the modules would make most users I know of happy enough, as it removes the overhead of the command invocation from the startup process when replay is fast enough that the restore command is the bottleneck. recovery_end_command would be simple enough, but if there is a wish to redesign the startup process to offload it somewhere else, then the recovery module makes backward-compatibility concerns harder to think about in the long-term. - Do we need to reconsider the assumptions of the startup process where SIGTERM enforces an immediate shutdown while running system() for the restore command? For example, the difference of behavior when a restore_command uses a system sleep() that does not react to signals, from what I recall?
- Fixing the original issue of this thread may end up impacting what you are trying to do in this area, so fixing the original issue first sounds like a prerequisite to me, because it may impact the final design of the modules and their callbacks. (I have not looked at all the arguments raised about what to do with ~15; still, it does not look like we have a clear picture here yet.) -- Michael
Hi, On 2023-02-04 10:03:54 -0800, Nathan Bossart wrote: > On Sat, Feb 04, 2023 at 03:30:29AM -0800, Andres Freund wrote: > > That's kind of my problem with these changes. They try to introduce new > > abstraction layers, but don't provide real abstraction, because they're > > very tightly bound to the way the functions were called before the > > refactoring. And none of these restrictions are actually documented. > > Okay. Michael, why don't we revert the shell_restore stuff for now? Once > the archive modules interface changes and the fix for this > SIGTERM-during-system() problem are in, I will work through this feedback > and give recovery modules another try. I'm still hoping to have recovery > modules ready in time for the v16 feature freeze. > > My intent was to improve this code by refactoring and reducing code > duplication, but I seem to have missed the mark. I am sorry. FWIW, I think the patches were going roughly the right direction, they just need a bit more work. I don't think we should expose the proc_exit() hack, and its supporting infrastructure, to the pluggable *_command logic. It's bad enough as-is, but having to do this stuff within an extension module seems likely to end badly. There's just way too much action-at-a-distance. I think Thomas has been hacking on an interruptible system() replacement. With that, a lot of this ugliness would be resolved. Greetings, Andres Freund
On Sun, Feb 05, 2023 at 09:49:57AM +0900, Michael Paquier wrote: > - Should we include archive_cleanup_command into the recovery modules > at all? We've discussed offloading that from the checkpointer, and it > makes the failure handling trickier when it comes to unexpected GUC > configurations, for one. The same may actually apply to > restore_end_command. Though it is done in the startup process now, > there may be an argument to offload that somewhere else based on the > timing of the end-of-recovery checkpoint. My opinion on this stuff is > that only including restore_command in the modules would make most > users I know of happy enough as it removes the overhead of the command > invocation from the startup process, if able to replay things fast > enough so as the restore command is the bottleneck. > restore_end_command would be simple enough, but if there is a wish to > redesign the startup process to offload it somewhere else, then the > recovery module makes backward-compatibility concerns harder to think > about in the long-term. I agree. I think we ought to first focus on getting the recovery modules interface and restore_command functionality in place before we take on more difficult things like archive_cleanup_command. But I still think the archive_cleanup_command/recovery_end_command functionality should eventually be added to recovery modules. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2023-02-05 14:19:38 -0800, Nathan Bossart wrote: > On Sun, Feb 05, 2023 at 09:49:57AM +0900, Michael Paquier wrote: > > - Should we include archive_cleanup_command into the recovery modules > > at all? We've discussed offloading that from the checkpointer, and it > > makes the failure handling trickier when it comes to unexpected GUC > > configurations, for one. The same may actually apply to > > recovery_end_command. Though it is done in the startup process now, > > there may be an argument to offload that somewhere else based on the > > timing of the end-of-recovery checkpoint. My opinion on this stuff is > > that only including restore_command in the modules would make most > > users I know of happy enough as it removes the overhead of the command > > invocation from the startup process, if able to replay things fast > > enough so as the restore command is the bottleneck. > > recovery_end_command would be simple enough, but if there is a wish to > > redesign the startup process to offload it somewhere else, then the > > recovery module makes backward-compatibility concerns harder to think > > about in the long-term. > > I agree. I think we ought to first focus on getting the recovery modules > interface and restore_command functionality in place before we take on more > difficult things like archive_cleanup_command. But I still think the > archive_cleanup_command/recovery_end_command functionality should > eventually be added to recovery modules. I tend not to agree. If you make the API that small, you're IME likely to end up with something that looks somewhat incoherent once extended. The more I think about it, the less I am convinced that one-callback-per-segment, invoked just before needing the file, is the right approach to address the performance issues of restore_command. The main performance issue isn't the shell invocation overhead, it's synchronously needing to restore the archive before replay can continue.
It's also gonna be slow if a restore module copies the segment from a remote system - the latency is the problem. The only way the restore module approach can do better is to asynchronously restore ahead of the current segment. But for that the API really isn't suited well. The signature of the relevant callback is: > +typedef bool (*RecoveryRestoreCB) (const char *file, const char *path, > + const char *lastRestartPointFileName); That's not very suited to restoring "ahead of time". You need to parse file to figure out whether a segment or something else is restored, turn "file" back into an LSN, figure out where to store further segments, somehow hand off to some background worker, etc. That doesn't strike me as something we want to happen inside multiple restore libraries. I think at the very least you'd want to have a separate callback for restoring segments than for restoring other files. But more likely a separate callback for each type of file to be restored. For the timeline history case a parameter indicating that we don't want to restore the file, just to see if there's a conflict, would make sense. For the segment files, we'd likely need a parameter to indicate whether the restore is random or not. Greetings, Andres Freund
On Sun, Feb 05, 2023 at 09:49:57AM +0900, Michael Paquier wrote: > Yes, at this stage a revert of the refactoring with shell_restore.c is > the best path forward. Done that now, as of 2f6e15a. -- Michael
On Sun, Feb 05, 2023 at 03:01:57PM -0800, Andres Freund wrote: > I think at the very least you'd want to have a separate callback for > restoring segments than for restoring other files. But more likely a > separate callback for each type of file to be restored. > > For the timeline history case an parameter indicating that we don't want > to restore the file, just to see if there's a conflict, would make > sense. That seems reasonable. > For the segment files, we'd likely need a parameter to indicate whether > the restore is random or not. Wouldn't this approach still require each module to handle restoring ahead of time? I agree that the shell overhead isn't the main performance issue, but it's unclear to me how much of this should be baked into PostgreSQL. I mean, we could introduce a GUC that tells us how far ahead to restore and have a background worker (or multiple background workers) asynchronously pull files into a staging directory via the callbacks. Is that the sort of scope you are envisioning? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2023-02-05 15:57:47 -0800, Nathan Bossart wrote: > > For the segment files, we'd likely need a parameter to indicate whether > > the restore is random or not. > > Wouldn't this approach still require each module to handle restoring ahead > of time? Yes, to some degree at least. I was just describing a few pretty obvious improvements. The core code can make that a lot easier though. The problem of where to store such files can be provided by core code (presumably a separate directory). A GUC for aggressiveness can be provided. Etc. > I agree that the shell overhead isn't the main performance issue, > but it's unclear to me how much of this should be baked into > PostgreSQL. I don't know fully either. But just reimplementing all of it in different modules doesn't seem like a sane approach either. A lot of it is policy that we need to solve once, centrally. > I mean, we could introduce a GUC that tells us how far ahead to > restore and have a background worker (or multiple background workers) > asynchronously pull files into a staging directory via the callbacks. > Is that the sort of scope you are envisioning? Closer, at least. Greetings, Andres Freund
On Sun, Feb 05, 2023 at 04:07:50PM -0800, Andres Freund wrote: > On 2023-02-05 15:57:47 -0800, Nathan Bossart wrote: >> I agree that the shell overhead isn't the main performance issue, >> but it's unclear to me how much of this should be baked into >> PostgreSQL. > > I don't know fully either. But just reimplementing all of it in > different modules doesn't seem like a sane approach either. A lot of it > is policy that we need to solve once, centrally. > >> I mean, we could introduce a GUC that tells us how far ahead to >> restore and have a background worker (or multiple background workers) >> asynchronously pull files into a staging directory via the callbacks. >> Is that the sort of scope you are envisioning? > > Closer, at least. Got it. I suspect we'll want to do something similar for archive modules eventually, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Sun, Feb 5, 2023 at 7:46 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > Got it. I suspect we'll want to do something similar for archive modules > eventually, too. +1. I felt like the archive modules work was a step forward when we did it, because basic_archive does some things that you're not likely to get right if you do it on your own. And a similar approach to restore_command might also be valuable, at least in my opinion. However, the gains that we can get out of the archive module facility in its present form do seem to be somewhat limited, for exactly the kinds of reasons being discussed here. I kind of wonder whether we ought to try to flip the model around. At present, the idea is that the archiver is doing its thing and it makes callbacks into the archive module. But what if we got rid of the archiver main loop altogether and put the main loop inside of the archive module, and have it call back to some functions that we provide? One function could be like char *pgarch_next_file_to_be_archived_if_there_is_one_ready(void) and the other could be like void pgarch_some_file_that_you_gave_me_previously_is_now_fully_archived(char *which_one). That way, we'd break the tight coupling where you have to get a unit of work and perform it in full before you can get the next unit of work. Some variant of this could work on the restore side, too, I think, although we have less certainty about how much sense it makes to prefetch for restore than we do about what needs to be archived. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Feb 08, 2023 at 10:22:24AM -0500, Robert Haas wrote: > I felt like the archive modules work was a step forward when we did, > because basic_archive does some things that you're not likely to get > right if you do it on your own. And a similar approach to > restore_command might be also be valuable, at least in my opinion. > However, the gains that we can get out of the archive module facility > in its present form do seem to be somewhat limited, for exactly the > kinds of reasons being discussed here. I'm glad to hear that there is interest in taking this stuff to the next level. I'm currently planning to first get the basic API in place for recovery modules like we have for archive modules, but I'm hoping to position it so that it leads naturally to asynchronous, parallel, and/or batching approaches down the road (v17?). > I kind of wonder whether we ought to try to flip the model around. At > present, the idea is that the archiver is doing its thing and it makes > callbacks into the archive module. But what if we got rid of the > archiver main loop altogether and put the main loop inside of the > archive module, and have it call back to some functions that we > provide? One function could be like char > *pgarch_next_file_to_be_archived_if_there_is_one_ready(void) and the > other could be like void > pgarch_some_file_that_you_gave_me_previously_is_now_fully_archived(char > *which_one). That way, we'd break the tight coupling where you have to > get a unit of work and perform it in full before you can get the next > unit of work. Some variant of this could work on the restore side, > too, I think, although we have less certainty about how much it makes > to prefetch for restore than we do about what needs to be archived. I think this could be a good approach if we decide not to bake too much into PostgreSQL itself (e.g., such as creating multiple archive workers that each call out to the module). 
Archive module authors would effectively need to write their own archiver processes. That sounds super flexible, but it also sounds like it might be harder to get right. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Feb 8, 2023 at 12:43 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > I think this could be a good approach if we decide not to bake too much > into PostgreSQL itself (e.g., such as creating multiple archive workers > that each call out to the module). Archive module authors would > effectively need to write their own archiver processes. That sounds super > flexible, but it also sounds like it might be harder to get right. Yep. That's a problem, and I'm certainly open to better ideas. However, if we assume that the archive module is likely to be doing something like juggling a bunch of file descriptors over which it is speaking HTTP, what other model works, really? It might be juggling those file descriptors indirectly, or it might be relying on an intermediate library like curl or something from Amazon that talks to S3 or whatever, but only it knows what resources it's juggling, or what functions it needs to call to manage them. On the other hand, we don't really need a lot from it. We need it to CHECK_FOR_INTERRUPTS() and handle that without leaking resources or breaking the world in some way, and we sort of need it to, you know, actually archive stuff, but apart from that I guess it can do what it likes (unless I'm missing some other important function of the archiver?). It's probably a good idea if the archiver function returns when it's fully caught up and there's no more work to do. Then we could handle decisions about hibernation in the core code, rather than having every archive module invent its own way of doing that. But when there's work happening, as far as I can see, the archive module needs to have control pretty nearly all the time, or it's not going to be able to do anything clever. Always happy to hear if you see it differently.... -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Feb 08, 2023 at 04:24:15PM -0500, Robert Haas wrote: > On Wed, Feb 8, 2023 at 12:43 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> I think this could be a good approach if we decide not to bake too much >> into PostgreSQL itself (e.g., such as creating multiple archive workers >> that each call out to the module). Archive module authors would >> effectively need to write their own archiver processes. That sounds super >> flexible, but it also sounds like it might be harder to get right. > > Yep. That's a problem, and I'm certainly open to better ideas. > > However, if we assume that the archive module is likely to be doing > something like juggling a bunch of file descriptors over which it is > speaking HTTP, what other model works, really? It might be juggling > those file descriptors indirectly, or it might be relying on an > intermediate library like curl or something from Amazon that talks to > S3 or whatever, but only it knows what resources it's juggling, or > what functions it needs to call to manage them. On the other hand, we > don't really need a lot from it. We need it to CHECK_FOR_INTERRUPTS() > and handle that without leaking resources or breaking the world in > some way, and we sort of need it to, you know, actually archive stuff, > but apart from that I guess it can do what it likes (unless I'm > missing some other important function of the archiver?). > > It's probably a good idea if the archiver function returns when it's > fully caught up and there's no more work to do. Then we could handle > decisions about hibernation in the core code, rather than having every > archive module invent its own way of doing that. But when there's work > happening, as far as I can see, the archive module needs to have > control pretty nearly all the time, or it's not going to be able to do > anything clever. > > Always happy to hear if you see it differently.... These are all good points. 
Perhaps there could be a base archiver implementation that shell_archive uses (and that other modules could use if desired, which might be important for backward compatibility with the existing callbacks). But if you want to do something fancier than archiving sequentially, you could write your own. In any case, I'm not really wedded to any particular approach at the moment, so I am likewise open to better ideas. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Feb 08, 2023 at 02:25:54PM -0800, Nathan Bossart wrote: > These are all good points. Perhaps there could be a base archiver > implementation that shell_archive uses (and that other modules could use if > desired, which might be important for backward compatibility with the > existing callbacks). But if you want to do something fancier than > archiving sequentially, you could write your own. Which is basically the kind of things you can already achieve with a background worker and a module of your own? > In any case, I'm not really wedded to any particular approach at the > moment, so I am likewise open to better ideas. I am not sure how much we should try to move from core into the modules when it comes to the current archiver process, with how much control you'd like to give to users. It also looks like to me that this is the kind of problem where we would not have the correct callback design until someone comes in and develops a solution that would shape around it. On top of that, this is the kind of things achievable with just a bgworker, and perhaps simpler as the parallel state can just be maintained in it, which is what the archiver process is about at the end? Or were there some restrictions in the bgworker APIs that would not fit with what an archiver process should do? Perhaps these restrictions, if any, are what we'd better try to lift first? -- Michael
On Thu, Feb 09, 2023 at 08:56:24AM +0900, Michael Paquier wrote: > On Wed, Feb 08, 2023 at 02:25:54PM -0800, Nathan Bossart wrote: >> These are all good points. Perhaps there could be a base archiver >> implementation that shell_archive uses (and that other modules could use if >> desired, which might be important for backward compatibility with the >> existing callbacks). But if you want to do something fancier than >> archiving sequentially, you could write your own. > > Which is basically the kind of things you can already achieve with a > background worker and a module of your own? IMO one of the big pieces that's missing is a way to get the next N files to archive. Right now, you'd have to trawl through archive_status on your own if you wanted to batch/parallelize. I think one advantage of what Robert is suggesting is that we could easily provide a supported way to get the next set of files to archive, and we can asynchronously mark them "done". Otherwise, each module has to implement this. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Feb 8, 2023 at 7:24 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Thu, Feb 09, 2023 at 08:56:24AM +0900, Michael Paquier wrote: > > On Wed, Feb 08, 2023 at 02:25:54PM -0800, Nathan Bossart wrote: > >> These are all good points. Perhaps there could be a base archiver > >> implementation that shell_archive uses (and that other modules could use if > >> desired, which might be important for backward compatibility with the > >> existing callbacks). But if you want to do something fancier than > >> archiving sequentially, you could write your own. > > > > Which is basically the kind of things you can already achieve with a > > background worker and a module of your own? > > IMO one of the big pieces that's missing is a way to get the next N files > to archive. Right now, you'd have to trawl through archive_status on your > own if you wanted to batch/parallelize. I think one advantage of what > Robert is suggesting is that we could easily provide a supported way to get > the next set of files to archive, and we can asynchronously mark them > "done". Otherwise, each module has to implement this. Right. I think that we could certainly, as Michael suggests, have people provide their own background worker rather than having the archiver invoke the user-supplied code directly. As long as the functions that you need in order to get the necessary information can be called from some other process, that's fine. The only difficulty I see is that if the archiving is happening from a separate background worker rather than from the archiver, then what is the archiver doing? We could somehow arrange to not run the archiver process at all, or I guess to just sit there and have it do nothing. Or, we can decide not to have a separate background worker and just have the archiver call the user-supplied code directly. I kind of like that approach at the moment; it seems more elegant to me. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > I think that we could certainly, as Michael suggests, have people > provide their own background worker rather than having the archiver > invoke the user-supplied code directly. As long as the functions that > you need in order to get the necessary information can be called from > some other process, that's fine. The only difficulty I see is that if > the archiving is happening from a separate background worker rather > than from the archiver, then what is the archiver doing? We could > somehow arrange to not run the archiver process at all, or I guess to > just sit there and have it do nothing. Or, we can decide not to have a > separate background worker and just have the archiver call the > user-supplied code directly. I kind of like that approach at the > moment; it seems more elegant to me. I'm fairly concerned about the idea of making it common for people to write their own main loop for the archiver. That means that, if we have a bug fix that requires the archiver to do X, we will not just be patching our own code but trying to get an indeterminate set of third parties to add the fix to their code. If we think we need primitives to let the archiver hooks get all the pending files, or whatever, by all means add those. But don't cede fundamental control of the archiver. The hooks need to be decoration on a framework we provide, not the framework themselves. regards, tom lane
On Thu, Feb 9, 2023 at 10:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm fairly concerned about the idea of making it common for people > to write their own main loop for the archiver. That means that, if > we have a bug fix that requires the archiver to do X, we will not > just be patching our own code but trying to get an indeterminate > set of third parties to add the fix to their code. I don't know what kind of bug we could really have in the main loop that would be common to every implementation. They're probably all going to check for interrupts, do some work, and then wait for I/O on some things by calling select() or some equivalent. But the work, and the wait for the I/O, would be different for every implementation. I would anticipate that the amount of common code would be nearly zero. Imagine two archive modules, one of which archives files via HTTP and the other of which archives them via SSH. They need to do a lot of the same things, but the code is going to be totally different. When the HTTP archiver module needs to open a new connection, it's going to call some libcurl function. When the SSH archiver module needs to do the same thing, it's going to call some libssh function. It seems quite likely that the HTTP implementation would want to juggle multiple connections in parallel, but the SSH implementation might not want to do that, or its logic for determining how many connections to open might be completely different based on the behavior of that protocol vs. the other protocol. Once either implementation has sent as much data as it can over the connections it has open, it needs to wait for those sockets to become write-ready or, possibly, read-ready. There again, each one will be calling into a different library to do that. It could be that in this particular case, both would be waiting for a set of file descriptors, and we could provide some framework for waiting on a set of file descriptors provided by the module.
But you could also have some other archiver implementation that is, say, waiting for a process to terminate rather than for a file descriptor to become ready for I/O. > If we think we need primitives to let the archiver hooks get all > the pending files, or whatever, by all means add those. But don't > cede fundamental control of the archiver. The hooks need to be > decoration on a framework we provide, not the framework themselves. I don't quite see how you can make asynchronous and parallel archiving work if the archiver process only calls into the archive module at times that it chooses. That would mean that the module has to return control to the archiver when it's in the middle of archiving one or more files -- and then I don't see how it can get control back at the appropriate time. Do you have a thought about that? -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Feb 09, 2023 at 11:12:21AM -0500, Robert Haas wrote: > On Thu, Feb 9, 2023 at 10:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If we think we need primitives to let the archiver hooks get all >> the pending files, or whatever, by all means add those. But don't >> cede fundamental control of the archiver. The hooks need to be >> decoration on a framework we provide, not the framework themselves. > > I don't quite see how you can make asynchronous and parallel archiving > work if the archiver process only calls into the archive module at > times that it chooses. That would mean that the module has to return > control to the archiver when it's in the middle of archiving one or > more files -- and then I don't see how it can get control back at the > appropriate time. Do you have a thought about that? I've been thinking about this, actually. I'm wondering if we could provide a list of files to the archiving callback (configurable via a variable in ArchiveModuleState), and then have the callback return a list of files that are archived. (Or maybe we just put the list of files that need archiving in ArchiveModuleState.) The returned list could include files that were sent to the callback previously. The archive module would be responsible for creating background worker(s) (if desired), dispatching files to-be-archived to its background worker(s), and gathering the list of archived files to return. This is admittedly half-formed, but I'm tempted to hack something together quickly to see whether it might be viable. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2023-02-09 11:12:21 -0500, Robert Haas wrote: > On Thu, Feb 9, 2023 at 10:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I'm fairly concerned about the idea of making it common for people > > to write their own main loop for the archiver. That means that, if > > we have a bug fix that requires the archiver to do X, we will not > > just be patching our own code but trying to get an indeterminate > > set of third parties to add the fix to their code. I'm somewhat concerned about that too, but perhaps from a different angle. First, I think we don't do our users a service by defaulting the in-core implementation to something that doesn't scale to even a moderately busy server. Second, I doubt we'll get the API for any of this right, without an actual user that does something more complicated than restoring one-by-one in a blocking manner. > I don't know what kind of bug we could really have in the main loop > that would be common to every implementation. They're probably all > going to check for interrupts, do some work, and then wait for I/O on > some things by calling select() or some equivalent. But the work, and > the wait for the I/O, would be different for every implementation. I > would anticipate that the amount of common code would be nearly zero. I don't think it's that hard to imagine problems. To be reasonably fast, a decent restore implementation will have to 'restore ahead'. Which also provides ample things to go wrong. E.g. - WAL source is switched, restore module needs to react to that, but doesn't, we end up with lots of wasted work, or worse, filename conflicts - recovery follows a timeline, restore module doesn't catch on quickly enough - end of recovery happens, restore just continues on > > If we think we need primitives to let the archiver hooks get all > > the pending files, or whatever, by all means add those. But don't > > cede fundamental control of the archiver.
> > The hooks need to be decoration on a framework we provide, not the framework themselves. > I don't quite see how you can make asynchronous and parallel archiving > work if the archiver process only calls into the archive module at > times that it chooses. That would mean that the module has to return > control to the archiver when it's in the middle of archiving one or > more files -- and then I don't see how it can get control back at the > appropriate time. Do you have a thought about that? I don't think archiver is the hard part, that already has a dedicated process, and it also has something of a queuing system already. The startup process imo is the complicated one... If we had a 'restorer' process, startup fed some sort of a queue with things to restore in the near future, it might be more realistic to do something you describe? Greetings, Andres Freund
Here is a new version of the stopgap/back-branch fix for restore_command. This is more or less a rebased version of v4 with an added stderr message as Andres suggested upthread. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Feb 9, 2023 at 10:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > I've been thinking about this, actually. I'm wondering if we could provide > a list of files to the archiving callback (configurable via a variable in > ArchiveModuleState), and then have the callback return a list of files that > are archived. (Or maybe we just put the list of files that need archiving > in ArchiveModuleState.) The returned list could include files that were > sent to the callback previously. The archive module would be responsible > for creating background worker(s) (if desired), dispatching files > to-be-archived to its background worker(s), and gathering the list of > archived files to return. Hmm. So in this design, the archiver doesn't really do the archiving any more, because the interface makes that impossible. It has to use a separate background worker process for that, full stop. I don't think that's a good design. It's fine if some people want to implement it that way, but it shouldn't be forced by the interface. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 10, 2023 at 12:59 AM Andres Freund <andres@anarazel.de> wrote: > I'm somewhat concerned about that too, but perhaps from a different > angle. First, I think we don't do our users a service by defaulting the > in-core implementation to something that doesn't scale to even a moderately > busy server. +1. > Second, I doubt we'll get the API for any of this right, without > an acutual user that does something more complicated than restoring one-by-one > in a blocking manner. Fair. > I don't think it's that hard to imagine problems. To be reasonably fast, a > decent restore implementation will have to 'restore ahead'. Which also > provides ample things to go wrong. E.g. > > - WAL source is switched, restore module needs to react to that, but doesn't, > we end up lots of wasted work, or worse, filename conflicts > - recovery follows a timeline, restore module doesn't catch on quickly enough > - end of recovery happens, restore just continues on I don't see how you can prevent those things from happening. If the restore process is working in some way that requires an event loop, and I think that will be typical for any kind of remote archiving, then either it has control most of the time, so the event loop can be run inside the restore process, or, as Nathan proposes, we don't let the archiver have control and it needs to run that restore process in a separate background worker. The hazards that you mention here exist either way. If the event loop is running inside the restore process, it can decide not to call the functions that we provide in a timely fashion and thus fail to react as it should. If the event loop runs inside a separate background worker, then that process can fail to be responsive in precisely the same way. Fundamentally, if the author of a restore module writes code to have multiple I/Os in flight at the same time and does not write code to cancel those I/Os if something changes, then such cancellation will not occur. 
That remains true no matter which process is performing the I/O. > > I don't quite see how you can make asynchronous and parallel archiving > > work if the archiver process only calls into the archive module at > > times that it chooses. That would mean that the module has to return > > control to the archiver when it's in the middle of archiving one or > > more files -- and then I don't see how it can get control back at the > > appropriate time. Do you have a thought about that? > > I don't think archiver is the hard part, that already has a dedicated > process, and it also has something of a queuing system already. The startup > process imo is the complicated one... > > If we had a 'restorer' process, startup fed some sort of a queue with things > to restore in the near future, it might be more realistic to do something you > describe? Some kind of queueing system might be a useful part of the interface, and a dedicated restorer process does sound like a good idea. But the archiver doesn't have this solved, precisely because you have to archive a single file, return control, and wait to be invoked again for the next file. That does not scale. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2023-02-16 15:18:57 +0530, Robert Haas wrote: > On Fri, Feb 10, 2023 at 12:59 AM Andres Freund <andres@anarazel.de> wrote: > > I don't think it's that hard to imagine problems. To be reasonably fast, a > > decent restore implementation will have to 'restore ahead'. Which also > > provides ample things to go wrong. E.g. > > > > - WAL source is switched, restore module needs to react to that, but doesn't, > > we end up lots of wasted work, or worse, filename conflicts > > - recovery follows a timeline, restore module doesn't catch on quickly enough > > - end of recovery happens, restore just continues on > > I don't see how you can prevent those things from happening. If the > restore process is working in some way that requires an event loop, > and I think that will be typical for any kind of remote archiving, > then either it has control most of the time, so the event loop can be > run inside the restore process, or, as Nathan proposes, we don't let > the archiver have control and it needs to run that restore process in > a separate background worker. The hazards that you mention here exist > either way. If the event loop is running inside the restore process, > it can decide not to call the functions that we provide in a timely > fashion and thus fail to react as it should. If the event loop runs > inside a separate background worker, then that process can fail to be > responsive in precisely the same way. Fundamentally, if the author of > a restore module writes code to have multiple I/Os in flight at the > same time and does not write code to cancel those I/Os if something > changes, then such cancellation will not occur. That remains true no > matter which process is performing the I/O. IDK. I think we can make that easier or harder. Right now the proposed API doesn't provide anything to allow to address this. 
> > > I don't quite see how you can make asynchronous and parallel archiving > > > work if the archiver process only calls into the archive module at > > > times that it chooses. That would mean that the module has to return > > > control to the archiver when it's in the middle of archiving one or > > > more files -- and then I don't see how it can get control back at the > > > appropriate time. Do you have a thought about that? > > > > I don't think archiver is the hard part, that already has a dedicated > > process, and it also has something of a queuing system already. The startup > > process imo is the complicated one... > > > > If we had a 'restorer' process, startup fed some sort of a queue with things > > to restore in the near future, it might be more realistic to do something you > > describe? > > Some kind of queueing system might be a useful part of the interface, > and a dedicated restorer process does sound like a good idea. But the > archiver doesn't have this solved, precisely because you have to > archive a single file, return control, and wait to be invoked again > for the next file. That does not scale. But there's nothing inherent in that. We know for certain which files we're going to archive. And we don't need to work one-by-one. The archiver could just start multiple subprocesses at the same time. All the blocking it does right now is artificially imposed by the use of system(). We could instead just use something popen()-like and have a configurable number of processes running at the same time. What I was trying to point out was that the work a "restorer" process has to do is more speculative, because we don't know when we'll promote, whether we'll follow a timeline increase, whether the to-be-restored WAL already exists. That's solvable, but a bunch of the relevant work ought to be solved in core code, instead of just in archive modules. Greetings, Andres Freund
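To illustrate how mechanically simple the multi-subprocess approach is, here is a minimal self-contained POSIX sketch; the function name and batch interface are invented for illustration, and it deliberately ignores everything a real archiver must also do (fsync, .done files, per-file retry policy):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run a batch of shell commands concurrently and wait for all of them.
 * Returns true only if every command exits with status 0.  Illustrative
 * only: no fsync, no .done files, no per-file error reporting.
 */
static bool
archive_batch(char *const commands[], int ncmds)
{
    pid_t      *pids = calloc(ncmds, sizeof(pid_t));
    bool        ok = true;

    for (int i = 0; i < ncmds; i++)
    {
        pids[i] = fork();
        if (pids[i] == 0)
        {
            /* child: hand the command to the shell, as system() would */
            execl("/bin/sh", "sh", "-c", commands[i], (char *) NULL);
            _exit(127);         /* exec failed */
        }
        if (pids[i] < 0)
            ok = false;         /* fork failed; keep going for the rest */
    }

    for (int i = 0; i < ncmds; i++)
    {
        int         status;

        if (pids[i] <= 0)
            continue;
        if (waitpid(pids[i], &status, 0) != pids[i] ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            ok = false;
    }

    free(pids);
    return ok;
}
```

With something like this, the number of concurrently running commands becomes a simple knob, which is the property being argued for here.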
On Thu, Feb 16, 2023 at 03:08:14PM +0530, Robert Haas wrote: > On Thu, Feb 9, 2023 at 10:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> I've been thinking about this, actually. I'm wondering if we could provide >> a list of files to the archiving callback (configurable via a variable in >> ArchiveModuleState), and then have the callback return a list of files that >> are archived. (Or maybe we just put the list of files that need archiving >> in ArchiveModuleState.) The returned list could include files that were >> sent to the callback previously. The archive module would be responsible >> for creating background worker(s) (if desired), dispatching files >> to-be-archived to its background worker(s), and gathering the list of >> archived files to return. > > Hmm. So in this design, the archiver doesn't really do the archiving > any more, because the interface makes that impossible. It has to use a > separate background worker process for that, full stop. > > I don't think that's a good design. It's fine if some people want to > implement it that way, but it shouldn't be forced by the interface. I don't think it would force you to use a background worker, but if you wanted to, the tools would be available. At least, that is the intent. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Feb 16, 2023 at 10:28 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > Hmm. So in this design, the archiver doesn't really do the archiving > > any more, because the interface makes that impossible. It has to use a > > separate background worker process for that, full stop. > > > > I don't think that's a good design. It's fine if some people want to > > implement it that way, but it shouldn't be forced by the interface. > > I don't think it would force you to use a background worker, but if you > wanted to, the tools would be available. At least, that is the intent. I'm 100% amenable to somebody demonstrating how that is super easy, barely an inconvenience. But I think we would need to see some code showing at least what the API is going to look like, and ideally a sample implementation, in order for me to be convinced of that. What I suspect is that if somebody tries to do that they are going to find that the core API has to be quite opinionated about how the archive module has to do things, which I think is not what we want. But if that turns out to be false, cool! -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Feb 16, 2023 at 10:02 PM Andres Freund <andres@anarazel.de> wrote: > But there's nothing inherent in that. We know for certain which files we're > going to archive. And we don't need to work one-by-one. The archiver could > just start multiple subprocesses at the same time. But what if it doesn't want to start multiple processes, just multiplex within a single process? > What I was trying to point out was that the work a "restorer" process has to > do is more speculative, because we don't know when we'll promote, whether > we'll follow a timeline increase, whether the to-be-restored WAL already > exists. That's solvable, but a bunch of the relevant work ought to be solved > in core core code, instead of just in archive modules. Yep, I can see that there are some things to figure out there, and I agree that they should be figured out in the core code. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2023-02-18 15:51:06 +0530, Robert Haas wrote: > On Thu, Feb 16, 2023 at 10:02 PM Andres Freund <andres@anarazel.de> wrote: > > But there's nothing inherent in that. We know for certain which files we're > > going to archive. And we don't need to work one-by-one. The archiver could > > just start multiple subprocesses at the same time. > > But what if it doesn't want to start multiple processes, just > multiplex within a single process? To me that seems even simpler? Nothing but the archiver is supposed to create .done files and nothing is supposed to remove .ready files without the archiver having created the .done files. So the archiver process can scan archive_status until it's done or until N archives have been collected, and then process them at once? Only the creation of the .done files would be serial, but I don't think that's commonly a problem (and could be optimized as well, by creating multiple files and then fsyncing them in a second pass, avoiding N filesystem journal flushes). Maybe I am misunderstanding what you see as the problem? Greetings, Andres Freund
On Sun, Feb 19, 2023 at 2:45 AM Andres Freund <andres@anarazel.de> wrote: > To me that seems even simpler? Nothing but the archiver is supposed to create > .done files and nothing is supposed to remove .ready files without archiver > having created the .done files. So the archiver process can scan > archive_status until its done or until N archives have been collected, and > then process them at once? Only the creation of the .done files would be > serial, but I don't think that's commonly a problem (and could be optimized as > well, by creating multiple files and then fsyncing them in a second pass, > avoiding N filesystem journal flushes). > > Maybe I am misunderstanding what you see as the problem? Well right now the archiver process calls ArchiveFileCB when there's a file ready for archiving, and that process is supposed to archive the whole thing before returning. That pretty obviously seems to preclude having more than one file being archived at the same time. What callback structure do you have in mind to allow for that? I mean, my idea was to basically just have one big callback: ArchiverModuleMainLoopCB(). Which wouldn't return, or perhaps, would only return when archiving was totally caught up and there was nothing more to do right now. And then that callback could call functions like AreThereAnyMoreFilesIShouldBeArchivingAndIfYesWhatIsTheNextOne(). So it would call that function and it would find out about a file and start an HTTP session or whatever and then call that function again and start another HTTP session for the second file and so on until it had as much concurrency as it wanted. And then when it hit the concurrency limit, it would wait until at least one HTTP request finished. At that point it would call HeyEverybodyISuccessfullyArchivedAWalFile(), after which it could again ask for the next file and start a request for that one and so on and so forth. I don't really understand what the other possible model is here, honestly. 
Right now, control remains within the archive module for the entire time that a file is being archived. If we generalize the model to allow multiple files to be in the process of being archived at the same time, the archive module is going to need to have control as long as >= 1 of them are in progress, at least AFAICS. If you have some other idea how it would work, please explain it to me... -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Feb 19, 2023 at 08:06:24PM +0530, Robert Haas wrote: > I mean, my idea was to basically just have one big callback: > ArchiverModuleMainLoopCB(). Which wouldn't return, or perhaps, would > only return when archiving was totally caught up and there was nothing > more to do right now. And then that callback could call functions like > AreThereAnyMoreFilesIShouldBeArchivingAndIfYesWhatIsTheNextOne(). So > it would call that function and it would find out about a file and > start an HTTP session or whatever and then call that function again > and start another HTTP session for the second file and so on until it > had as much concurrency as it wanted. And then when it hit the > concurrency limit, it would wait until at least one HTTP request > finished. At that point it would call > HeyEverybodyISuccessfullyArchivedAWalFile(), after which it could > again ask for the next file and start a request for that one and so on > and so forth. This archiving implementation is not completely impossible with the current API infrastructure, either? If you consider the archiving as a two-step process where segments are first copied into a cheap, reliable area, then these could be pushed in bulk to a more remote area like an S3 bucket? Of course this depends on other things like the cluster structure, but redundancy can be added with standby archiving, as well. I am not sure exactly how many requirements we want to push into a callback, to be honest, and surely more requirements pushed to the callback increase the odds of implementation mistakes, like a full loop. There are already many ways to get it wrong with archiving, like missing a flush of the archived segment before the callback returns to ensure its durability.. -- Michael
Greetings, * Michael Paquier (michael@paquier.xyz) wrote: > On Sun, Feb 19, 2023 at 08:06:24PM +0530, Robert Haas wrote: > > I mean, my idea was to basically just have one big callback: > > ArchiverModuleMainLoopCB(). Which wouldn't return, or perhaps, would > > only return when archiving was totally caught up and there was nothing > > more to do right now. And then that callback could call functions like > > AreThereAnyMoreFilesIShouldBeArchivingAndIfYesWhatIsTheNextOne(). So > > it would call that function and it would find out about a file and > > start an HTTP session or whatever and then call that function again > > and start another HTTP session for the second file and so on until it > > had as much concurrency as it wanted. And then when it hit the > > concurrency limit, it would wait until at least one HTTP request > > finished. At that point it would call > > HeyEverybodyISuccessfullyArchivedAWalFile(), after which it could > > again ask for the next file and start a request for that one and so on > > and so forth. > > This archiving implementation is not completely impossible with the > current API infrastructure, either? If you consider the archiving as > a two-step process where segments are first copied into a cheap, > reliable area. Then these could be pushed in block in a more remote > area like a S3 bucket? Of course this depends on other things like > the cluster structure, but redundancy can be added with standby > archiving, as well. Surely it can't be too cheap as it needs to be reliable.. We have looked at this before (copying to a queue area before copying with a separate process off-system) and it simply isn't great and requires more work than you really want to do if you can help it and for no real benefit. > I am not sure exactly how many requirements we want to push into a > callback, to be honest, and surely more requirements pushed to the > callback increases the odds of implementation mistakes, like a full > loop. 
There already many ways to get it wrong with archiving, like > missing a flush of the archived segment before the callback returns to > ensure its durability.. Without any actual user of any of this it's surprising to me how much effort has been put into it. Have I missed the part where someone has said they're actually implementing an archive library that we can look at and see how it works and how the archive library and the core system could work better together..? We (pgbackrest) are generally interested in the idea to reduce the startup time, but that's not really a big issue for us currently and so it hasn't really risen up to the level of being something we're working on, not to mention that if it keeps changing each release then it's just going to end up being more work for us for a feature that doesn't gain us all that much. Now, all that said, at least in initial discussions, we expect the pgbackrest archive_library to look very similar to how we handle archive_command and async archiving today- when called if there's multiple WAL files to process then we fork an async process off and it goes and spawns multiple processes and does its work to move the WAL files to the off-system storage and when we are called via archive_command we just check a status flag to see if that WAL has been archived yet by the async process or not. If not and there's no async process running then we'll start a new one (starting a new async process periodically actually makes things a lot easier to test for us too, which is why we don't just have an async process running around forever- the startup time typically isn't that big of a deal), if there is a status flag then we return whatever it says, and if the async process is running and no status flag yet then we wait. 
Once we have that going then perhaps there could be some interesting iteration between pgbackrest and the core code to improve things, but all this discussion and churn feels more likely to put folks off of trying to implement something using this approach than the opposite, unless someone in this discussion is actually working on an archive library, but that isn't the impression I've gotten, at least (though if there is such a work in progress out there, I'd love to see it!). Thanks, Stephen
On Tue, Feb 14, 2023 at 09:47:55AM -0800, Nathan Bossart wrote: > Here is a new version of the stopgap/back-branch fix for restore_command. > This is more or less a rebased version of v4 with an added stderr message > as Andres suggested upthread. So, this thread has moved around many subjects, but we still did not get to the core of the issue, which is what we should do to avoid sporadic failures like the one mentioned at the top of the thread. Perhaps beginning a new thread with a patch and a summary would be better at this stage? Another thing I am wondering is if it could be possible to test that rather reliably. I have been playing with a few scenarios like holding the system() call for a bit with hardcoded sleep()s, without much success. I'll try harder on that part.. It's been mentioned as well that we could just move away from system() in the long-term. -- Michael
On Tue, Feb 21, 2023 at 1:03 PM Michael Paquier <michael@paquier.xyz> wrote: > Perhaps beginning a new thread with a patch and a summary would be > better at this stage? Another thing I am wondering is if it could be > possible to test that rather reliably. I have been playing with a few > scenarios like holding the system() call for a bit with hardcoded > sleep()s, without much success. I'll try harder on that part.. It's > been mentioned as well that we could just move away from system() in > the long-term. I've been experimenting with ideas for master, which I'll start a new thread about. Actually I was already thinking about this before this broken signal handler stuff came up, because it was already unacceptable that all these places that are connected to shared memory ignore interrupts for unbounded time while a shell script/whatever runs. At first I thought it would be relatively simple to replace system() with something that has a latch wait loop (though there are details to argue about, like whether you want to allow interrupts that throw, and if so, how you clean up the subprocess, which have several plausible solutions). But once I started looking at the related popen-based stuff where you want to communicate with the subprocess (for example COPY FROM PROGRAM), I realised that it needs more analysis and work: that stuff is currently entirely based on stdio FILE (that is, fread() and fwrite()), but it's not really possible (at least portably) to make that nonblocking, and in fact it's a pretty terrible interface in terms of error reporting in general. I've been sketching/thinking about a new module called 'subprocess', with a couple of ways to start processes, and interact with them via WaitEventSet and direct pipe I/O; or if buffering is needed, it'd be our own, not <stdio.h>'s. But don't let me stop anyone else proposing ideas.
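The basic shape of such a latch-friendly system() replacement is easy to sketch. In this self-contained toy, a nanosleep() poll stands in for the WaitEventSet/latch wait, and shutdown_requested, the SIGTERM cancellation policy, and the function name are all invented for illustration:

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t shutdown_requested = 0;    /* set by a signal handler */

/*
 * Interruptible stand-in for system(): fork/exec the command, then wait
 * in a loop so the caller can react to shutdown_requested between
 * checks instead of blocking inside libc.  Returns the raw wait status,
 * or -1 on failure or cancellation.
 */
static int
interruptible_system(const char *command)
{
    pid_t       pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        execl("/bin/sh", "sh", "-c", command, (char *) NULL);
        _exit(127);             /* exec failed */
    }

    for (;;)
    {
        int         status;
        pid_t       res = waitpid(pid, &status, WNOHANG);

        if (res == pid)
            return status;
        if (res < 0)
            return -1;

        if (shutdown_requested)
        {
            /* policy decision: cancel the command and reap the child */
            kill(pid, SIGTERM);
            (void) waitpid(pid, &status, 0);
            return -1;
        }

        /* stand-in for a WaitLatch()-style wait */
        struct timespec ts = {0, 10 * 1000 * 1000};     /* 10 ms */
        nanosleep(&ts, NULL);
    }
}
```

A real version would also have to settle the decision points mentioned above, e.g. whether interrupts may throw mid-wait and who cleans up the child in that case.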
On Tue, Feb 21, 2023 at 09:03:27AM +0900, Michael Paquier wrote: > On Tue, Feb 14, 2023 at 09:47:55AM -0800, Nathan Bossart wrote: >> Here is a new version of the stopgap/back-branch fix for restore_command. >> This is more or less a rebased version of v4 with an added stderr message >> as Andres suggested upthread. > > So, this thread has moved around many subjects, still we did not get > to the core of the issue which is what we should try to do to avoid > sporadic failures like what the top of the thread is mentioning. > > Perhaps beginning a new thread with a patch and a summary would be > better at this stage? Another thing I am wondering is if it could be > possible to test that rather reliably. I have been playing with a few > scenarios like holding the system() call for a bit with hardcoded > sleep()s, without much success. I'll try harder on that part.. It's > been mentioned as well that we could just move away from system() in > the long-term. I'm happy to create a new thread if needed, but I can't tell if there is any interest in this stopgap/back-branch fix. Perhaps we should just jump straight to the long-term fix that Thomas is looking into. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Feb 21, 2023 at 5:50 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Tue, Feb 21, 2023 at 09:03:27AM +0900, Michael Paquier wrote: > > Perhaps beginning a new thread with a patch and a summary would be > > better at this stage? Another thing I am wondering is if it could be > > possible to test that rather reliably. I have been playing with a few > > scenarios like holding the system() call for a bit with hardcoded > > sleep()s, without much success. I'll try harder on that part.. It's > > been mentioned as well that we could just move away from system() in > > the long-term. > > I'm happy to create a new thread if needed, but I can't tell if there is > any interest in this stopgap/back-branch fix. Perhaps we should just jump > straight to the long-term fix that Thomas is looking into. Unfortunately the latch-friendly subprocess module proposal I was talking about would be for 17. I may post a thread fairly soon with design ideas + list of problems and decision points as I see them, and hopefully some sketch code, but it won't be a proposal for [/me checks calendar] next week's commitfest and probably wouldn't be appropriate in a final commitfest anyway, and I also have some other existing stuff to clear first. So please do continue with the stopgap ideas. BTW Here's an idea (untested) about how to reproduce the problem. You could copy the source from a system() implementation, call it doomed_system(), and insert kill(-getppid(), SIGQUIT) in between sigprocmask(SIG_SETMASK, &omask, NULL) and exec*(). Parent and self will handle the signal and both reach the proc_exit(). 
The systems that failed are running code like this: https://github.com/openbsd/src/blob/master/lib/libc/stdlib/system.c https://github.com/DragonFlyBSD/DragonFlyBSD/blob/master/lib/libc/stdlib/system.c I'm pretty sure these other implementations could fail in just the same way (they restore the handler before unblocking, so can run it just before exec() replaces the image): https://github.com/freebsd/freebsd-src/blob/main/lib/libc/stdlib/system.c https://github.com/lattera/glibc/blob/master/sysdeps/posix/system.c The glibc one is a bit busier and, huh, has a lock (I guess maybe deadlockable if proc_exit() also calls system(), but hopefully it doesn't), and uses fork() instead of vfork(), but I don't think that's a material difference here (with fork(), parent and child run concurrently, while with vfork() the parent is suspended until the child exits or execs, and then processes its pending signals).
On Wed, Feb 22, 2023 at 09:48:10PM +1300, Thomas Munro wrote: > On Tue, Feb 21, 2023 at 5:50 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> On Tue, Feb 21, 2023 at 09:03:27AM +0900, Michael Paquier wrote: >> > Perhaps beginning a new thread with a patch and a summary would be >> > better at this stage? Another thing I am wondering is if it could be >> > possible to test that rather reliably. I have been playing with a few >> > scenarios like holding the system() call for a bit with hardcoded >> > sleep()s, without much success. I'll try harder on that part.. It's >> > been mentioned as well that we could just move away from system() in >> > the long-term. >> >> I'm happy to create a new thread if needed, but I can't tell if there is >> any interest in this stopgap/back-branch fix. Perhaps we should just jump >> straight to the long-term fix that Thomas is looking into. > > Unfortunately the latch-friendly subprocess module proposal I was > talking about would be for 17. I may post a thread fairly soon with > design ideas + list of problems and decision points as I see them, and > hopefully some sketch code, but it won't be a proposal for [/me checks > calendar] next week's commitfest and probably wouldn't be appropriate > in a final commitfest anyway, and I also have some other existing > stuff to clear first. So please do continue with the stopgap ideas. I've created a new thread for the stopgap fix [0]. [0] https://postgr.es/m/20230223231503.GA743455%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2023-02-19 20:06:24 +0530, Robert Haas wrote: > On Sun, Feb 19, 2023 at 2:45 AM Andres Freund <andres@anarazel.de> wrote: > > To me that seems even simpler? Nothing but the archiver is supposed to create > > .done files and nothing is supposed to remove .ready files without archiver > > having created the .done files. So the archiver process can scan > > archive_status until its done or until N archives have been collected, and > > then process them at once? Only the creation of the .done files would be > > serial, but I don't think that's commonly a problem (and could be optimized as > > well, by creating multiple files and then fsyncing them in a second pass, > > avoiding N filesystem journal flushes). > > > > Maybe I am misunderstanding what you see as the problem? > > Well right now the archiver process calls ArchiveFileCB when there's a > file ready for archiving, and that process is supposed to archive the > whole thing before returning. That pretty obviously seems to preclude > having more than one file being archived at the same time. What > callback structure do you have in mind to allow for that? TBH, I think the current archive and restore module APIs aren't useful. I think it was a mistake to add archive modules without having demonstrated that one can do something useful with them that the restore_command didn't already do. If anything, archive modules have made it harder to improve archiving performance via concurrency. My point was that it's easy to have multiple archive commands in process at the same time, because we already have a queuing system, and that archive_command is entirely compatible with doing that, because running multiple subprocesses is pretty trivial. It wasn't that the archive API is suitable for that. > I mean, my idea was to basically just have one big callback: > ArchiverModuleMainLoopCB(). 
Which wouldn't return, or perhaps, would > only return when archiving was totally caught up and there was nothing > more to do right now. And then that callback could call functions like > AreThereAnyMoreFilesIShouldBeArchivingAndIfYesWhatIsTheNextOne(). So > it would call that function and it would find out about a file and > start an HTTP session or whatever and then call that function again > and start another HTTP session for the second file and so on until it > had as much concurrency as it wanted. And then when it hit the > concurrency limit, it would wait until at least one HTTP request > finished. At that point it would call > HeyEverybodyISuccessfullyArchivedAWalFile(), after which it could > again ask for the next file and start a request for that one and so on > and so forth. > I don't really understand what the other possible model is here, > honestly. Right now, control remains within the archive module for the > entire time that a file is being archived. If we generalize the model > to allow multiple files to be in the process of being archived at the > same time, the archive module is going to need to have control as long > as >= 1 of them are in progress, at least AFAICS. If you have some > other idea how it would work, please explain it to me... I don't think that a main loop approach is the only viable one. It might be the most likely to succeed one though. As an alternative, consider something like struct ArchiveFileState { int fd; enum WaitFor { READ, WRITE, CONNECT }; void *file_private; } typedef bool (*ArchiveFileStartCB)(ArchiveModuleState *state, ArchiveFileState *file_state, const char *file, const char *path); typedef bool (*ArchiveFileContinueCB)(ArchiveModuleState *state, ArchiveFileState *file_state); An archive module could open an HTTP connection, do IO until it's blocked, put the fd in file_state, return. 
The main loop could run a big event loop around all of the file descriptors and, whenever any of the FDs signals that IO is ready, call ArchiveFileContinueCB() for that file. I don't know if that's better than ArchiverModuleMainLoopCB(). I can see both advantages and disadvantages. Greetings, Andres Freund
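To make the difference between the two shapes concrete, here is a self-contained toy of the callback-per-file variant. The struct and callback names follow the sketch above, but everything else — the mock "module" that "archives" by reading a completion byte from a child's pipe, and the poll()-based driver — is invented for illustration:

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct ArchiveFileState
{
    int         fd;             /* descriptor the module is waiting on */
    void       *file_private;   /* module-private per-file state */
} ArchiveFileState;

/* Mock "start" callback: spawn a child that pretends to archive the
 * file, handing its pipe back to the archiver's event loop. */
static bool
mock_file_start(ArchiveFileState *st, const char *file)
{
    int         fds[2];
    pid_t       pid;

    (void) file;                /* a real module would use this */
    if (pipe(fds) != 0)
        return false;
    pid = fork();
    if (pid < 0)
        return false;
    if (pid == 0)
    {
        close(fds[0]);
        (void) write(fds[1], "done", 4);    /* pretend to archive */
        _exit(0);
    }
    close(fds[1]);
    st->fd = fds[0];
    return true;
}

/* Mock "continue" callback: consume progress; true means finished. */
static bool
mock_file_continue(ArchiveFileState *st)
{
    char        buf[16];

    if (read(st->fd, buf, sizeof(buf)) <= 0)
    {
        close(st->fd);          /* EOF: the child is done */
        return true;
    }
    return false;
}

/* Driver: the archiver's event loop, multiplexing all in-flight files. */
static int
run_archiver(const char *files[], int nfiles)
{
    ArchiveFileState *states = calloc(nfiles, sizeof(*states));
    struct pollfd *pfds = calloc(nfiles, sizeof(*pfds));
    int         done = 0;

    for (int i = 0; i < nfiles; i++)
    {
        if (!mock_file_start(&states[i], files[i]))
            return -1;
        pfds[i].fd = states[i].fd;
        pfds[i].events = POLLIN;
    }

    while (done < nfiles)
    {
        if (poll(pfds, nfiles, -1) < 0)
            return -1;
        for (int i = 0; i < nfiles; i++)
        {
            if (pfds[i].fd >= 0 && (pfds[i].revents & (POLLIN | POLLHUP)))
            {
                if (mock_file_continue(&states[i]))
                {
                    pfds[i].fd = -1;    /* poll() ignores negative fds */
                    done++;
                }
            }
        }
    }

    while (wait(NULL) > 0)
        ;                       /* reap the children */
    free(states);
    free(pfds);
    return done;
}
```

The main-loop variant would move everything inside run_archiver() into the module; the callback variant keeps the loop in core and only hands the module control when one of its descriptors is ready.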
On Sat, Feb 25, 2023 at 11:00:31AM -0800, Andres Freund wrote: > TBH, I think the current archive and restore module APIs aren't useful. I > think it was a mistake to add archive modules without having demonstrated that > one can do something useful with them that the restore_command didn't already > do. If anything, archive modules have made it harder to improve archiving > performance via concurrency. I must respectfully disagree that this work is useless. Besides the performance and security benefits of not shelling out for every WAL file, I've found it very useful to be able to use the standard module framework to develop archive modules. It's relatively easy to make use of GUCs, background workers, compression, etc. Of course, there is room for improvement in areas like concurrency support as you rightly point out, but I don't think that makes the current state worthless. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Greetings, * Nathan Bossart (nathandbossart@gmail.com) wrote: > On Sat, Feb 25, 2023 at 11:00:31AM -0800, Andres Freund wrote: > > TBH, I think the current archive and restore module APIs aren't useful. I > > think it was a mistake to add archive modules without having demonstrated that > > one can do something useful with them that the restore_command didn't already > > do. If anything, archive modules have made it harder to improve archiving > > performance via concurrency. > > I must respectfully disagree that this work is useless. Besides the > performance and security benefits of not shelling out for every WAL file, > I've found it very useful to be able to use the standard module framework > to develop archive modules. It's relatively easy to make use of GUCs, > background workers, compression, etc. Of course, there is room for > improvement in areas like concurrency support as you rightly point out, but > I don't think that makes the current state worthless. Would be great to see these archive modules, perhaps it would help show how this functionality is useful and what could be done in core to make things easier for the archive module. Thanks, Stephen
On Tue, Feb 28, 2023 at 11:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Sat, Feb 25, 2023 at 11:00:31AM -0800, Andres Freund wrote: > > TBH, I think the current archive and restore module APIs aren't useful. I > > think it was a mistake to add archive modules without having demonstrated that > > one can do something useful with them that the restore_command didn't already > > do. If anything, archive modules have made it harder to improve archiving > > performance via concurrency. > > I must respectfully disagree that this work is useless. Besides the > performance and security benefits of not shelling out for every WAL file, > I've found it very useful to be able to use the standard module framework > to develop archive modules. It's relatively easy to make use of GUCs, > background workers, compression, etc. Of course, there is room for > improvement in areas like concurrency support as you rightly point out, but > I don't think that makes the current state worthless. I also disagree with Andres. The status quo ante was that we did not provide any way of doing archiving correctly even to a directory on the local machine. We could only recommend silly things like 'cp' that are incorrect in multiple ways. basic_archive isn't the most wonderful thing ever, and its deficiencies are more obvious to me now than they were when I committed it. But it's better than recommending a shell command that doesn't even work. -- Robert Haas EDB: http://www.enterprisedb.com