Thread: Weird failure with latches in curculio on v15

Weird failure with latches in curculio on v15

From: Michael Paquier
Hi all,

While browsing the buildfarm, I have noticed this failure on curculio:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-02-01%2001%3A05%3A17

The test that has reported a failure is the check on the archive
module callback:
#   Failed test 'check shutdown callback of shell archive module'
#   at t/020_archive_status.pl line 248.
# Looks like you failed 1 test of 17.
[02:28:06] t/020_archive_status.pl ..............
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/17 subtests

Looking closer, this is a result of an assertion failure in the latch
code:
2023-02-01 02:28:05.615 CET [6961:8] LOG:  received fast shutdown request
2023-02-01 02:28:05.615 CET [6961:9] LOG:  aborting any active transactions
2023-02-01 02:28:05.616 CET [30681:9] LOG:  process 30681 releasing ProcSignal slot 33, but it contains 0
TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File: "latch.c", Line: 451, PID: 30681)

The information available in standby2.log shows that 30681 is the
startup process.  I am not sure what all that means, yet.

Thoughts or comments welcome.
--
Michael


Re: Weird failure with latches in curculio on v15

From: Andres Freund
Hi,

On 2023-02-01 10:53:17 +0900, Michael Paquier wrote:
> While browsing the buildfarm, I have noticed this failure on curculio:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-02-01%2001%3A05%3A17
> 
> The test that has reported a failure is the check on the archive
> module callback:
> #   Failed test 'check shutdown callback of shell archive module'
> #   at t/020_archive_status.pl line 248.
> # Looks like you failed 1 test of 17.
> [02:28:06] t/020_archive_status.pl .............. 
> Dubious, test returned 1 (wstat 256, 0x100)
> Failed 1/17 subtests 
> 
> Looking closer, this is a result of an assertion failure in the latch
> code:
> 2023-02-01 02:28:05.615 CET [6961:8] LOG:  received fast shutdown request
> 2023-02-01 02:28:05.615 CET [6961:9] LOG:  aborting any active transactions
> 2023-02-01 02:28:05.616 CET [30681:9] LOG:  process 30681 releasing ProcSignal slot 33, but it contains 0
> TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File: "latch.c", Line: 451, PID: 30681)

Given the ProcSignal LOG message before it, I don't think this is about
latches.


> The information available in standby2.log shows that 30681 is the
> startup process.  I am not sure what all that means, yet.
>
> Thoughts or comments welcome.

Perhaps a wild write overwriting shared memory state?

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From: Andres Freund
Hi,

On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
> My database of assertion failures has four like that, all on 15 and master:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-02-01%2001:05:17
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-01-11%2011:16:40
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2022-11-22%2012:19:21
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-11-17%2021:47:02
> 
> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
> handler which is allowed to call that while in_restore_command is
> true.

Ugh, no wonder we're getting crashes. This whole business seems bogus as
hell.


RestoreArchivedFile():
...
    /*
     * Check signals before restore command and reset afterwards.
     */
    PreRestoreCommand();

    /*
     * Copy xlog from archival storage to XLOGDIR
     */
    ret = shell_restore(xlogfname, xlogpath, lastRestartPointFname);

    PostRestoreCommand();


/* SIGTERM: set flag to abort redo and exit */
static void
StartupProcShutdownHandler(SIGNAL_ARGS)
{
    int            save_errno = errno;

    if (in_restore_command)
        proc_exit(1);
    else
        shutdown_requested = true;
    WakeupRecovery();

    errno = save_errno;
}

Where PreRestoreCommand()/PostRestoreCommand() set in_restore_command.



There's *a lot* of stuff happening inside shell_restore() that's not
compatible with doing proc_exit() inside a signal handler. We're
allocating memory! Interacting with stdout!

There's also the fact that system() isn't signal safe, but that's much
less likely to be the problematic issue.


This appears to have gotten worse over a sequence of commits. The
following commits each added something between PreRestoreCommand() and
PostRestoreCommand().


commit 1b06d7bac901e5fd20bba597188bae2882bf954b
Author: Fujii Masao <fujii@postgresql.org>
Date:   2021-11-22 10:28:21 +0900
 
    Report wait events for local shell commands like archive_command.

added pgstat_report_wait_start/end. Unlikely to cause big issues, but
not good.


commit 7fed801135bae14d63b11ee4a10f6083767046d8
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2022-08-29 13:55:38 -0400

    Clean up inconsistent use of fflush().

Made it a bit worse by adding an fflush(). That certainly seems like it
could cause hangs.


commit 9a740f81eb02e04179d78f3df2ce671276c27b07
Author: Michael Paquier <michael@paquier.xyz>
Date:   2023-01-16 16:31:43 +0900

    Refactor code in charge of running shell-based recovery commands

which completely broke the mechanism. We suddenly run the entirety of
shell_restore(), which does pallocs etc to build the string passed to
system, and raises errors, all within a section in which a signal
handler can invoke proc_exit().  That's just completely broken.


Sorry, but particularly in this area, you've got to be a heck of a lot
more careful.

I don't see a choice but to revert the recent changes. They need a
fairly large rewrite.

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
>> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
>> handler which is allowed to call that while in_restore_command is
>> true.

> Ugh, no wonder we're getting crashes. This whole business seems bogus as
> hell.

Indeed :-(

> I don't see a choice but to revert the recent changes. They need a
> fairly large rewrite.

9a740f81e clearly made things a lot worse, but it wasn't great
before.  Can we see a way forward to removing the problem entirely?

The fundamental issue is that we have no good way to break out
of system(), and I think the original idea was that
in_restore_command would be set *only* for the duration of the
system() call.  That's clearly been lost sight of completely,
but maybe as a stopgap we could try to get back to that.

            regards, tom lane



Re: Weird failure with latches in curculio on v15

From: Nathan Bossart
On Wed, Feb 01, 2023 at 10:12:26AM -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
>>> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
>>> handler which is allowed to call that while in_restore_command is
>>> true.
> 
>> Ugh, no wonder we're getting crashes. This whole business seems bogus as
>> hell.
> 
> Indeed :-(

Ugh.  My bad.

> The fundamental issue is that we have no good way to break out
> of system(), and I think the original idea was that
> in_restore_command would be set *only* for the duration of the
> system() call.  That's clearly been lost sight of completely,
> but maybe as a stopgap we could try to get back to that.

+1.  I'll produce some patches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From: Andres Freund
Hi,

On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
> >> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
> >> handler which is allowed to call that while in_restore_command is
> >> true.
>
> > Ugh, no wonder we're getting crashes. This whole business seems bogus as
> > hell.
>
> Indeed :-(
>
> > I don't see a choice but to revert the recent changes. They need a
> > fairly large rewrite.
>
> 9a740f81e clearly made things a lot worse, but it wasn't great
> before.  Can we see a way forward to removing the problem entirely?

Yea, I think we can - we should stop relying on system(). If we instead
run the command properly as a subprocess, we don't need to do bad things
in the signal handler anymore.


> The fundamental issue is that we have no good way to break out
> of system(), and I think the original idea was that
> in_restore_command would be set *only* for the duration of the
> system() call.  That's clearly been lost sight of completely,
> but maybe as a stopgap we could try to get back to that.

We could push the functions setting in_restore_command down into
ExecuteRecoveryCommand(). But I don't think that'd end up necessarily
being right either - we'd now use the mechanism in places we previously
didn't (cleanup/end commands).

And there's just plenty of other stuff in 14bdb3f13de and 9a740f81eb0 that
doesn't look right:
- We now have two places open-coding what BuildRestoreCommand did

- I'm doubtful that the new shell_* functions are the base for a good
  API to abstract restoring files

- the error message for a failed restore command seems to have gotten
  worse:
  could not restore file \"%s\" from archive: %s"
  ->
  "%s \"%s\": %s", commandName, command

- shell_* imo is not a good namespace for something called from xlog.c,
  xlogarchive.c. I realize the intention is that shell_archive.c is
  going to be its own "restore module", but for now it imo looks odd

- The comment moved out of RestoreArchivedFile() seems less
  useful at its new location

- explanation of why we use GetOldestRestartPoint() is halfway lost


My name is listed as the first Reviewed-by, but I certainly haven't done
any meaningful review of these patches. I just replied to the top-level
email proposing "recovery modules".

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From: Robert Haas
On Wed, Feb 1, 2023 at 11:58 AM Andres Freund <andres@anarazel.de> wrote:
> > 9a740f81e clearly made things a lot worse, but it wasn't great
> > before.  Can we see a way forward to removing the problem entirely?
>
> Yea, I think we can - we should stop relying on system(). If we instead
> run the command properly as a subprocess, we don't need to do bad things
> in the signal handler anymore.

I like the idea of not relying on system(). In most respects, doing
fork() + exec() ourselves seems superior. We can control where the
output goes, what we do while waiting, etc. But system() runs the
command through the shell, so that for example you don't have to
invent your own way of splitting a string into words to be passed to
exec[whatever](). I've never understood how you're supposed to get
that behavior other than by calling system().

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Weird failure with latches in curculio on v15

From: Andres Freund
Hi,

On 2023-02-01 12:08:24 -0500, Robert Haas wrote:
> On Wed, Feb 1, 2023 at 11:58 AM Andres Freund <andres@anarazel.de> wrote:
> > > 9a740f81e clearly made things a lot worse, but it wasn't great
> > > before.  Can we see a way forward to removing the problem entirely?
> >
> > Yea, I think we can - we should stop relying on system(). If we instead
> > run the command properly as a subprocess, we don't need to do bad things
> > in the signal handler anymore.
> 
> I like the idea of not relying on system(). In most respects, doing
> fork() + exec() ourselves seems superior. We can control where the
> output goes, what we do while waiting, etc. But system() runs the
> command through the shell, so that for example you don't have to
> invent your own way of splitting a string into words to be passed to
> exec[whatever](). I've never understood how you're supposed to get
> that behavior other than by calling system().

We could just exec the shell in the forked process, using -c to invoke
the command. That should give us pretty much the same efficiency as
system(), with a lot more control.

I think we already do that somewhere. <dig>. Ah, yes, spawn_process() in
pg_regress.c.  I suspect we couldn't use exec for restore_command etc,
as I think it's not uncommon to use && in the command.


Perhaps we should abstract the relevant pieces of spawn_process()
into something more general? The OS specifics are sufficiently
complicated that I don't think it'd be good to have multiple copies.


It's too bad that we have the history of passing things to the shell;
otherwise we could define common argument handling for the GUC and just
execve ourselves, but that ship has sailed.

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> On 2023-02-01 12:08:24 -0500, Robert Haas wrote:
>> I like the idea of not relying on system(). In most respects, doing
>> fork() + exec() ourselves seems superior. We can control where the
>> output goes, what we do while waiting, etc. But system() runs the
>> command through the shell, so that for example you don't have to
>> invent your own way of splitting a string into words to be passed to
>> exec[whatever](). I've never understood how you're supposed to get
>> that behavior other than by calling system().

> We could just exec the shell in the forked process, using -c to invoke
> the command. That should give us pretty much the same efficiency as
> system(), with a lot more control.

The main thing that system() brings to the table is platform-specific
knowledge of where the shell is.  I'm not very sure that we want to
wire in "/bin/sh".

            regards, tom lane



Re: Weird failure with latches in curculio on v15

From: Nathan Bossart
On Wed, Feb 01, 2023 at 08:58:01AM -0800, Andres Freund wrote:
> On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
>> The fundamental issue is that we have no good way to break out
>> of system(), and I think the original idea was that
>> in_restore_command would be set *only* for the duration of the
>> system() call.  That's clearly been lost sight of completely,
>> but maybe as a stopgap we could try to get back to that.
> 
> We could push the functions setting in_restore_command down into
> ExecuteRecoveryCommand(). But I don't think that'd end up necessarily
> being right either - we'd now use the mechanism in places we previously
> didn't (cleanup/end commands).

Right, we'd only want to set it for restore_command.  I think that's
doable.

> And there's just plenty of other stuff in 14bdb3f13de and 9a740f81eb0 that
> doesn't look right:
> - We now have two places open-coding what BuildRestoreCommand did

This was done because BuildRestoreCommand() had become a thin wrapper
around replace_percent_placeholders().  I can add it back if you don't
think this was the right decision.

> - I'm doubtful that the new shell_* functions are the base for a good
>   API to abstract restoring files

Why?

> - the error message for a failed restore command seems to have gotten
>   worse:
>   could not restore file \"%s\" from archive: %s"
>   ->
>   "%s \"%s\": %s", commandName, command

Okay, I'll work on improving this message.

> - shell_* imo is not a good namespace for something called from xlog.c,
>   xlogarchive.c. I realize the intention is that shell_archive.c is
>   going to be its own "restore module", but for now it imo looks odd

What do you propose instead?  FWIW this should go away with recovery
modules.  This is just an intermediate state to simplify those patches.

> - The comment moved out of RestoreArchivedFile() seems less
>   useful at its new location

Where do you think it should go?

> - explanation of why we use GetOldestRestartPoint() is halfway lost

Okay, I'll work on adding more context here.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From: Andres Freund
Hi,

On 2023-02-01 12:27:19 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-02-01 12:08:24 -0500, Robert Haas wrote:
> >> I like the idea of not relying on system(). In most respects, doing
> >> fork() + exec() ourselves seems superior. We can control where the
> >> output goes, what we do while waiting, etc. But system() runs the
> >> command through the shell, so that for example you don't have to
> >> invent your own way of splitting a string into words to be passed to
> >> exec[whatever](). I've never understood how you're supposed to get
> >> that behavior other than by calling system().
> 
> > We could just exec the shell in the forked process, using -c to invoke
> > the command. That should give us pretty much the same efficiency as
> > system(), with a lot more control.
> 
> The main thing that system() brings to the table is platform-specific
> knowledge of where the shell is.  I'm not very sure that we want to
> wire in "/bin/sh".

We seem to be doing OK with using SHELLPROG in pg_regress, which just
seems to be using $SHELL from the build environment.

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From: Nathan Bossart
On Wed, Feb 01, 2023 at 09:58:06AM -0800, Nathan Bossart wrote:
> On Wed, Feb 01, 2023 at 08:58:01AM -0800, Andres Freund wrote:
>> On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
>>> The fundamental issue is that we have no good way to break out
>>> of system(), and I think the original idea was that
>>> in_restore_command would be set *only* for the duration of the
>>> system() call.  That's clearly been lost sight of completely,
>>> but maybe as a stopgap we could try to get back to that.
>> 
>> We could push the functions setting in_restore_command down into
>> ExecuteRecoveryCommand(). But I don't think that'd end up necessarily
>> being right either - we'd now use the mechanism in places we previously
>> didn't (cleanup/end commands).
> 
> Right, we'd only want to set it for restore_command.  I think that's
> doable.

Here is a first draft for the proposed stopgap fix.  If we want to proceed
with this, I can provide patches for the back branches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Weird failure with latches in curculio on v15

From: Michael Paquier
On Wed, Feb 01, 2023 at 10:18:27AM -0800, Andres Freund wrote:
> On 2023-02-01 12:27:19 -0500, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>> The main thing that system() brings to the table is platform-specific
>> knowledge of where the shell is.  I'm not very sure that we want to
>> wire in "/bin/sh".
>
> We seem to be doing OK with using SHELLPROG in pg_regress, which just
> seems to be using $SHELL from the build environment.

If that were to happen, it looks like we had better centralize a bit
more of the logic from pg_regress.c, to avoid duplicating the fuzzy
WIN32 handling.  That becomes invasive for a back-patch.

By the way, there is something that's been itching me a bit here.  9a740f8
has enlarged by a lot the window between PreRestoreCommand() and
PostRestoreCommand(), however curculio has reported a failure on
REL_15_STABLE, where we only manipulate my_wait_event_info while the
flag is on.  Or am I getting it right that there is no way out of this
unless we remove the dependency on system(), even in the back-branches?
Could there be an extra missing piece here?
--
Michael


Re: Weird failure with latches in curculio on v15

From: Michael Paquier
On Wed, Feb 01, 2023 at 02:35:55PM -0800, Nathan Bossart wrote:
> Here is a first draft for the proposed stopgap fix.  If we want to proceed
> with this, I can provide patches for the back branches.

> +    /*
> +     * PreRestoreCommand() is used to tell the SIGTERM handler for the startup
> +     * process that it is okay to proc_exit() right away on SIGTERM.  This is
> +     * done for the duration of the system() call because there isn't a good
> +     * way to break out while it is executing.  Since we might call proc_exit()
> +     * in a signal handler here, it is extremely important that nothing but the
> +     * system() call happens between the calls to PreRestoreCommand() and
> +     * PostRestoreCommand().  Any additional code must go before or after this
> +     * section.
> +     */
> +    if (exitOnSigterm)
> +        PreRestoreCommand();
> +
>      rc = system(command);
> +
> +    if (exitOnSigterm)
> +        PostRestoreCommand();
> +
>      pgstat_report_wait_end();

Hmm.  Isn't that something that we should also document in startup.c
where both routines are defined?  If we begin to use
PreRestoreCommand() and PostRestoreCommand() in more code paths in the
future, that could be an issue again.  That looks enough to me to
reduce the window back to what it was before 9a740f8, as exitOnSigterm
is only used for restore_command.  There is a different approach
possible here: rely more on wait_event_info rather than failOnSignal
and exitOnSigterm to decide which code path should do what.

Andres Freund wrote:
> - the error message for a failed restore command seems to have gotten
>   worse:
>   could not restore file \"%s\" from archive: %s"
>   ->
>   "%s \"%s\": %s", commandName, command

IMO, we don't lose any context with this method: the command type and
the command string itself are the bits actually relevant.  Perhaps
something like this would be more intuitive?  One idea:
"could not execute command for %s: %s", commandName, command

> - shell_* imo is not a good namespace for something called from xlog.c,
>   xlogarchive.c. I realize the intention is that shell_archive.c is
>   going to be its own "restore module", but for now it imo looks odd

shell_restore.c does not sound that bad to me, FWIW.  The parallel
with the archive counterparts is there.  My recent history is not that
good when it comes to naming, though, based on the feedback I have
received.

> And there's just plenty of other stuff in 14bdb3f13de and 9a740f81eb0 that
> doesn't look right:
> - We now have two places open-coding what BuildRestoreCommand did

Yeah, BuildRestoreCommand() was just a small wrapper on top of the new
percentrepl.c, making it rather irrelevant at this stage, IMO, for the
two code paths where it was called.

> - The comment moved out of RestoreArchivedFile() seems less
>   useful at its new location

We are talking about that:
-   /*
-    * Remember, we rollforward UNTIL the restore fails so failure here is
-    * just part of the process... that makes it difficult to determine
-    * whether the restore failed because there isn't an archive to restore,
-    * or because the administrator has specified the restore program
-    * incorrectly.  We have to assume the former.
-    *
-    * However, if the failure was due to any sort of signal, it's best to
-    * punt and abort recovery.  (If we "return false" here, upper levels will
-    * assume that recovery is complete and start up the database!) It's
-    * essential to abort on child SIGINT and SIGQUIT, because per spec
-    * system() ignores SIGINT and SIGQUIT while waiting; if we see one of
-    * those it's a good bet we should have gotten it too.
-    *
-    * On SIGTERM, assume we have received a fast shutdown request, and exit
-    * cleanly. It's pure chance whether we receive the SIGTERM first, or the
-    * child process. If we receive it first, the signal handler will call
-    * proc_exit, otherwise we do it here. If we or the child process received
-    * SIGTERM for any other reason than a fast shutdown request, postmaster
-    * will perform an immediate shutdown when it sees us exiting
-    * unexpectedly.
-    *
-    * We treat hard shell errors such as "command not found" as fatal, too.
-    */

The failure processing is tied to the way we build and handle the
command given down to system(), so keeping this in shell_restore.c (or
whatever name you think would be a better fit) makes sense to me.
Now, thinking a bit more about this, we could just push the description
down to ExecuteRecoveryCommand(), which actually does the work,
adapting the comment based on the refactored internals of the
routine.
--
Michael


Re: Weird failure with latches in curculio on v15

From: Tom Lane
Michael Paquier <michael@paquier.xyz> writes:
> Hmm.  Isn't that something that we should also document in startup.c
> where both routines are defined?  If we begin to use
> PreRestoreCommand() and PostRestoreCommand() in more code paths in the
> future, that could be again an issue.

I was vaguely wondering about removing both of those functions
in favor of an integrated function that does a system() call
with those things before and after it.

            regards, tom lane



Re: Weird failure with latches in curculio on v15

From: Michael Paquier
On Wed, Feb 01, 2023 at 09:34:44PM -0500, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> Hmm.  Isn't that something that we should also document in startup.c
>> where both routines are defined?  If we begin to use
>> PreRestoreCommand() and PostRestoreCommand() in more code paths in the
>> future, that could be again an issue.
>
> I was vaguely wondering about removing both of those functions
> in favor of an integrated function that does a system() call
> with those things before and after it.

It seems to me that this is pretty much the same as storing
in_restore_command in shell_restore.c, and that for recovery modules
this comes down to the addition of an extra callback called in
startup.c to check if the flag is up or not.  Now the patch does
things the opposite way: as on HEAD, the flag is stored in startup.c but
switched at will through startup.c's routines.  I find the approach
of the patch a bit more intuitive, TBH, as that makes the interface
simpler for other recovery modules that may want to switch the flag
back-and-forth, and I suspect that there may be cases in recovery
modules where we'd still want to switch the flag, but not necessarily
link it to system().
--
Michael


Re: Weird failure with latches in curculio on v15

From: Nathan Bossart
On Thu, Feb 02, 2023 at 01:24:15PM +0900, Michael Paquier wrote:
> On Wed, Feb 01, 2023 at 09:34:44PM -0500, Tom Lane wrote:
>> I was vaguely wondering about removing both of those functions
>> in favor of an integrated function that does a system() call
>> with those things before and after it.
> 
> It seems to me that this is pretty much the same as storing
> in_restore_command in shell_restore.c, and that for recovery modules
> this comes down to the addition of an extra callback called in
> startup.c to check if the flag is up or not.  Now the patch is doing 
> things the opposite way: like on HEAD, store the flag in startup.c but
> switch it at will with the routines in startup.c.  I find the approach
> of the patch a bit more intuitive, TBH, as that makes the interface
> simpler for other recovery modules that may want to switch the flag
> back-and-forth, and I suspect that there may be cases in recovery
> modules where we'd still want to switch the flag, but not necessarily
> link it to system().

Hm.  I don't know if we want to encourage further use of
in_restore_command since it seems to be prone to misuse.  Here's a v2 that
demonstrates Tom's idea (bikeshedding on names and comments is welcome).  I
personally like this approach a bit more.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Weird failure with latches in curculio on v15

From: Robert Haas
On Thu, Feb 2, 2023 at 3:10 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Hm.  I don't know if we want to encourage further use of
> in_restore_command since it seems to be prone to misuse.  Here's a v2 that
> demonstrates Tom's idea (bikeshedding on names and comments is welcome).  I
> personally like this approach a bit more.

+       /*
+        * When exitOnSigterm is set and we are in the startup process, use the
+        * special wrapper for system() that enables exiting immediately upon
+        * receiving SIGTERM.  This ensures we can break out of system() if
+        * required.
+        */

This comment, for me, raises more questions than it answers. Why do we
only do this in the startup process?

Also, and this part is not the fault of this patch but a defect of the
pre-existing comments, under what circumstances do we not want to exit
when we get a SIGTERM? It's standard behavior for PostgreSQL backends
to exit when they receive SIGTERM, so the question isn't why we
sometimes exit immediately but why we ever don't. The existing code
calls ExecuteRecoveryCommand with exitOnSigterm true in some cases and
false in other cases, and AFAICS there are zero words of comments
explaining the reasoning.

+       if (exitOnSigterm && MyBackendType == B_STARTUP)
+               rc = RunInterruptibleShellCommand(command);
+       else
+               rc = system(command);

And this looks like pure magic. I'm all in favor of not relying on
system(), but using it under some opaque set of conditions and
otherwise doing something else is not the way. At the very least this
needs to be explained a whole lot better.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Weird failure with latches in curculio on v15

From: Nathan Bossart
On Thu, Feb 02, 2023 at 04:14:54PM -0500, Robert Haas wrote:
> +       /*
> +        * When exitOnSigterm is set and we are in the startup process, use the
> +        * special wrapper for system() that enables exiting immediately upon
> +        * receiving SIGTERM.  This ensures we can break out of system() if
> +        * required.
> +        */
> 
> This comment, for me, raises more questions than it answers. Why do we
> only do this in the startup process?

Currently, this functionality only exists in the startup process because it
is only used for restore_command.  More below...

> Also, and this part is not the fault of this patch but a defect of the
> pre-existing comments, under what circumstances do we not want to exit
> when we get a SIGTERM? It's standard behavior for PostgreSQL backends
> to exit when they receive SIGTERM, so the question isn't why we
> sometimes exit immediately but why we ever don't. The existing code
> calls ExecuteRecoveryCommand with exitOnSigterm true in some cases and
> false in other cases, and AFAICS there are zero words of comments
> explaining the reasoning.

I've been digging into the history here.  This e-mail seems to have the
most context [0].  IIUC this was intended to prevent "fast" shutdowns from
escalating to "immediate" shutdowns because the restore command died
unexpectedly.  This doesn't apply to archive_cleanup_command because we
don't FATAL if it dies unexpectedly.  It seems like this idea should apply
to recovery_end_command, too, but AFAICT it doesn't use the same approach.
My guess is that this hasn't come up because it's less likely that both 1)
recovery_end_command is used and 2) someone initiates shutdown while it is
running.

BTW the relevant commits are cdd46c7 (added SIGTERM handling for
restore_command), 9e403c2 (added recovery_end_command), and c21ac0b (added
what is today called archive_cleanup_command).

> +       if (exitOnSigterm && MyBackendType == B_STARTUP)
> +               rc = RunInterruptibleShellCommand(command);
> +       else
> +               rc = system(command);
> 
> And this looks like pure magic. I'm all in favor of not relying on
> system(), but using it under some opaque set of conditions and
> otherwise doing something else is not the way. At the very least this
> needs to be explained a whole lot better.

If we applied this exit-on-SIGTERM behavior to recovery_end_command, I
think we could combine failOnSignal and exitOnSigterm into one flag, and
then it might be a little easier to explain what is going on.  In any case,
I agree that this deserves a lengthy explanation, which I'll continue to
work on.

[0] https://postgr.es/m/499047FE.9090407%40enterprisedb.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Thu, Feb 02, 2023 at 02:01:13PM -0800, Nathan Bossart wrote:
> I've been digging into the history here.  This e-mail seems to have the
> most context [0].  IIUC this was intended to prevent "fast" shutdowns from
> escalating to "immediate" shutdowns because the restore command died
> unexpectedly.  This doesn't apply to archive_cleanup_command because we
> don't FATAL if it dies unexpectedly.  It seems like this idea should apply
> to recovery_end_command, too, but AFAICT it doesn't use the same approach.
> My guess is that this hasn't come up because it's less likely that both 1)
> recovery_end_command is used and 2) someone initiates shutdown while it is
> running.

Actually, this still doesn't really explain why we need to exit immediately
in the SIGTERM handler for restore_command.  We already have handling for
when the command indicates it exited due to SIGTERM, so it should be no
problem if the command receives it before the startup process.  And
HandleStartupProcInterrupts() should exit at an appropriate time after the
startup process receives SIGTERM.

My guess was that this is meant to allow breaking out of the system() call,
but I don't understand why that's important here.  Maybe we could just
remove this exit-in-SIGTERM-handler business...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Thu, Feb 02, 2023 at 02:39:19PM -0800, Nathan Bossart wrote:
> Maybe we could just
> remove this exit-in-SIGTERM-handler business...

I've spent some time testing this.  It seems to work pretty well, but only
if I keep the exit-on-SIGTERM logic in shell_restore().  Without that, I'm
seeing delayed shutdowns, which I assume means
HandleStartupProcInterrupts() isn't getting called (I'm still investigating
this).  In any case, the fact that shell_restore() exits if the command
fails due to SIGTERM seems like an implementation detail that we won't
necessarily want to rely on once recovery modules are available.  In short,
we seem to depend on the SIGTERM handling in RestoreArchivedFile() in order
to be responsive to shutdown requests.

One idea I have is to approximate the current behavior by simply checking
for the shutdown_requested flag before and after executing
restore_command.  This seems to work as desired even if the exit-on-SIGTERM
logic is removed from shell_restore().  Unless there is some reason to
break out of system() (versus just waiting for the command to fail after it
receives SIGTERM), I think this approach should suffice.
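The check-before-and-after idea could be sketched roughly like this (a minimal sketch, not the attached patch; `shutdown_requested` and `run_restore_command` are illustrative names standing in for the startup process's flag and for the shell_restore() call site):

```c
#include <signal.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the flag set by the startup process's
 * SIGTERM handler; names here are illustrative, not from the patch. */
static volatile sig_atomic_t shutdown_requested = false;

/* Bracket the uninterruptible system() call with explicit shutdown
 * checks, instead of calling proc_exit() from inside the handler. */
static int
run_restore_command(const char *command)
{
	if (shutdown_requested)
		exit(1);				/* proc_exit(1) in the real code */

	int			rc = system(command);	/* blocks until the command ends */

	if (shutdown_requested)
		exit(1);				/* command finished, but we were told to stop */

	return rc;
}
```

The worst case is then waiting out one full restore_command invocation before reacting to the shutdown request, rather than interrupting it mid-flight.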

I've attached a draft patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-02 10:23:21 +0900, Michael Paquier wrote:
> On Wed, Feb 01, 2023 at 10:18:27AM -0800, Andres Freund wrote:
> > On 2023-02-01 12:27:19 -0500, Tom Lane wrote:
> >> Andres Freund <andres@anarazel.de> writes:
> >> The main thing that system() brings to the table is platform-specific
> >> knowledge of where the shell is.  I'm not very sure that we want to
> >> wire in "/bin/sh".
> > 
> > We seem to be doing OK with using SHELLPROG in pg_regress, which just
> > seems to be using $SHELL from the build environment.
> 
> It looks like this had better centralize a bit more of the logic from
> pg_regress.c if that were to happen, to avoid more fuzzy logic with
> WIN32.  That becomes invasive for a back-patch.

I don't think we should consider backpatching such a change. There's
enough subtlety that I'd want to see it bake for some time.


> By the way, there is something that's itching me a bit here.  9a740f8
> has enlarged by a lot the window between PreRestoreCommand() and
> PostRestoreCommand(), however curculio has reported a failure on
> REL_15_STABLE, where we only manipulate my_wait_event_info while the
> flag is on.  Or am I getting it right that there is no way out of it
> unless we remove the dependency on system() even in the back-branches?
> Could there be an extra missing piece here?

Yea, that's indeed odd.


Ugh, I think I might understand what's happening:

The signal arrives just after the fork() (within system()). Because we
have all our processes configure themselves as process group leaders,
and we signal the entire process group (c.f. signal_child()), both the
child process and the parent will process the signal. So we'll end up
doing a proc_exit() in both. As both are trying to remove themselves
from the same PGPROC etc entry, that doesn't end well.

I don't see how we can solve that properly as long as we use system().

A workaround for the back branches could be to have a test in
StartupProcShutdownHandler() that tests if MyProcPid == getpid(), and
not do the proc_exit() if they don't match. We probably should just do
an _exit() in that case.


I doubt the idea of signaling the entire process group in signal_child()
is good. I regularly see core dumps of archive commands because we sent
SIGQUIT during an immediate shutdown, and of course cp etc don't have a
SIGQUIT handler, and the default action is to core dump.

But a replacement for it is not a small amount of work. While a
subprocess is running, we can't just handle SIGQUIT with _exit();
we need to first signal the child with
something appropriate (SIGKILL?).

OTOH, the current approach only works on systems with setsid(2) support,
so we probably shouldn't rely so hard on it anyway.

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Ugh, I think I might understand what's happening:

> The signal arrives just after the fork() (within system()). Because we
> have all our processes configure themselves as process group leaders,
> and we signal the entire process group (c.f. signal_child()), both the
> child process and the parent will process the signal. So we'll end up
> doing a proc_exit() in both. As both are trying to remove themselves
> from the same PGPROC etc entry, that doesn't end well.

Ugh ...

> I don't see how we can solve that properly as long as we use system().

... but I don't see how that's system()'s fault?  Doing the fork()
ourselves wouldn't change anything about that.

> A workaround for the back branches could be to have a test in
> StartupProcShutdownHandler() that tests if MyProcPid == getpid(), and
> not do the proc_exit() if they don't match. We probably should just do
> an _exit() in that case.

Might work.

> OTOH, the current approach only works on systems with setsid(2) support,
> so we probably shouldn't rely so hard on it anyway.

setsid(2) is required since SUSv2, so I'm not sure which systems
are of concern here ... other than Redmond's of course.

            regards, tom lane



Re: Weird failure with latches in curculio on v15

From
Thomas Munro
Date:
On Fri, Feb 3, 2023 at 8:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Ugh, I think I might understand what's happening:
>
> > The signal arrives just after the fork() (within system()). Because we
> > have all our processes configure themselves as process group leaders,
> > and we signal the entire process group (c.f. signal_child()), both the
> > child process and the parent will process the signal. So we'll end up
> > doing a proc_exit() in both. As both are trying to remove themselves
> > from the same PGPROC etc entry, that doesn't end well.
>
> Ugh ...

Yuck, but yeah that makes sense.

> > I don't see how we can solve that properly as long as we use system().
>
> ... but I don't see how that's system()'s fault?  Doing the fork()
> ourselves wouldn't change anything about that.

What if we block signals, fork, then in the child, install the default
SIGTERM handler, then unblock, and then exec the shell?  If SIGTERM is
delivered either before or after exec (but before whatever is loaded
installs a new handler) then the child is terminated, but without
running the handler.  Isn't that what we want here?



Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-02 14:39:19 -0800, Nathan Bossart wrote:
> Actually, this still doesn't really explain why we need to exit immediately
> in the SIGTERM handler for restore_command.  We already have handling for
> when the command indicates it exited due to SIGTERM, so it should be no
> problem if the command receives it before the startup process.  And
> HandleStartupProcInterrupts() should exit at an appropriate time after the
> startup process receives SIGTERM.

> My guess was that this is meant to allow breaking out of the system() call,
> but I don't understand why that's important here.  Maybe we could just
> remove this exit-in-SIGTERM-handler business...

I don't think you can, at least not easily. For one, we have no
guarantee that the child process got a signal at all - we don't have a
hard dependency on setsid(). And even if we have setsid(), there's no
guarantee that the executed process reacts to SIGTERM and that the child
didn't create its own process group (and thus isn't reached by the
signal to the process group, sent in signal_child()).

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Thomas Munro
Date:
On Fri, Feb 3, 2023 at 8:35 PM Andres Freund <andres@anarazel.de> wrote:
> we don't have a
> hard dependency on setsid()

FTR There are no Unixes without setsid()...  HAVE_SETSID was only left
in the tree because we were discussing whether to replace it with
!defined(WIN32) or whether that somehow made things more confusing,
but then while trying to figure out what to do about that, I noticed
that Windows *does* have a near-equivalent thing, or IIRC several
things like that, and that kinda stopped me in my tracks.



Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-03 02:24:03 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Ugh, I think I might understand what's happening:
> 
> > The signal arrives just after the fork() (within system()). Because we
> > have all our processes configure themselves as process group leaders,
> > and we signal the entire process group (c.f. signal_child()), both the
> > child process and the parent will process the signal. So we'll end up
> > doing a proc_exit() in both. As both are trying to remove themselves
> > from the same PGPROC etc entry, that doesn't end well.
> 
> Ugh ...
> 
> > I don't see how we can solve that properly as long as we use system().
> 
> ... but I don't see how that's system()'s fault?  Doing the fork()
> ourselves wouldn't change anything about that.

If we did the fork ourselves, we'd temporarily change the signal mask
before the fork() and reset it immediately in the parent, but not in the
child. We can't do that with system(), because we don't get control back
early enough - we'd just block signals for the entire duration of
system().

I wonder if this shows a problem with the change in 14 to make pgarch.c
be attached to shared memory. Before that it didn't have to worry about
problems like the above in the archiver, but now we do. It's less severe
than the startup process issue, because we don't have a comparable
signal handler in pgarch, but still.

I'm e.g. not sure that there aren't issues with
procsignal_sigusr1_handler() or such executing in a forked process.


> > A workaround for the back branches could be to have a test in
> > StartupProcShutdownHandler() that tests if MyProcPid == getpid(), and
> > not do the proc_exit() if they don't match. We probably should just do
> > an _exit() in that case.
> 
> Might work.

I wonder if we should add code complaining loudly about such a mismatch
to proc_exit(), in addition to handling it more silently in
StartupProcShutdownHandler().  Also, an assertion in
[Auxiliary]ProcKill that proc->pid == MyProcPid == getpid() seems like a
good idea.


> > OTOH, the current approach only works on systems with setsid(2) support,
> > so we probably shouldn't rely so hard on it anyway.
> 
> setsid(2) is required since SUSv2, so I'm not sure which systems
> are of concern here ... other than Redmond's of course.

I was thinking of windows, yes.

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, Feb 3, 2023 at 8:35 PM Andres Freund <andres@anarazel.de> wrote:
>> we don't have a hard dependency on setsid()

> FTR There are no Unixes without setsid()...

Yeah.  What I just got done reading in SUSv2 (1997) is
"Derived from the POSIX.1-1988 standard".  We need not
concern ourselves with any systems not having it.

            regards, tom lane



Re: Weird failure with latches in curculio on v15

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2023-02-03 02:24:03 -0500, Tom Lane wrote:
>> setsid(2) is required since SUSv2, so I'm not sure which systems
>> are of concern here ... other than Redmond's of course.

> I was thinking of windows, yes.

But given the lack of fork(2), Windows requires a completely
different solution anyway, no?

            regards, tom lane



Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-03 20:34:36 +1300, Thomas Munro wrote:
> What if we block signals, fork, then in the child, install the default
> SIGTERM handler, then unblock, and then exec the shell?

Yep.  I was momentarily wondering why we'd even need to unblock signals,
but while exec (et al) reset the signal handler, they don't reset the
mask...

We could, for good measure, do PGSharedMemoryDetach() etc. But I don't
think it's quite worth it if we're careful with signals.  However
ClosePostmasterPorts() might be a good idea? I think not doing it might
cause issues like keeping the listen sockets alive after we shut down
postmaster, preventing us from starting up again?

Looks like PR_SET_PDEATHSIG isn't reset across an execve(). But that
actually seems good?


> If SIGTERM is delivered either before or after exec (but before
> whatever is loaded installs a new handler) then the child is
> terminated, but without running the handler.  Isn't that what we want
> here?

Yep, I think so.

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-03 02:50:38 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-02-03 02:24:03 -0500, Tom Lane wrote:
> >> setsid(2) is required since SUSv2, so I'm not sure which systems
> >> are of concern here ... other than Redmond's of course.
>
> > I was thinking of windows, yes.
>
> But given the lack of fork(2), Windows requires a completely
> different solution anyway, no?

Not sure it needs to be that different. I think what we basically want
is:

1) Something vaguely popen() shaped that starts a subprocess, while
   being careful about signal handlers, returning the pid of the child
   process. Not sure if we want to redirect stdout/stderr or
   not. Probably not?

2) A blocking wrapper around 1) that takes care to forward fatal signals
   to the subprocess, including in the SIGQUIT case and probably being
   interruptible with query cancels etc in the relevant process types.


Thinking about popen() suggests that we have a similar problem with COPY
FROM PROGRAM as we have in pgarch (i.e. not as bad as the startup
process issue, but still not great, due to
procsignal_sigusr1_handler()).


What's worse, the problem exists for untrusted PLs as well, and
obviously we can't ensure that signals are correctly masked there.

This seems to suggest that we ought to install a MyProcPid != getpid()
style defense in all our signal handlers...

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Thomas Munro
Date:
On Fri, Feb 3, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:
> Thinking about popen() suggests that we have a similar problem with COPY
> FROM PROGRAM as we have in pgarch (i.e. not as bad as the startup
> process issue, but still not great, due to
> procsignal_sigusr1_handler()).

A small mercy: while we promote some kinds of fatal-ish signals to
group level with kill(-PID, ...), we don't do that for SIGUSR1 for
latches or procsignals.



Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On February 3, 2023 9:19:23 AM GMT+01:00, Thomas Munro <thomas.munro@gmail.com> wrote:
>On Fri, Feb 3, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:
>> Thinking about popen() suggests that we have a similar problem with COPY
>> FROM PROGRAM as we have in pgarch (i.e. not as bad as the startup
>> process issue, but still not great, due to
>> procsignal_sigusr1_handler()).
>
>A small mercy: while we promote some kinds of fatal-ish signals to
>group level with kill(-PID, ...), we don't do that for SIGUSR1 for
>latches or procsignals.

Not as bad, but we still do SetLatch() from a bunch of places that would be reached...

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Thu, Feb 02, 2023 at 11:44:22PM -0800, Andres Freund wrote:
> On 2023-02-03 02:24:03 -0500, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>> > A workaround for the back branches could be to have a test in
>> > StartupProcShutdownHandler() that tests if MyProcPid == getpid(), and
>> > not do the proc_exit() if they don't match. We probably should just do
>> > an _exit() in that case.
>> 
>> Might work.
> 
> I wonder if we should add code complaining loudly about such a mismatch
> to proc_exit(), in addition to handling it more silently in
> StartupProcShutdownHandler().  Also, an assertion in
> [Auxiliary]ProcKill that proc->pid == MyProcPid == getpid() seems like a
> good idea.

From the discussion, it sounds like we don't want to depend on the child
process receiving/handling the signal, so we can't get rid of the
break-out-of-system() behavior (at least not in back-branches).  I've put
together some work-in-progress patches for the stopgap/back-branch fix.

0001 is just v1-0001 from upthread.  This moves Pre/PostRestoreCommand to
surround only the call to system().  I think this should get us closer to
pre-v15 behavior.

0002 adds the getpid() check mentioned above to
StartupProcShutdownHandler(), and it adds assertions to proc_exit() and
[Auxiliary]ProcKill().

0003 adds checks for shutdown requests before and after the call to
shell_restore().  IMO the Pre/PostRestoreCommand stuff is an implementation
detail for restore_command, so I think it behooves us to have some
additional shutdown checks that apply even for recovery modules.  This
patch could probably be moved to the recovery modules thread.

Is this somewhat close to what folks had in mind?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Weird failure with latches in curculio on v15

From
Michael Paquier
Date:
On Fri, Feb 03, 2023 at 10:54:17AM -0800, Nathan Bossart wrote:
> 0001 is just v1-0001 from upthread.  This moves Pre/PostRestoreCommand to
> surround only the call to system().  I think this should get us closer to
> pre-v15 behavior.

+   if (exitOnSigterm)
+       PreRestoreCommand();
+
    rc = system(command);
+
+   if (exitOnSigterm)
+       PostRestoreCommand();

I don't really want to let that hang around on HEAD much longer, so
I'm OK to do that for HEAD, then figure out what needs to be done for
the older issue at hand.

+   /*
+    * PreRestoreCommand() is used to tell the SIGTERM handler for the startup
+    * process that it is okay to proc_exit() right away on SIGTERM.  This is
+    * done for the duration of the system() call because there isn't a good
+    * way to break out while it is executing.  Since we might call proc_exit()
+    * in a signal handler here, it is extremely important that nothing but the
+    * system() call happens between the calls to PreRestoreCommand() and
+    * PostRestoreCommand().  Any additional code must go before or after this
+    * section.
+    */

Still, it seems to me that the large comment block in shell_restore()
ought to be moved to ExecuteRecoveryCommand(), no?  The assumptions
under which one can use exitOnSigterm and failOnSignal could then be
documented in that function's header comment.
--
Michael

Attachment

Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-03 10:54:17 -0800, Nathan Bossart wrote:
> @@ -146,7 +146,25 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
>       */
>      fflush(NULL);
>      pgstat_report_wait_start(wait_event_info);
> +
> +    /*
> +     * PreRestoreCommand() is used to tell the SIGTERM handler for the startup
> +     * process that it is okay to proc_exit() right away on SIGTERM.  This is
> +     * done for the duration of the system() call because there isn't a good
> +     * way to break out while it is executing.  Since we might call proc_exit()
> +     * in a signal handler here, it is extremely important that nothing but the
> +     * system() call happens between the calls to PreRestoreCommand() and
> +     * PostRestoreCommand().  Any additional code must go before or after this
> +     * section.
> +     */
> +    if (exitOnSigterm)
> +        PreRestoreCommand();
> +
>      rc = system(command);
> +
> +    if (exitOnSigterm)
> +        PostRestoreCommand();
> +
>      pgstat_report_wait_end();
>
>      if (rc != 0)

It's somewhat weird that we now call the startup-process specific
PreRestoreCommand/PostRestoreCommand() in other processes than the
startup process. Gated on a variable that's not immediately obviously
tied to being in the startup process.

I think at least we ought to add comments to PreRestoreCommand /
PostRestoreCommand that they need to be robust against being called
outside of the startup process, and similarly a comment in
ExecuteRecoveryCommand(), explaining that all this stuff just works in
the startup process.


> diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
> index bcd23542f1..503eb1a5a6 100644
> --- a/src/backend/postmaster/startup.c
> +++ b/src/backend/postmaster/startup.c
> @@ -19,6 +19,8 @@
>   */
>  #include "postgres.h"
>  
> +#include <unistd.h>
> +
>  #include "access/xlog.h"
>  #include "access/xlogrecovery.h"
>  #include "access/xlogutils.h"
> @@ -121,7 +123,17 @@ StartupProcShutdownHandler(SIGNAL_ARGS)
>      int            save_errno = errno;
>  
>      if (in_restore_command)
> -        proc_exit(1);
> +    {
> +        /*
> +         * If we are in a child process (e.g., forked by system() in
> +         * shell_restore()), we don't want to call any exit callbacks.  The
> +         * parent will take care of that.
> +         */
> +        if (MyProcPid == (int) getpid())
> +            proc_exit(1);
> +        else
> +            _exit(1);

I think it might be worth adding something like
  const char msg[] = "StartupProcShutdownHandler() called in child process";
  write(STDERR_FILENO, msg, sizeof(msg) - 1);
to this path. Otherwise it might end up being a very hard-to-debug path.

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-02 11:06:19 +0900, Michael Paquier wrote:
> > - the error message for a failed restore command seems to have gotten
> >   worse:
> >   could not restore file \"%s\" from archive: %s"
> >   ->
> >   "%s \"%s\": %s", commandName, command
> 
> IMO, we don't lose any context with this method: the command type and
> the command string itself are the bits actually relevant.  Perhaps
> something like that would be more intuitive?  One idea:
> "could not execute command for %s: %s", commandName, command

We do - you now can't identify the filename that failed without parsing
the command, which obviously isn't easily possible generically, because,
well, it's user-configurable.  I like that the command is logged now,
but I think we need the filename to be at a predictable position as well.


> > - shell_* imo is not a good namespace for something called from xlog.c,
> >   xlogarchive.c. I realize the intention is that shell_archive.c is
> >   going to be its own "restore module", but for now it imo looks odd
> 
> shell_restore.c does not sound that bad to me, FWIW.  The parallel
> with the archive counterparts is here.  My recent history is not that
> good when it comes to naming, based on the feedback I received,
> though.

I don't mind shell_restore.c much - after all, the filename is
namespaced by the directory. However, I do mind function names like
shell_restore(), that could also be about restoring what SHELL is set
to, or whatever. And functions aren't namespaced in C.


> > And there's just plenty of other stuff in 14bdb3f13de and 9a740f81eb0 that
> > doesn't look right:
> > - We now have two places open-coding what BuildRestoreCommand did
> 
> Yeah, BuildRestoreCommand() was just a small wrapper on top of the new
> percentrepl.c, making it rather irrelevant at this stage, IMO.  For
> the two code paths where it was called.

I don't at all agree. Particularly because you didn't even leave a
pointer in each of the places that if you update one, you also need to
update the other.  I don't mind the amount of code it adds, I do mind
that it, without any recognizable reason, implements policy in multiple
places.


> > - The comment moved out of RestoreArchivedFile() doesn't seems less
> >   useful at its new location
> 
> We are talking about that:
> -   /*
> -    * Remember, we rollforward UNTIL the restore fails so failure here is
> -    * just part of the process... that makes it difficult to determine
> -    * whether the restore failed because there isn't an archive to restore,
> -    * or because the administrator has specified the restore program
> -    * incorrectly.  We have to assume the former.
> -    *
> -    * However, if the failure was due to any sort of signal, it's best to
> -    * punt and abort recovery.  (If we "return false" here, upper levels will
> -    * assume that recovery is complete and start up the database!) It's
> -    * essential to abort on child SIGINT and SIGQUIT, because per spec
> -    * system() ignores SIGINT and SIGQUIT while waiting; if we see one of
> -    * those it's a good bet we should have gotten it too.
> -    *
> -    * On SIGTERM, assume we have received a fast shutdown request, and exit
> -    * cleanly. It's pure chance whether we receive the SIGTERM first, or the
> -    * child process. If we receive it first, the signal handler will call
> -    * proc_exit, otherwise we do it here. If we or the child process received
> -    * SIGTERM for any other reason than a fast shutdown request, postmaster
> -    * will perform an immediate shutdown when it sees us exiting
> -    * unexpectedly.
> -    *
> -    * We treat hard shell errors such as "command not found" as fatal, too.
> -    */
> 
> The failure processing is tied to the way we build and handle the
> command passed down to system(), so keeping this in shell_restore.c (or
> whatever name you think would be a better fit) makes sense to me.
> Now, thinking a bit more about this, we could just push the description
> down to ExecuteRecoveryCommand(), which actually does the work,
> adapting the comment based on the refactored internals of the
> routine.

Nothing in the shell_restore() comments explains / asserts that it is
only ever to be called in the startup process.  And outside of the
startup process none of the above actually makes sense.

That's kind of my problem with these changes. They try to introduce new
abstraction layers, but don't provide real abstraction, because they're
very tightly bound to the way the functions were called before the
refactoring.  And none of these restrictions are actually documented.

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Sat, Feb 04, 2023 at 03:30:29AM -0800, Andres Freund wrote:
> That's kind of my problem with these changes. They try to introduce new
> abstraction layers, but don't provide real abstraction, because they're
> very tightly bound to the way the functions were called before the
> refactoring.  And none of these restrictions are actually documented.

Okay.  Michael, why don't we revert the shell_restore stuff for now?  Once
the archive modules interface changes and the fix for this
SIGTERM-during-system() problem are in, I will work through this feedback
and give recovery modules another try.  I'm still hoping to have recovery
modules ready in time for the v16 feature freeze.

My intent was to improve this code by refactoring and reducing code
duplication, but I seem to have missed the mark.  I am sorry.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Michael Paquier
Date:
On Sat, Feb 04, 2023 at 10:03:54AM -0800, Nathan Bossart wrote:
> Okay.  Michael, why don't we revert the shell_restore stuff for now?  Once
> the archive modules interface changes and the fix for this
> SIGTERM-during-system() problem are in, I will work through this feedback
> and give recovery modules another try.  I'm still hoping to have recovery
> modules ready in time for the v16 feature freeze.

Yes, at this stage a revert of the refactoring with shell_restore.c is
the best path forward.

From the discussion, I got the following things on top of my mind, for
reference:
- Should we include archive_cleanup_command into the recovery modules
at all?  We've discussed offloading that from the checkpointer, and it
makes the failure handling trickier when it comes to unexpected GUC
configurations, for one.  The same may actually apply to
restore_end_command.  Though it is done in the startup process now,
there may be an argument to offload that somewhere else based on the
timing of the end-of-recovery checkpoint.  My opinion on this stuff is
that only including restore_command in the modules would make most
users I know of happy enough, as it removes the overhead of the command
invocation from the startup process when replay is fast enough that
the restore command is the bottleneck.
recovery_end_command would be simple enough, but if there is a wish to
redesign the startup process to offload it somewhere else, then the
recovery module makes backward-compatibility concerns harder to think
about in the long-term.
- Do we need to reconsider the assumptions of the startup
process where SIGTERM enforces an immediate shutdown while running
system() for the restore command?  For example, the difference in
behavior when a restore_command uses a system sleep() that does not
react to signals, from what I recall?
- Fixing the original issue of this thread may wind up impacting what
you are trying to do in this area, so fixing it first sounds like a
prerequisite to me in the end, because it may impact the final design
of the modules and their callbacks.  (I have not looked at all the
arguments raised about what to do with ~15; still, it does not look
like we have a clear picture here yet.)
--
Michael

Attachment

Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-04 10:03:54 -0800, Nathan Bossart wrote:
> On Sat, Feb 04, 2023 at 03:30:29AM -0800, Andres Freund wrote:
> > That's kind of my problem with these changes. They try to introduce new
> > abstraction layers, but don't provide real abstraction, because they're
> > very tightly bound to the way the functions were called before the
> > refactoring.  And none of these restrictions are actually documented.
> 
> Okay.  Michael, why don't we revert the shell_restore stuff for now?  Once
> the archive modules interface changes and the fix for this
> SIGTERM-during-system() problem are in, I will work through this feedback
> and give recovery modules another try.  I'm still hoping to have recovery
> modules ready in time for the v16 feature freeze.
> 
> My intent was to improve this code by refactoring and reducing code
> duplication, but I seem to have missed the mark.  I am sorry.

FWIW, I think the patches were going roughly in the right direction;
they just need a bit more work.

I don't think we should expose the proc_exit() hack, and its supporting
infrastructure, to the pluggable *_command logic. It's bad enough as-is,
but having to do this stuff within an extension module seems likely to
end badly. There's just way too much action-at-a-distance.

I think Thomas has been hacking on an interruptible system()
replacement. With that, a lot of this ugliness would be resolved.

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Sun, Feb 05, 2023 at 09:49:57AM +0900, Michael Paquier wrote:
> - Should we include archive_cleanup_command into the recovery modules
> at all?  We've discussed offloading that from the checkpointer, and it
> makes the failure handling trickier when it comes to unexpected GUC
> configurations, for one.  The same may actually apply to
> recovery_end_command.  Though it is done in the startup process now,
> there may be an argument to offload that somewhere else based on the
> timing of the end-of-recovery checkpoint.  My opinion on this stuff is
> that only including restore_command in the modules would make most
> users I know of happy enough as it removes the overhead of the command
> invocation from the startup process, if able to replay things fast
> enough so as the restore command is the bottleneck.
> recovery_end_command would be simple enough, but if there is a wish to
> redesign the startup process to offload it somewhere else, then the
> recovery module makes backward-compatibility concerns harder to think
> about in the long-term.

I agree.  I think we ought to first focus on getting the recovery modules
interface and restore_command functionality in place before we take on more
difficult things like archive_cleanup_command.  But I still think the
archive_cleanup_command/recovery_end_command functionality should
eventually be added to recovery modules.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-05 14:19:38 -0800, Nathan Bossart wrote:
> On Sun, Feb 05, 2023 at 09:49:57AM +0900, Michael Paquier wrote:
> > - Should we include archive_cleanup_command into the recovery modules
> > at all?  We've discussed offloading that from the checkpointer, and it
> > makes the failure handling trickier when it comes to unexpected GUC
> > configurations, for one.  The same may actually apply to
> > recovery_end_command.  Though it is done in the startup process now,
> > there may be an argument to offload that somewhere else based on the
> > timing of the end-of-recovery checkpoint.  My opinion on this stuff is
> > that only including restore_command in the modules would make most
> > users I know of happy enough as it removes the overhead of the command
> > invocation from the startup process, if able to replay things fast
> > enough so as the restore command is the bottleneck.
> > recovery_end_command would be simple enough, but if there is a wish to
> > redesign the startup process to offload it somewhere else, then the
> > recovery module makes backward-compatibility concerns harder to think
> > about in the long-term.
> 
> I agree.  I think we ought to first focus on getting the recovery modules
> interface and restore_command functionality in place before we take on more
> difficult things like archive_cleanup_command.  But I still think the
> archive_cleanup_command/recovery_end_command functionality should
> eventually be added to recovery modules.

I tend not to agree. If you make the API that small, you're IME likely
to end up with something that looks somewhat incoherent once extended.


The more I think about it, the less I am convinced that
one-callback-per-segment, invoked just before needing the file, is the
right approach to address the performance issues of restore_command.

The main performance issue isn't the shell invocation overhead, it's
synchronously needing to restore the archive, before replay can
continue. It's also gonna be slow if a restore module copies the segment
from a remote system - the latency is the problem.

The only way the restore module approach can do better, is to
asynchronously restore ahead of the current segment. But for that the
API really isn't suited well. The signature of the relevant callback is:

> +typedef bool (*RecoveryRestoreCB) (const char *file, const char *path,
> +                                   const char *lastRestartPointFileName);


That's not very suited to restoring "ahead of time". You need to parse
file to figure out whether a segment or something else is restored, turn
"file" back into an LSN, figure out where to store further segments,
somehow hand off to some background worker, etc.

That doesn't strike me as something we want to happen inside multiple
restore libraries.

I think at the very least you'd want to have a separate callback for
restoring segments than for restoring other files. But more likely a
separate callback for each type of file to be restored.

For the timeline history case, a parameter indicating that we don't want
to restore the file, just to see if there's a conflict, would make
sense.

For the segment files, we'd likely need a parameter to indicate whether
the restore is random or not.
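
To make the shape of that concrete, here is a minimal standalone C
sketch of what separate per-file-type callbacks might look like.  All
type and function names below are hypothetical stand-ins (the toy_*
implementations do no real I/O), not the actual proposed PostgreSQL API:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* One callback per kind of file to restore, instead of a single
 * catch-all RecoveryRestoreCB that has to parse the file name. */
typedef bool (*RestoreSegmentCB) (const char *file, const char *path,
                                  bool random_read);  /* random vs. sequential */
typedef bool (*RestoreHistoryCB) (const char *file, const char *path,
                                  bool probe_only);   /* only check for a conflict */

typedef struct RecoveryModuleCallbacks
{
    RestoreSegmentCB restore_segment_cb;
    RestoreHistoryCB restore_history_cb;
} RecoveryModuleCallbacks;

/* Toy implementations standing in for a real module such as shell_restore. */
static bool
toy_restore_segment(const char *file, const char *path, bool random_read)
{
    (void) path;
    (void) random_read;
    /* a real module would copy "file" from the archive into "path" */
    return strlen(file) > 0;
}

static bool
toy_restore_history(const char *file, const char *path, bool probe_only)
{
    (void) file;
    (void) path;
    /* with probe_only set, a real module would only report existence */
    return !probe_only;
}
```

Core code would then dispatch on file type itself, so each restore
library no longer has to reverse-engineer what kind of file it was
handed.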


Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Michael Paquier
Date:
On Sun, Feb 05, 2023 at 09:49:57AM +0900, Michael Paquier wrote:
> Yes, at this stage a revert of the refactoring with shell_restore.c is
> the best path forward.

Done that now, as of 2f6e15a.
--
Michael

Attachment

Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Sun, Feb 05, 2023 at 03:01:57PM -0800, Andres Freund wrote:
> I think at the very least you'd want to have a separate callback for
> restoring segments than for restoring other files. But more likely a
> separate callback for each type of file to be restored.
> 
> For the timeline history case an parameter indicating that we don't want
> to restore the file, just to see if there's a conflict, would make
> sense.

That seems reasonable.

> For the segment files, we'd likely need a parameter to indicate whether
> the restore is random or not.

Wouldn't this approach still require each module to handle restoring ahead
of time?  I agree that the shell overhead isn't the main performance issue,
but it's unclear to me how much of this should be baked into PostgreSQL.  I
mean, we could introduce a GUC that tells us how far ahead to restore and
have a background worker (or multiple background workers) asynchronously
pull files into a staging directory via the callbacks.  Is that the sort of
scope you are envisioning?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-05 15:57:47 -0800, Nathan Bossart wrote:
> > For the segment files, we'd likely need a parameter to indicate whether
> > the restore is random or not.
> 
> Wouldn't this approach still require each module to handle restoring ahead
> of time?

Yes, to some degree at least. I was just describing a few pretty obvious
improvements.

The core code can make that a lot easier though. The problem of where to
store such files can be provided by core code (presumably a separate
directory). A GUC for aggressiveness can be provided. Etc.


> I agree that the shell overhead isn't the main performance issue,
> but it's unclear to me how much of this should be baked into
> PostgreSQL.

I don't know fully either. But just reimplementing all of it in
different modules doesn't seem like a sane approach either. A lot of it
is policy that we need to solve once, centrally.


> I mean, we could introduce a GUC that tells us how far ahead to
> restore and have a background worker (or multiple background workers)
> asynchronously pull files into a staging directory via the callbacks.
> Is that the sort of scope you are envisioning?

Closer, at least.

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Sun, Feb 05, 2023 at 04:07:50PM -0800, Andres Freund wrote:
> On 2023-02-05 15:57:47 -0800, Nathan Bossart wrote:
>> I agree that the shell overhead isn't the main performance issue,
>> but it's unclear to me how much of this should be baked into
>> PostgreSQL.
> 
> I don't know fully either. But just reimplementing all of it in
> different modules doesn't seem like a sane approach either. A lot of it
> is policy that we need to solve once, centrally.
> 
>> I mean, we could introduce a GUC that tells us how far ahead to
>> restore and have a background worker (or multiple background workers)
>> asynchronously pull files into a staging directory via the callbacks.
>> Is that the sort of scope you are envisioning?
> 
> Closer, at least.

Got it.  I suspect we'll want to do something similar for archive modules
eventually, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Robert Haas
Date:
On Sun, Feb 5, 2023 at 7:46 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Got it.  I suspect we'll want to do something similar for archive modules
> eventually, too.

+1.

I felt like the archive modules work was a step forward when we did,
because basic_archive does some things that you're not likely to get
right if you do it on your own. And a similar approach to
restore_command might be also be valuable, at least in my opinion.
However, the gains that we can get out of the archive module facility
in its present form do seem to be somewhat limited, for exactly the
kinds of reasons being discussed here.

I kind of wonder whether we ought to try to flip the model around. At
present, the idea is that the archiver is doing its thing and it makes
callbacks into the archive module. But what if we got rid of the
archiver main loop altogether and put the main loop inside of the
archive module, and have it call back to some functions that we
provide? One function could be like char
*pgarch_next_file_to_be_archived_if_there_is_one_ready(void) and the
other could be like void
pgarch_some_file_that_you_gave_me_previously_is_now_fully_archived(char
*which_one). That way, we'd break the tight coupling where you have to
get a unit of work and perform it in full before you can get the next
unit of work. Some variant of this could work on the restore side,
too, I think, although we have less certainty about how much sense it
makes to prefetch for restore than we do about what needs to be archived.
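
A standalone toy sketch of this inverted control flow, using the two
function names from the paragraph above; the in-memory queue is a
hypothetical stand-in for the real archive_status machinery, and a real
module would be juggling sockets and calling CHECK_FOR_INTERRUPTS()
inside its loop:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the archiver's pending-file bookkeeping. */
#define MAX_PENDING 4
static const char *pending[MAX_PENDING];
static int  n_pending;
static int  next_out;
static int  n_done;

/* Core hands out the next ready file, or NULL if caught up. */
static char *
pgarch_next_file_to_be_archived_if_there_is_one_ready(void)
{
    if (next_out >= n_pending)
        return NULL;
    return (char *) pending[next_out++];
}

/* Module reports a completed file; a real archiver would rename
 * the .ready status file to .done here. */
static void
pgarch_some_file_that_you_gave_me_previously_is_now_fully_archived(char *which_one)
{
    (void) which_one;
    n_done++;
}

/* The module owns the loop: pull work, archive it (trivially here),
 * report completions, and return once fully caught up so core can
 * decide about hibernation. */
static void
toy_archive_module_loop(void)
{
    char   *file;

    while ((file = pgarch_next_file_to_be_archived_if_there_is_one_ready()) != NULL)
        pgarch_some_file_that_you_gave_me_previously_is_now_fully_archived(file);
}
```

Because completions are reported separately from hand-outs, a module
could take several files before acknowledging any of them, which is
what breaks the one-file-at-a-time coupling.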

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Wed, Feb 08, 2023 at 10:22:24AM -0500, Robert Haas wrote:
> I felt like the archive modules work was a step forward when we did,
> because basic_archive does some things that you're not likely to get
> right if you do it on your own. And a similar approach to
> restore_command might be also be valuable, at least in my opinion.
> However, the gains that we can get out of the archive module facility
> in its present form do seem to be somewhat limited, for exactly the
> kinds of reasons being discussed here.

I'm glad to hear that there is interest in taking this stuff to the next
level.  I'm currently planning to first get the basic API in place for
recovery modules like we have for archive modules, but I'm hoping to
position it so that it leads naturally to asynchronous, parallel, and/or
batching approaches down the road (v17?).

> I kind of wonder whether we ought to try to flip the model around. At
> present, the idea is that the archiver is doing its thing and it makes
> callbacks into the archive module. But what if we got rid of the
> archiver main loop altogether and put the main loop inside of the
> archive module, and have it call back to some functions that we
> provide? One function could be like char
> *pgarch_next_file_to_be_archived_if_there_is_one_ready(void) and the
> other could be like void
> pgarch_some_file_that_you_gave_me_previously_is_now_fully_archived(char
> *which_one). That way, we'd break the tight coupling where you have to
> get a unit of work and perform it in full before you can get the next
> unit of work. Some variant of this could work on the restore side,
> too, I think, although we have less certainty about how much it makes
> to prefetch for restore than we do about what needs to be archived.

I think this could be a good approach if we decide not to bake too much
into PostgreSQL itself (e.g., such as creating multiple archive workers
that each call out to the module).  Archive module authors would
effectively need to write their own archiver processes.  That sounds super
flexible, but it also sounds like it might be harder to get right.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Robert Haas
Date:
On Wed, Feb 8, 2023 at 12:43 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I think this could be a good approach if we decide not to bake too much
> into PostgreSQL itself (e.g., such as creating multiple archive workers
> that each call out to the module).  Archive module authors would
> effectively need to write their own archiver processes.  That sounds super
> flexible, but it also sounds like it might be harder to get right.

Yep. That's a problem, and I'm certainly open to better ideas.

However, if we assume that the archive module is likely to be doing
something like juggling a bunch of file descriptors over which it is
speaking HTTP, what other model works, really? It might be juggling
those file descriptors indirectly, or it might be relying on an
intermediate library like curl or something from Amazon that talks to
S3 or whatever, but only it knows what resources it's juggling, or
what functions it needs to call to manage them. On the other hand, we
don't really need a lot from it. We need it to CHECK_FOR_INTERRUPTS()
and handle that without leaking resources or breaking the world in
some way, and we sort of need it to, you know, actually archive stuff,
but apart from that I guess it can do what it likes (unless I'm
missing some other important function of the archiver?).

It's probably a good idea if the archiver function returns when it's
fully caught up and there's no more work to do. Then we could handle
decisions about hibernation in the core code, rather than having every
archive module invent its own way of doing that. But when there's work
happening, as far as I can see, the archive module needs to have
control pretty nearly all the time, or it's not going to be able to do
anything clever.

Always happy to hear if you see it differently....

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Wed, Feb 08, 2023 at 04:24:15PM -0500, Robert Haas wrote:
> On Wed, Feb 8, 2023 at 12:43 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I think this could be a good approach if we decide not to bake too much
>> into PostgreSQL itself (e.g., such as creating multiple archive workers
>> that each call out to the module).  Archive module authors would
>> effectively need to write their own archiver processes.  That sounds super
>> flexible, but it also sounds like it might be harder to get right.
> 
> Yep. That's a problem, and I'm certainly open to better ideas.
> 
> However, if we assume that the archive module is likely to be doing
> something like juggling a bunch of file descriptors over which it is
> speaking HTTP, what other model works, really? It might be juggling
> those file descriptors indirectly, or it might be relying on an
> intermediate library like curl or something from Amazon that talks to
> S3 or whatever, but only it knows what resources it's juggling, or
> what functions it needs to call to manage them. On the other hand, we
> don't really need a lot from it. We need it to CHECK_FOR_INTERRUPTS()
> and handle that without leaking resources or breaking the world in
> some way, and we sort of need it to, you know, actually archive stuff,
> but apart from that I guess it can do what it likes (unless I'm
> missing some other important function of the archiver?).
> 
> It's probably a good idea if the archiver function returns when it's
> fully caught up and there's no more work to do. Then we could handle
> decisions about hibernation in the core code, rather than having every
> archive module invent its own way of doing that. But when there's work
> happening, as far as I can see, the archive module needs to have
> control pretty nearly all the time, or it's not going to be able to do
> anything clever.
> 
> Always happy to hear if you see it differently....

These are all good points.  Perhaps there could be a base archiver
implementation that shell_archive uses (and that other modules could use if
desired, which might be important for backward compatibility with the
existing callbacks).  But if you want to do something fancier than
archiving sequentially, you could write your own.

In any case, I'm not really wedded to any particular approach at the
moment, so I am likewise open to better ideas.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Michael Paquier
Date:
On Wed, Feb 08, 2023 at 02:25:54PM -0800, Nathan Bossart wrote:
> These are all good points.  Perhaps there could be a base archiver
> implementation that shell_archive uses (and that other modules could use if
> desired, which might be important for backward compatibility with the
> existing callbacks).  But if you want to do something fancier than
> archiving sequentially, you could write your own.

Which is basically the kind of thing you can already achieve with a
background worker and a module of your own?

> In any case, I'm not really wedded to any particular approach at the
> moment, so I am likewise open to better ideas.

I am not sure how much we should try to move from core into the
modules when it comes to the current archiver process, with how much
control you'd like to give to users.  It also looks like to me that
this is the kind of problem where we would not have the correct
callback design until someone comes in and develops a solution that
would shape around it.  On top of that, this is the kind of things
achievable with just a bgworker, and perhaps simpler as the parallel
state can just be maintained in it, which is what the archiver process
is about at the end?  Or were there some restrictions in the bgworker
APIs that would not fit with what an archiver process should do?
Perhaps these restrictions, if any, are what we'd better try to lift
first?
--
Michael

Attachment

Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Thu, Feb 09, 2023 at 08:56:24AM +0900, Michael Paquier wrote:
> On Wed, Feb 08, 2023 at 02:25:54PM -0800, Nathan Bossart wrote:
>> These are all good points.  Perhaps there could be a base archiver
>> implementation that shell_archive uses (and that other modules could use if
>> desired, which might be important for backward compatibility with the
>> existing callbacks).  But if you want to do something fancier than
>> archiving sequentially, you could write your own.
> 
> Which is basically the kind of things you can already achieve with a
> background worker and a module of your own?

IMO one of the big pieces that's missing is a way to get the next N files
to archive.  Right now, you'd have to trawl through archive_status on your
own if you wanted to batch/parallelize.  I think one advantage of what
Robert is suggesting is that we could easily provide a supported way to get
the next set of files to archive, and we can asynchronously mark them
"done".  Otherwise, each module has to implement this.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Robert Haas
Date:
On Wed, Feb 8, 2023 at 7:24 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Thu, Feb 09, 2023 at 08:56:24AM +0900, Michael Paquier wrote:
> > On Wed, Feb 08, 2023 at 02:25:54PM -0800, Nathan Bossart wrote:
> >> These are all good points.  Perhaps there could be a base archiver
> >> implementation that shell_archive uses (and that other modules could use if
> >> desired, which might be important for backward compatibility with the
> >> existing callbacks).  But if you want to do something fancier than
> >> archiving sequentially, you could write your own.
> >
> > Which is basically the kind of things you can already achieve with a
> > background worker and a module of your own?
>
> IMO one of the big pieces that's missing is a way to get the next N files
> to archive.  Right now, you'd have to trawl through archive_status on your
> own if you wanted to batch/parallelize.  I think one advantage of what
> Robert is suggesting is that we could easily provide a supported way to get
> the next set of files to archive, and we can asynchronously mark them
> "done".  Otherwise, each module has to implement this.

Right.

I think that we could certainly, as Michael suggests, have people
provide their own background worker rather than having the archiver
invoke the user-supplied code directly. As long as the functions that
you need in order to get the necessary information can be called from
some other process, that's fine. The only difficulty I see is that if
the archiving is happening from a separate background worker rather
than from the archiver, then what is the archiver doing? We could
somehow arrange to not run the archiver process at all, or I guess to
just sit there and have it do nothing. Or, we can decide not to have a
separate background worker and just have the archiver call the
user-supplied code directly. I kind of like that approach at the
moment; it seems more elegant to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Weird failure with latches in curculio on v15

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think that we could certainly, as Michael suggests, have people
> provide their own background worker rather than having the archiver
> invoke the user-supplied code directly. As long as the functions that
> you need in order to get the necessary information can be called from
> some other process, that's fine. The only difficulty I see is that if
> the archiving is happening from a separate background worker rather
> than from the archiver, then what is the archiver doing? We could
> somehow arrange to not run the archiver process at all, or I guess to
> just sit there and have it do nothing. Or, we can decide not to have a
> separate background worker and just have the archiver call the
> user-supplied code directly. I kind of like that approach at the
> moment; it seems more elegant to me.

I'm fairly concerned about the idea of making it common for people
to write their own main loop for the archiver.  That means that, if
we have a bug fix that requires the archiver to do X, we will not
just be patching our own code but trying to get an indeterminate
set of third parties to add the fix to their code.

If we think we need primitives to let the archiver hooks get all
the pending files, or whatever, by all means add those.  But don't
cede fundamental control of the archiver.  The hooks need to be
decoration on a framework we provide, not the framework themselves.

            regards, tom lane



Re: Weird failure with latches in curculio on v15

From
Robert Haas
Date:
On Thu, Feb 9, 2023 at 10:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm fairly concerned about the idea of making it common for people
> to write their own main loop for the archiver.  That means that, if
> we have a bug fix that requires the archiver to do X, we will not
> just be patching our own code but trying to get an indeterminate
> set of third parties to add the fix to their code.

I don't know what kind of bug we could really have in the main loop
that would be common to every implementation. They're probably all
going to check for interrupts, do some work, and then wait for I/O on
some things by calling select() or some equivalent. But the work, and
the wait for the I/O, would be different for every implementation. I
would anticipate that the amount of common code would be nearly zero.

Imagine two archive modules, one of which archives files via HTTP and
the other of which archives them via SSH. They need to do a lot of the
same things, but the code is going to be totally different. When the
HTTP archiver module needs to open a new connection, it's going to
call some libcurl function. When the SSH archiver module needs to do
the same thing, it's going to call some libssh function. It seems
quite likely that the HTTP implementation would want to juggle
multiple connections in parallel, but the SSH implementation might not
want to do that, or its logic for determining how many connections to
open might be completely different based on the behavior of that
protocol vs. the other protocol. Once either implementation has sent
as much data it can over the connections it has open, it needs to wait
for those sockets to become write-ready or, possibly, read-ready.
There again, each one will be calling into a different library to do
that. It could be that in this particular case, both would be waiting
for a set of file descriptors, and we could provide some framework for
waiting on a set of file descriptors provided by the module. But you
could also have some other archiver implementation that is, say,
waiting for a process to terminate rather than for a file descriptor
to become ready for I/O.

> If we think we need primitives to let the archiver hooks get all
> the pending files, or whatever, by all means add those.  But don't
> cede fundamental control of the archiver.  The hooks need to be
> decoration on a framework we provide, not the framework themselves.

I don't quite see how you can make asynchronous and parallel archiving
work if the archiver process only calls into the archive module at
times that it chooses. That would mean that the module has to return
control to the archiver when it's in the middle of archiving one or
more files -- and then I don't see how it can get control back at the
appropriate time. Do you have a thought about that?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Thu, Feb 09, 2023 at 11:12:21AM -0500, Robert Haas wrote:
> On Thu, Feb 9, 2023 at 10:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If we think we need primitives to let the archiver hooks get all
>> the pending files, or whatever, by all means add those.  But don't
>> cede fundamental control of the archiver.  The hooks need to be
>> decoration on a framework we provide, not the framework themselves.
> 
> I don't quite see how you can make asynchronous and parallel archiving
> work if the archiver process only calls into the archive module at
> times that it chooses. That would mean that the module has to return
> control to the archiver when it's in the middle of archiving one or
> more files -- and then I don't see how it can get control back at the
> appropriate time. Do you have a thought about that?

I've been thinking about this, actually.  I'm wondering if we could provide
a list of files to the archiving callback (configurable via a variable in
ArchiveModuleState), and then have the callback return a list of files that
are archived.  (Or maybe we just put the list of files that need archiving
in ArchiveModuleState.)  The returned list could include files that were
sent to the callback previously.  The archive module would be responsible
for creating background worker(s) (if desired), dispatching files
to-be-archived to its background worker(s), and gathering the list of
archived files to return.

This is admittedly half-formed, but I'm tempted to hack something together
quickly to see whether it might be viable.
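
As a rough illustration only, here is a toy, single-process C sketch of
that batch-in/batch-out shape; ArchiveModuleState below is a hypothetical
stand-in for the real struct, and the trivial callback archives
everything synchronously instead of dispatching to background workers:

```c
#include <stddef.h>
#include <string.h>

#define BATCH_MAX 8

/* Toy stand-in, not the real ArchiveModuleState: core fills to_archive[]
 * before each call; the module fills archived[] with whatever is now
 * durably archived, possibly including files from earlier batches. */
typedef struct ArchiveModuleState
{
    const char *to_archive[BATCH_MAX];
    int         n_to_archive;
    const char *archived[BATCH_MAX];
    int         n_archived;
} ArchiveModuleState;

typedef void (*ArchiveBatchCB) (ArchiveModuleState *state);

/* Trivial synchronous module: everything handed over is "archived" at
 * once.  A real module would instead queue files to workers and report
 * completions on later calls. */
static void
toy_archive_batch(ArchiveModuleState *state)
{
    for (int i = 0; i < state->n_to_archive; i++)
        state->archived[state->n_archived++] = state->to_archive[i];
}
```

The point of the list-valued return is that "handed to the module" and
"marked done by core" become decoupled, which is what batching and
parallelism need.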

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-09 11:12:21 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2023 at 10:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'm fairly concerned about the idea of making it common for people
> > to write their own main loop for the archiver.  That means that, if
> > we have a bug fix that requires the archiver to do X, we will not
> > just be patching our own code but trying to get an indeterminate
> > set of third parties to add the fix to their code.

I'm somewhat concerned about that too, but perhaps from a different
angle. First, I think we don't do our users a service by defaulting the
in-core implementation to something that doesn't scale to even a moderately
busy server. Second, I doubt we'll get the API for any of this right, without
an actual user that does something more complicated than restoring one-by-one
in a blocking manner.


> I don't know what kind of bug we could really have in the main loop
> that would be common to every implementation. They're probably all
> going to check for interrupts, do some work, and then wait for I/O on
> some things by calling select() or some equivalent. But the work, and
> the wait for the I/O, would be different for every implementation. I
> would anticipate that the amount of common code would be nearly zero.

I don't think it's that hard to imagine problems. To be reasonably fast, a
decent restore implementation will have to 'restore ahead'. Which also
provides ample things to go wrong. E.g.

- WAL source is switched, restore module needs to react to that, but doesn't,
  we end up with lots of wasted work, or worse, filename conflicts
- recovery follows a timeline, restore module doesn't catch on quickly enough
- end of recovery happens, restore just continues on


> > If we think we need primitives to let the archiver hooks get all
> > the pending files, or whatever, by all means add those.  But don't
> > cede fundamental control of the archiver.  The hooks need to be
> > decoration on a framework we provide, not the framework themselves.
>
> I don't quite see how you can make asynchronous and parallel archiving
> work if the archiver process only calls into the archive module at
> times that it chooses. That would mean that the module has to return
> control to the archiver when it's in the middle of archiving one or
> more files -- and then I don't see how it can get control back at the
> appropriate time. Do you have a thought about that?

I don't think archiver is the hard part, that already has a dedicated
process, and it also has something of a queuing system already. The startup
process imo is the complicated one...

If we had a 'restorer' process, with startup feeding some sort of queue
of things to restore in the near future, it might be more realistic to
do something like what you describe?
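
For illustration, a toy single-process sketch of such a queue; a real
implementation would live in shared memory with latch-based waits
between the startup and restorer processes, so everything here is a
hypothetical stand-in:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define QUEUE_LEN 8

/* Ring buffer of segment names: startup pushes what it will need soon,
 * the restorer pops and restores ahead of replay. */
typedef struct RestoreQueue
{
    char    files[QUEUE_LEN][64];
    int     head;               /* next slot the restorer consumes */
    int     tail;               /* next slot startup fills */
} RestoreQueue;

static bool
restore_queue_push(RestoreQueue *q, const char *file)
{
    if ((q->tail + 1) % QUEUE_LEN == q->head)
        return false;           /* full: the restorer has fallen behind */
    strncpy(q->files[q->tail], file, sizeof(q->files[0]) - 1);
    q->files[q->tail][sizeof(q->files[0]) - 1] = '\0';
    q->tail = (q->tail + 1) % QUEUE_LEN;
    return true;
}

static bool
restore_queue_pop(RestoreQueue *q, char *out, size_t outlen)
{
    if (q->head == q->tail)
        return false;           /* empty: the restorer would wait on a latch */
    strncpy(out, q->files[q->head], outlen - 1);
    out[outlen - 1] = '\0';
    q->head = (q->head + 1) % QUEUE_LEN;
    return true;
}
```

A queue like this also gives core a natural place to handle the failure
modes above (timeline switch, WAL source switch, end of recovery) by
simply discarding or resetting the queued work.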

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
Here is a new version of the stopgap/back-branch fix for restore_command.
This is more or less a rebased version of v4 with an added stderr message
as Andres suggested upthread.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Weird failure with latches in curculio on v15

From
Robert Haas
Date:
On Thu, Feb 9, 2023 at 10:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I've been thinking about this, actually.  I'm wondering if we could provide
> a list of files to the archiving callback (configurable via a variable in
> ArchiveModuleState), and then have the callback return a list of files that
> are archived.  (Or maybe we just put the list of files that need archiving
> in ArchiveModuleState.)  The returned list could include files that were
> sent to the callback previously.  The archive module would be responsible
> for creating background worker(s) (if desired), dispatching files
> to-be-archived to its background worker(s), and gathering the list of
> archived files to return.

Hmm. So in this design, the archiver doesn't really do the archiving
any more, because the interface makes that impossible. It has to use a
separate background worker process for that, full stop.

I don't think that's a good design. It's fine if some people want to
implement it that way, but it shouldn't be forced by the interface.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Weird failure with latches in curculio on v15

From
Robert Haas
Date:
On Fri, Feb 10, 2023 at 12:59 AM Andres Freund <andres@anarazel.de> wrote:
> I'm somewhat concerned about that too, but perhaps from a different
> angle. First, I think we don't do our users a service by defaulting the
> in-core implementation to something that doesn't scale to even a moderately
> busy server.

+1.

> Second, I doubt we'll get the API for any of this right, without
> an actual user that does something more complicated than restoring one-by-one
> in a blocking manner.

Fair.

> I don't think it's that hard to imagine problems. To be reasonably fast, a
> decent restore implementation will have to 'restore ahead'. Which also
> provides ample opportunity for things to go wrong. E.g.
>
> - WAL source is switched, restore module needs to react to that, but doesn't,
>   we end up with lots of wasted work, or worse, filename conflicts
> - recovery follows a timeline, restore module doesn't catch on quickly enough
> - end of recovery happens, restore just continues on

I don't see how you can prevent those things from happening. If the
restore process is working in some way that requires an event loop,
and I think that will be typical for any kind of remote archiving,
then either it has control most of the time, so the event loop can be
run inside the restore process, or, as Nathan proposes, we don't let
the archiver have control and it needs to run that restore process in
a separate background worker. The hazards that you mention here exist
either way. If the event loop is running inside the restore process,
it can decide not to call the functions that we provide in a timely
fashion and thus fail to react as it should. If the event loop runs
inside a separate background worker, then that process can fail to be
responsive in precisely the same way. Fundamentally, if the author of
a restore module writes code to have multiple I/Os in flight at the
same time and does not write code to cancel those I/Os if something
changes, then such cancellation will not occur. That remains true no
matter which process is performing the I/O.

> > I don't quite see how you can make asynchronous and parallel archiving
> > work if the archiver process only calls into the archive module at
> > times that it chooses. That would mean that the module has to return
> > control to the archiver when it's in the middle of archiving one or
> > more files -- and then I don't see how it can get control back at the
> > appropriate time. Do you have a thought about that?
>
> I don't think archiver is the hard part, that already has a dedicated
> process, and it also has something of a queuing system already. The startup
> process imo is the complicated one...
>
> If we had a 'restorer' process, and startup fed some sort of a queue with
> things to restore in the near future, it might be more realistic to do
> something like you describe?

Some kind of queueing system might be a useful part of the interface,
and a dedicated restorer process does sound like a good idea. But the
archiver doesn't have this solved, precisely because you have to
archive a single file, return control, and wait to be invoked again
for the next file. That does not scale.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-16 15:18:57 +0530, Robert Haas wrote:
> On Fri, Feb 10, 2023 at 12:59 AM Andres Freund <andres@anarazel.de> wrote:
> > I don't think it's that hard to imagine problems. To be reasonably fast, a
> > decent restore implementation will have to 'restore ahead'. Which also
> > provides ample opportunity for things to go wrong. E.g.
> >
> > - WAL source is switched, restore module needs to react to that, but doesn't,
> >   we end up with lots of wasted work, or worse, filename conflicts
> > - recovery follows a timeline, restore module doesn't catch on quickly enough
> > - end of recovery happens, restore just continues on
> 
> I don't see how you can prevent those things from happening. If the
> restore process is working in some way that requires an event loop,
> and I think that will be typical for any kind of remote archiving,
> then either it has control most of the time, so the event loop can be
> run inside the restore process, or, as Nathan proposes, we don't let
> the archiver have control and it needs to run that restore process in
> a separate background worker. The hazards that you mention here exist
> either way. If the event loop is running inside the restore process,
> it can decide not to call the functions that we provide in a timely
> fashion and thus fail to react as it should. If the event loop runs
> inside a separate background worker, then that process can fail to be
> responsive in precisely the same way. Fundamentally, if the author of
> a restore module writes code to have multiple I/Os in flight at the
> same time and does not write code to cancel those I/Os if something
> changes, then such cancellation will not occur. That remains true no
> matter which process is performing the I/O.

IDK. I think we can make that easier or harder. Right now the proposed API
doesn't provide anything to allow to address this.


> > > I don't quite see how you can make asynchronous and parallel archiving
> > > work if the archiver process only calls into the archive module at
> > > times that it chooses. That would mean that the module has to return
> > > control to the archiver when it's in the middle of archiving one or
> > > more files -- and then I don't see how it can get control back at the
> > > appropriate time. Do you have a thought about that?
> >
> > I don't think archiver is the hard part, that already has a dedicated
> > process, and it also has something of a queuing system already. The startup
> > process imo is the complicated one...
> >
> > If we had a 'restorer' process, and startup fed some sort of a queue with
> > things to restore in the near future, it might be more realistic to do
> > something like you describe?
> 
> Some kind of queueing system might be a useful part of the interface,
> and a dedicated restorer process does sound like a good idea. But the
> archiver doesn't have this solved, precisely because you have to
> archive a single file, return control, and wait to be invoked again
> for the next file. That does not scale.

But there's nothing inherent in that. We know for certain which files we're
going to archive, and we don't need to work one-by-one. The archiver could
just start multiple subprocesses at the same time. All the blocking it does
right now is artificially imposed by the use of system(). We could instead
use something popen()-like and have a configurable number of processes
running at the same time.
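For concreteness, that shape can be sketched like this (Python for brevity; the function name, the concurrency cap, and the `true` stand-in for an archive command are all illustrative assumptions, not anything in PostgreSQL):

```python
import subprocess

def archive_batch(segments, command=("true",), max_parallel=3):
    """Run up to max_parallel archive commands concurrently, instead of
    blocking in system() for one segment at a time."""
    pending = list(segments)
    inflight = {}                      # Popen handle -> segment name
    done = []
    while pending or inflight:
        # launch commands until the concurrency cap is reached
        while pending and len(inflight) < max_parallel:
            seg = pending.pop(0)
            inflight[subprocess.Popen(command + (seg,))] = seg
        # reap one command (a real archiver would poll all of them);
        # only at this point would the .done file be created
        proc = next(iter(inflight))
        if proc.wait() != 0:
            raise RuntimeError("archive command failed for " + inflight[proc])
        done.append(inflight.pop(proc))
    return done

print(archive_batch(["%024X" % n for n in range(1, 5)]))
```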

What I was trying to point out was that the work a "restorer" process has to
do is more speculative, because we don't know when we'll promote, whether
we'll follow a timeline increase, whether the to-be-restored WAL already
exists. That's solvable, but a bunch of the relevant work ought to be solved
in core code, instead of just in archive modules.

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Thu, Feb 16, 2023 at 03:08:14PM +0530, Robert Haas wrote:
> On Thu, Feb 9, 2023 at 10:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I've been thinking about this, actually.  I'm wondering if we could provide
>> a list of files to the archiving callback (configurable via a variable in
>> ArchiveModuleState), and then have the callback return a list of files that
>> are archived.  (Or maybe we just put the list of files that need archiving
>> in ArchiveModuleState.)  The returned list could include files that were
>> sent to the callback previously.  The archive module would be responsible
>> for creating background worker(s) (if desired), dispatching files
>> to-be-archived to its background worker(s), and gathering the list of
>> archived files to return.
> 
> Hmm. So in this design, the archiver doesn't really do the archiving
> any more, because the interface makes that impossible. It has to use a
> separate background worker process for that, full stop.
> 
> I don't think that's a good design. It's fine if some people want to
> implement it that way, but it shouldn't be forced by the interface.

I don't think it would force you to use a background worker, but if you
wanted to, the tools would be available.  At least, that is the intent.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Robert Haas
Date:
On Thu, Feb 16, 2023 at 10:28 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> > Hmm. So in this design, the archiver doesn't really do the archiving
> > any more, because the interface makes that impossible. It has to use a
> > separate background worker process for that, full stop.
> >
> > I don't think that's a good design. It's fine if some people want to
> > implement it that way, but it shouldn't be forced by the interface.
>
> I don't think it would force you to use a background worker, but if you
> wanted to, the tools would be available.  At least, that is the intent.

I'm 100% amenable to somebody demonstrating how that is super easy,
barely an inconvenience. But I think we would need to see some code
showing at least what the API is going to look like, and ideally a
sample implementation, in order for me to be convinced of that. What I
suspect is that if somebody tries to do that they are going to find
that the core API has to be quite opinionated about how the archive
module has to do things, which I think is not what we want. But if
that turns out to be false, cool!

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Weird failure with latches in curculio on v15

From
Robert Haas
Date:
On Thu, Feb 16, 2023 at 10:02 PM Andres Freund <andres@anarazel.de> wrote:
> But there's nothing inherent in that. We know for certain which files we're
> going to archive. And we don't need to work one-by-one. The archiver could
> just start multiple subprocesses at the same time.

But what if it doesn't want to start multiple processes, just
multiplex within a single process?

> What I was trying to point out was that the work a "restorer" process has to
> do is more speculative, because we don't know when we'll promote, whether
> we'll follow a timeline increase, whether the to-be-restored WAL already
> exists. That's solvable, but a bunch of the relevant work ought to be solved
> in core code, instead of just in archive modules.

Yep, I can see that there are some things to figure out there, and I
agree that they should be figured out in the core code.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-18 15:51:06 +0530, Robert Haas wrote:
> On Thu, Feb 16, 2023 at 10:02 PM Andres Freund <andres@anarazel.de> wrote:
> > But there's nothing inherent in that. We know for certain which files we're
> > going to archive. And we don't need to work one-by-one. The archiver could
> > just start multiple subprocesses at the same time.
> 
> But what if it doesn't want to start multiple processes, just
> multiplex within a single process?

To me that seems even simpler? Nothing but the archiver is supposed to create
.done files and nothing is supposed to remove .ready files without archiver
having created the .done files.  So the archiver process can scan
archive_status until it's done or until N archives have been collected, and
then process them at once?  Only the creation of the .done files would be
serial, but I don't think that's commonly a problem (and could be optimized as
well, by creating multiple files and then fsyncing them in a second pass,
avoiding N filesystem journal flushes).
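That two-pass idea might look roughly like this (Python for brevity; the directory layout mimics archive_status marker files, but nothing here is PostgreSQL code):

```python
import os, tempfile

def mark_done_batch(status_dir, segments):
    """Create all the .done marker files for a batch first, then fsync
    them in a second pass, so the filesystem journal flushes can be
    batched instead of paying one synchronous flush per segment."""
    paths = [os.path.join(status_dir, seg + ".done") for seg in segments]
    for path in paths:
        with open(path, "w"):
            pass                        # empty marker file
    # second pass: make the markers and their directory entries durable
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        os.fsync(fd)
        os.close(fd)
    dirfd = os.open(status_dir, os.O_RDONLY)
    os.fsync(dirfd)                     # persist the new directory entries
    os.close(dirfd)
    return paths

demo_dir = tempfile.mkdtemp()
created = mark_done_batch(demo_dir, ["000000010000000000000001",
                                     "000000010000000000000002"])
```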

Maybe I am misunderstanding what you see as the problem?

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Robert Haas
Date:
On Sun, Feb 19, 2023 at 2:45 AM Andres Freund <andres@anarazel.de> wrote:
> To me that seems even simpler? Nothing but the archiver is supposed to create
> .done files and nothing is supposed to remove .ready files without archiver
> having created the .done files.  So the archiver process can scan
> archive_status until it's done or until N archives have been collected, and
> then process them at once?  Only the creation of the .done files would be
> serial, but I don't think that's commonly a problem (and could be optimized as
> well, by creating multiple files and then fsyncing them in a second pass,
> avoiding N filesystem journal flushes).
>
> Maybe I am misunderstanding what you see as the problem?

Well right now the archiver process calls ArchiveFileCB when there's a
file ready for archiving, and that process is supposed to archive the
whole thing before returning. That pretty obviously seems to preclude
having more than one file being archived at the same time. What
callback structure do you have in mind to allow for that?

I mean, my idea was to basically just have one big callback:
ArchiverModuleMainLoopCB(). Which wouldn't return, or perhaps, would
only return when archiving was totally caught up and there was nothing
more to do right now. And then that callback could call functions like
AreThereAnyMoreFilesIShouldBeArchivingAndIfYesWhatIsTheNextOne(). So
it would call that function and it would find out about a file and
start an HTTP session or whatever and then call that function again
and start another HTTP session for the second file and so on until it
had as much concurrency as it wanted. And then when it hit the
concurrency limit, it would wait until at least one HTTP request
finished. At that point it would call
HeyEverybodyISuccessfullyArchivedAWalFile(), after which it could
again ask for the next file and start a request for that one and so on
and so forth.
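That loop might look roughly like this (a toy rendering in Python; the comments carry the long hypothetical names above, and the concurrent requests are simulated rather than real HTTP):

```python
import collections

class FakeCore:
    """Stands in for the core-side API the module's main loop would call."""
    def __init__(self, files):
        self.ready = collections.deque(files)
        self.archived = []

    def next_file(self):
        # AreThereAnyMoreFilesIShouldBeArchivingAndIfYesWhatIsTheNextOne()
        return self.ready.popleft() if self.ready else None

    def report_done(self, name):
        # HeyEverybodyISuccessfullyArchivedAWalFile()
        self.archived.append(name)

def archiver_module_main_loop(core, max_concurrency=2):
    """Returns only when archiving is fully caught up."""
    inflight = []
    while True:
        # start simulated HTTP requests up to the concurrency limit
        while len(inflight) < max_concurrency:
            name = core.next_file()
            if name is None:
                break
            inflight.append(name)
        if not inflight:
            return                      # fully caught up
        # wait for at least one request to finish (here: the oldest)
        core.report_done(inflight.pop(0))

core = FakeCore(["A", "B", "C"])
archiver_module_main_loop(core)
print(core.archived)
```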

I don't really understand what the other possible model is here,
honestly. Right now, control remains within the archive module for the
entire time that a file is being archived. If we generalize the model
to allow multiple files to be in the process of being archived at the
same time, the archive module is going to need to have control as long
as >= 1 of them are in progress, at least AFAICS. If you have some
other idea how it would work, please explain it to me...

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Weird failure with latches in curculio on v15

From
Michael Paquier
Date:
On Sun, Feb 19, 2023 at 08:06:24PM +0530, Robert Haas wrote:
> I mean, my idea was to basically just have one big callback:
> ArchiverModuleMainLoopCB(). Which wouldn't return, or perhaps, would
> only return when archiving was totally caught up and there was nothing
> more to do right now. And then that callback could call functions like
> AreThereAnyMoreFilesIShouldBeArchivingAndIfYesWhatIsTheNextOne(). So
> it would call that function and it would find out about a file and
> start an HTTP session or whatever and then call that function again
> and start another HTTP session for the second file and so on until it
> had as much concurrency as it wanted. And then when it hit the
> concurrency limit, it would wait until at least one HTTP request
> finished. At that point it would call
> HeyEverybodyISuccessfullyArchivedAWalFile(), after which it could
> again ask for the next file and start a request for that one and so on
> and so forth.

This archiving implementation is not completely impossible with the
current API infrastructure, either?  If you consider archiving as a
two-step process where segments are first copied into a cheap, reliable
area, these could then be pushed in bulk to a more remote area like an
S3 bucket?  Of course this depends on other things like
the cluster structure, but redundancy can be added with standby
archiving, as well.
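The two-step flow reduces to something like this (illustrative only; `upload` stands in for whatever S3-style client call one would really use):

```python
import os, shutil, tempfile

def stage_segment(seg_path, spool_dir):
    """Step 1: copy the finished segment into a cheap-but-reliable
    local staging area."""
    dest = os.path.join(spool_dir, os.path.basename(seg_path))
    shutil.copy2(seg_path, dest)
    return dest

def push_staged(spool_dir, upload):
    """Step 2: push everything staged so far in one bulk operation."""
    staged = sorted(os.listdir(spool_dir))
    upload(staged)                      # one round trip for many segments
    for name in staged:
        os.unlink(os.path.join(spool_dir, name))
    return staged

wal_dir, stage_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
seg = os.path.join(wal_dir, "000000010000000000000001")
with open(seg, "wb") as f:
    f.write(b"\x00" * 16)               # stand-in for a 16MB segment
stage_segment(seg, stage_dir)
pushed = push_staged(stage_dir, upload=lambda names: None)
```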

I am not sure exactly how many requirements we want to push into a
callback, to be honest, and surely more requirements pushed to the
callback increase the odds of implementation mistakes, like a full
loop.  There are already many ways to get it wrong with archiving, like
missing a flush of the archived segment before the callback returns to
ensure its durability..
--
Michael

Attachment

Re: Weird failure with latches in curculio on v15

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Sun, Feb 19, 2023 at 08:06:24PM +0530, Robert Haas wrote:
> > I mean, my idea was to basically just have one big callback:
> > ArchiverModuleMainLoopCB(). Which wouldn't return, or perhaps, would
> > only return when archiving was totally caught up and there was nothing
> > more to do right now. And then that callback could call functions like
> > AreThereAnyMoreFilesIShouldBeArchivingAndIfYesWhatIsTheNextOne(). So
> > it would call that function and it would find out about a file and
> > start an HTTP session or whatever and then call that function again
> > and start another HTTP session for the second file and so on until it
> > had as much concurrency as it wanted. And then when it hit the
> > concurrency limit, it would wait until at least one HTTP request
> > finished. At that point it would call
> > HeyEverybodyISuccessfullyArchivedAWalFile(), after which it could
> > again ask for the next file and start a request for that one and so on
> > and so forth.
>
> This archiving implementation is not completely impossible with the
> current API infrastructure, either?  If you consider archiving as a
> two-step process where segments are first copied into a cheap, reliable
> area, these could then be pushed in bulk to a more remote area like an
> S3 bucket?  Of course this depends on other things like
> the cluster structure, but redundancy can be added with standby
> archiving, as well.

Surely it can't be too cheap, as it needs to be reliable..  We have
looked at this before (copying to a queue area before copying
off-system with a separate process) and it simply isn't great: it
requires more work than you really want to take on if you can help it,
and for no real benefit.

> I am not sure exactly how many requirements we want to push into a
> callback, to be honest, and surely more requirements pushed to the
> callback increase the odds of implementation mistakes, like a full
> loop.  There are already many ways to get it wrong with archiving, like
> missing a flush of the archived segment before the callback returns to
> ensure its durability..

Without any actual user of any of this it's surprising to me how much
effort has been put into it.  Have I missed the part where someone has
said they're actually implementing an archive library that we can look
at and see how it works and how the archive library and the core system
could work better together..?

We (pgbackrest) are generally interested in the idea to reduce the
startup time, but that's not really a big issue for us currently and so
it hasn't really risen up to the level of being something we're working
on, not to mention that if it keeps changing each release then it's just
going to end up being more work for us for a feature that doesn't gain
us all that much.

Now, all that said, at least in initial discussions, we expect the
pgbackrest archive_library to look very similar to how we handle
archive_command and async archiving today: when called, if there are
multiple WAL files to process, we fork an async process off and it goes
and spawns multiple processes and does its work to move the WAL files
to the off-system storage.  When we are called via archive_command we
just check a status flag to see if that WAL has been archived yet by
the async process or not.  If not, and there's no async process
running, then we'll start a new one (starting a new async process
periodically actually makes things a lot easier to test for us too,
which is why we don't just have an async process running around
forever; the startup time typically isn't that big of a deal).  If
there is a status flag then we return whatever it says, and if the
async process is running and there's no status flag yet then we wait.
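The status-flag part of that scheme reduces to something like this (a deliberate simplification, not actual pgbackrest code; the `.ok`/`.error` spool-file naming is invented for the sketch):

```python
import os, tempfile

def archive_command_status(spool_dir, wal):
    """An async worker drops <wal>.ok or <wal>.error into a spool
    directory as it finishes each segment; the archive_command wrapper
    just inspects those flags."""
    if os.path.exists(os.path.join(spool_dir, wal + ".ok")):
        return "done"       # archived: postgres may recycle the segment
    if os.path.exists(os.path.join(spool_dir, wal + ".error")):
        return "failed"     # postgres will retry this segment later
    return "pending"        # worker still busy (or needs to be started)

spool = tempfile.mkdtemp()
open(os.path.join(spool, "000000010000000000000001.ok"), "w").close()
```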

Once we have that going then perhaps there could be some interesting
iteration between pgbackrest and the core code to improve things, but
all this discussion and churn feels more likely to put folks off of
trying to implement something using this approach than the opposite,
unless someone in this discussion is actually working on an archive
library, but that isn't the impression I've gotten, at least (though if
there is such a work in progress out there, I'd love to see it!).

Thanks,

Stephen

Attachment

Re: Weird failure with latches in curculio on v15

From
Michael Paquier
Date:
On Tue, Feb 14, 2023 at 09:47:55AM -0800, Nathan Bossart wrote:
> Here is a new version of the stopgap/back-branch fix for restore_command.
> This is more or less a rebased version of v4 with an added stderr message
> as Andres suggested upthread.

So, this thread has moved around many subjects, but we still did not
get to the core of the issue, which is what we should do to avoid
sporadic failures like the one mentioned at the top of the thread.

Perhaps beginning a new thread with a patch and a summary would be
better at this stage?  Another thing I am wondering is if it could be
possible to test that rather reliably.  I have been playing with a few
scenarios like holding the system() call for a bit with hardcoded
sleep()s, without much success.  I'll try harder on that part..  It's
been mentioned as well that we could just move away from system() in
the long-term.
--
Michael

Attachment

Re: Weird failure with latches in curculio on v15

From
Thomas Munro
Date:
On Tue, Feb 21, 2023 at 1:03 PM Michael Paquier <michael@paquier.xyz> wrote:
> Perhaps beginning a new thread with a patch and a summary would be
> better at this stage?  Another thing I am wondering is if it could be
> possible to test that rather reliably.  I have been playing with a few
> scenarios like holding the system() call for a bit with hardcoded
> sleep()s, without much success.  I'll try harder on that part..  It's
> been mentioned as well that we could just move away from system() in
> the long-term.

I've been experimenting with ideas for master, which I'll start a new
thread about.  Actually I was already thinking about this before this
broken signal handler stuff came up, because it was already
unacceptable that all these places that are connected to shared memory
ignore interrupts for unbounded time while a shell script/whatever
runs.  At first I thought it would be relatively simple to replace
system() with something that has a latch wait loop (though there are
details to argue about, like whether you want to allow interrupts that
throw, and if so, how you clean up the subprocess, which have several
plausible solutions).  But once I started looking at the related
popen-based stuff where you want to communicate with the subprocess
(for example COPY FROM PROGRAM), I realised that it needs more
analysis and work: that stuff is currently entirely based on stdio
FILE (that is, fread() and fwrite()), but it's not really possible (at
least portably) to make that nonblocking, and in fact it's a pretty
terrible interface in terms of error reporting in general.  I've been
sketching/thinking about a new module called 'subprocess', with a
couple of ways to start processes, and interact with them via
WaitEventSet and direct pipe I/O; or if buffering is needed, it'd be
our own, not <stdio.h>'s.  But don't let me stop anyone else proposing
ideas.
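The direct-pipe-I/O shape being described, minus the WaitEventSet integration, can be sketched like this (Python for brevity; the real module would be C and the wait would also watch the process latch):

```python
import os, select, subprocess

def read_subprocess_output(argv, timeout_ms=5000):
    """Collect a child's stdout with poll() plus nonblocking reads: the
    kind of loop that stdio's fread() cannot easily take part in, since
    it would block inside the library with no way to watch other events."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    fd = proc.stdout.fileno()
    os.set_blocking(fd, False)           # no fread()-style blocking
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    chunks = []
    while True:
        # in the real thing this wait would be a WaitEventSet so that
        # interrupts stay responsive while the subprocess runs
        if not poller.poll(timeout_ms):
            proc.kill()
            raise TimeoutError("subprocess produced no output in time")
        data = os.read(fd, 8192)
        if not data:                     # EOF: child closed its end
            break
        chunks.append(data)
    proc.wait()
    return b"".join(chunks)

print(read_subprocess_output(["echo", "hello"]))
```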



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Tue, Feb 21, 2023 at 09:03:27AM +0900, Michael Paquier wrote:
> On Tue, Feb 14, 2023 at 09:47:55AM -0800, Nathan Bossart wrote:
>> Here is a new version of the stopgap/back-branch fix for restore_command.
>> This is more or less a rebased version of v4 with an added stderr message
>> as Andres suggested upthread.
> 
> So, this thread has moved around many subjects, but we still did not
> get to the core of the issue, which is what we should do to avoid
> sporadic failures like the one mentioned at the top of the thread.
> 
> Perhaps beginning a new thread with a patch and a summary would be
> better at this stage?  Another thing I am wondering is if it could be
> possible to test that rather reliably.  I have been playing with a few
> scenarios like holding the system() call for a bit with hardcoded
> sleep()s, without much success.  I'll try harder on that part..  It's
> been mentioned as well that we could just move away from system() in
> the long-term.

I'm happy to create a new thread if needed, but I can't tell if there is
any interest in this stopgap/back-branch fix.  Perhaps we should just jump
straight to the long-term fix that Thomas is looking into.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Thomas Munro
Date:
On Tue, Feb 21, 2023 at 5:50 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Tue, Feb 21, 2023 at 09:03:27AM +0900, Michael Paquier wrote:
> > Perhaps beginning a new thread with a patch and a summary would be
> > better at this stage?  Another thing I am wondering is if it could be
> > possible to test that rather reliably.  I have been playing with a few
> > scenarios like holding the system() call for a bit with hardcoded
> > sleep()s, without much success.  I'll try harder on that part..  It's
> > been mentioned as well that we could just move away from system() in
> > the long-term.
>
> I'm happy to create a new thread if needed, but I can't tell if there is
> any interest in this stopgap/back-branch fix.  Perhaps we should just jump
> straight to the long-term fix that Thomas is looking into.

Unfortunately the latch-friendly subprocess module proposal I was
talking about would be for 17.  I may post a thread fairly soon with
design ideas + list of problems and decision points as I see them, and
hopefully some sketch code, but it won't be a proposal for [/me checks
calendar] next week's commitfest and probably wouldn't be appropriate
in a final commitfest anyway, and I also have some other existing
stuff to clear first.  So please do continue with the stopgap ideas.

BTW Here's an idea (untested) about how to reproduce the problem.  You
could copy the source from a system() implementation, call it
doomed_system(), and insert kill(-getppid(), SIGQUIT) in between
sigprocmask(SIG_SETMASK, &omask, NULL) and exec*().  Parent and self
will handle the signal and both reach the proc_exit().
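A harmless analogue of that reproduction idea (SIGUSR1 in place of SIGQUIT, and explicit kills of parent and self in place of the process-group kill, so nothing outside the demo is signalled):

```python
import os, signal, time

def demo_double_delivery():
    """Both parent and child run the handler when the child signals just
    before the point where exec() would replace its image."""
    got = {"n": 0}
    signal.signal(signal.SIGUSR1, lambda *_: got.update(n=got["n"] + 1))
    pid = os.fork()
    if pid == 0:
        # the reproduction sketch uses kill(-getppid(), SIGQUIT);
        # signalling parent and self explicitly shows the same double
        # delivery without touching unrelated processes
        os.kill(os.getppid(), signal.SIGUSR1)
        os.kill(os.getpid(), signal.SIGUSR1)
        os._exit(0 if got["n"] else 1)   # child's inherited handler ran
    _, status = os.waitpid(pid, 0)
    deadline = time.time() + 5
    while got["n"] == 0 and time.time() < deadline:
        time.sleep(0.01)                 # delivery to the parent is async
    signal.signal(signal.SIGUSR1, signal.SIG_DFL)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0 and got["n"] > 0
```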

The systems that failed are running code like this:

https://github.com/openbsd/src/blob/master/lib/libc/stdlib/system.c
https://github.com/DragonFlyBSD/DragonFlyBSD/blob/master/lib/libc/stdlib/system.c

I'm pretty sure these other implementations could fail in just the
same way (they restore the handler before unblocking, so can run it
just before exec() replaces the image):

https://github.com/freebsd/freebsd-src/blob/main/lib/libc/stdlib/system.c
https://github.com/lattera/glibc/blob/master/sysdeps/posix/system.c

The glibc one is a bit busier and, huh, has a lock (I guess maybe
deadlockable if proc_exit() also calls system(), but hopefully it
doesn't), and uses fork() instead of vfork() but I don't think that's
a material difference here (with fork(), parent and child run
concurrently, while with vfork() the parent is suspended until the
child exits or execs, and then processes its pending signals).



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Wed, Feb 22, 2023 at 09:48:10PM +1300, Thomas Munro wrote:
> On Tue, Feb 21, 2023 at 5:50 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Tue, Feb 21, 2023 at 09:03:27AM +0900, Michael Paquier wrote:
>> > Perhaps beginning a new thread with a patch and a summary would be
>> > better at this stage?  Another thing I am wondering is if it could be
>> > possible to test that rather reliably.  I have been playing with a few
>> > scenarios like holding the system() call for a bit with hardcoded
>> > sleep()s, without much success.  I'll try harder on that part..  It's
>> > been mentioned as well that we could just move away from system() in
>> > the long-term.
>>
>> I'm happy to create a new thread if needed, but I can't tell if there is
>> any interest in this stopgap/back-branch fix.  Perhaps we should just jump
>> straight to the long-term fix that Thomas is looking into.
> 
> Unfortunately the latch-friendly subprocess module proposal I was
> talking about would be for 17.  I may post a thread fairly soon with
> design ideas + list of problems and decision points as I see them, and
> hopefully some sketch code, but it won't be a proposal for [/me checks
> calendar] next week's commitfest and probably wouldn't be appropriate
> in a final commitfest anyway, and I also have some other existing
> stuff to clear first.  So please do continue with the stopgap ideas.

I've created a new thread for the stopgap fix [0].

[0] https://postgr.es/m/20230223231503.GA743455%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Andres Freund
Date:
Hi,

On 2023-02-19 20:06:24 +0530, Robert Haas wrote:
> On Sun, Feb 19, 2023 at 2:45 AM Andres Freund <andres@anarazel.de> wrote:
> > To me that seems even simpler? Nothing but the archiver is supposed to create
> > .done files and nothing is supposed to remove .ready files without archiver
> > having created the .done files.  So the archiver process can scan
> > archive_status until it's done or until N archives have been collected, and
> > then process them at once?  Only the creation of the .done files would be
> > serial, but I don't think that's commonly a problem (and could be optimized as
> > well, by creating multiple files and then fsyncing them in a second pass,
> > avoiding N filesystem journal flushes).
> >
> > Maybe I am misunderstanding what you see as the problem?
> 
> Well right now the archiver process calls ArchiveFileCB when there's a
> file ready for archiving, and that process is supposed to archive the
> whole thing before returning. That pretty obviously seems to preclude
> having more than one file being archived at the same time. What
> callback structure do you have in mind to allow for that?

TBH, I think the current archive and restore module APIs aren't useful. I
think it was a mistake to add archive modules without having demonstrated that
one can do something useful with them that the restore_command didn't already
do. If anything, archive modules have made it harder to improve archiving
performance via concurrency.

My point was that it's easy to have multiple archive commands in progress at
the same time, because we already have a queuing system, and
archive_command is entirely compatible with doing that, because running multiple
subprocesses is pretty trivial. It wasn't that the archive API is suitable for
that.


> I mean, my idea was to basically just have one big callback:
> ArchiverModuleMainLoopCB(). Which wouldn't return, or perhaps, would
> only return when archiving was totally caught up and there was nothing
> more to do right now. And then that callback could call functions like
> AreThereAnyMoreFilesIShouldBeArchivingAndIfYesWhatIsTheNextOne(). So
> it would call that function and it would find out about a file and
> start an HTTP session or whatever and then call that function again
> and start another HTTP session for the second file and so on until it
> had as much concurrency as it wanted. And then when it hit the
> concurrency limit, it would wait until at least one HTTP request
> finished. At that point it would call
> HeyEverybodyISuccessfullyArchivedAWalFile(), after which it could
> again ask for the next file and start a request for that one and so on
> and so forth.

> I don't really understand what the other possible model is here,
> honestly. Right now, control remains within the archive module for the
> entire time that a file is being archived. If we generalize the model
> to allow multiple files to be in the process of being archived at the
> same time, the archive module is going to need to have control as long
> as >= 1 of them are in progress, at least AFAICS. If you have some
> other idea how it would work, please explain it to me...

I don't think that a main loop approach is the only viable one. It might be
the most likely to succeed one though. As an alternative, consider something
like

struct ArchiveFileState {
   int fd;
   enum WaitFor { READ, WRITE, CONNECT } wait_for;
   void *file_private;
};

typedef bool (*ArchiveFileStartCB)(ArchiveModuleState *state,
   ArchiveFileState *file_state,
   const char *file, const char *path);

typedef bool (*ArchiveFileContinueCB)(ArchiveModuleState *state,
   ArchiveFileState *file_state);

An archive module could open an HTTP connection, do IO until it's blocked, put
the fd in file_state, and return. The main loop could run a big event loop around
all of the file descriptors, and whenever any of the FDs signals that IO is
ready, call ArchiveFileContinueCB() for that file.

I don't know if that's better than ArchiverModuleMainLoopCB(). I can see both
advantages and disadvantages.

Greetings,

Andres Freund



Re: Weird failure with latches in curculio on v15

From
Nathan Bossart
Date:
On Sat, Feb 25, 2023 at 11:00:31AM -0800, Andres Freund wrote:
> TBH, I think the current archive and restore module APIs aren't useful. I
> think it was a mistake to add archive modules without having demonstrated that
> one can do something useful with them that the restore_command didn't already
> do. If anything, archive modules have made it harder to improve archiving
> performance via concurrency.

I must respectfully disagree that this work is useless.  Besides the
performance and security benefits of not shelling out for every WAL file,
I've found it very useful to be able to use the standard module framework
to develop archive modules.  It's relatively easy to make use of GUCs,
background workers, compression, etc.  Of course, there is room for
improvement in areas like concurrency support as you rightly point out, but
I don't think that makes the current state worthless.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Weird failure with latches in curculio on v15

From
Stephen Frost
Date:
Greetings,

* Nathan Bossart (nathandbossart@gmail.com) wrote:
> On Sat, Feb 25, 2023 at 11:00:31AM -0800, Andres Freund wrote:
> > TBH, I think the current archive and restore module APIs aren't useful. I
> > think it was a mistake to add archive modules without having demonstrated that
> > one can do something useful with them that the restore_command didn't already
> > do. If anything, archive modules have made it harder to improve archiving
> > performance via concurrency.
>
> I must respectfully disagree that this work is useless.  Besides the
> performance and security benefits of not shelling out for every WAL file,
> I've found it very useful to be able to use the standard module framework
> to develop archive modules.  It's relatively easy to make use of GUCs,
> background workers, compression, etc.  Of course, there is room for
> improvement in areas like concurrency support as you rightly point out, but
> I don't think that makes the current state worthless.

It would be great to see these archive modules; perhaps they would help show
how this functionality is useful, and what could be done in core to make
things easier for archive modules.

Thanks,

Stephen

Attachment

Re: Weird failure with latches in curculio on v15

From
Robert Haas
Date:
On Tue, Feb 28, 2023 at 11:29 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> On Sat, Feb 25, 2023 at 11:00:31AM -0800, Andres Freund wrote:
> > TBH, I think the current archive and restore module APIs aren't useful. I
> > think it was a mistake to add archive modules without having demonstrated that
> > one can do something useful with them that the restore_command didn't already
> > do. If anything, archive modules have made it harder to improve archiving
> > performance via concurrency.
>
> I must respectfully disagree that this work is useless.  Besides the
> performance and security benefits of not shelling out for every WAL file,
> I've found it very useful to be able to use the standard module framework
> to develop archive modules.  It's relatively easy to make use of GUCs,
> background workers, compression, etc.  Of course, there is room for
> improvement in areas like concurrency support as you rightly point out, but
> I don't think that makes the current state worthless.

I also disagree with Andres. The status quo ante was that we did not
provide any way of doing archiving correctly even to a directory on
the local machine. We could only recommend silly things like 'cp' that
are incorrect in multiple ways. basic_archive isn't the most wonderful
thing ever, and its deficiencies are more obvious to me now than they
were when I committed it. But it's better than recommending a shell
command that doesn't even work.

--
Robert Haas
EDB: http://www.enterprisedb.com