Thread: BUG #17577: pg_ctl promote is not preemptive in archive recovery

BUG #17577: pg_ctl promote is not preemptive in archive recovery

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17577
Logged by:          Daniel Farina
Email address:      daniel@fdr.io
PostgreSQL version: 14.4
Operating system:   AlmaLinux 8.6
Description:

Reproduction:

1) Back up a new, empty initdb server
2) Run pgbench in mixed mode for a while to generate WAL, possibly for
   a long time.
3) Set up a replica

   (Side note: In my case, though it may or may not be important, I
   also have a primary_conninfo defined and standby.signal. The
   primary_conninfo is not used, however, as the server never catches
   up enough to do that before I run pg_ctl promote).
4) Wait for consistency. It should take a short while given the backup
   is of an empty database.
5) Try to run pg_ctl promote while the server is in archive restore.
6) It will block until timeout, and not promote until restore_command
   exits abnormally.
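
Steps 5 and 6 can be timed with a small script. This is only a sketch under stated assumptions: pg_ctl is on PATH, and the data directory path and timeout are placeholders rather than values from the report. The helper names are hypothetical; the -D, -w and -t options are standard pg_ctl options.

```python
import subprocess
import time


def build_promote_command(data_dir, timeout_s):
    """Build the pg_ctl invocation: wait for promotion, give up after timeout_s seconds."""
    return ["pg_ctl", "promote", "-D", str(data_dir), "-w", "-t", str(timeout_s)]


def time_promote(data_dir, timeout_s=60):
    """Run pg_ctl promote and return (exit code, elapsed seconds).

    Assumption: a standby is already running from data_dir and is still
    fetching WAL through restore_command, as in step 5.
    """
    start = time.monotonic()
    result = subprocess.run(
        build_promote_command(data_dir, timeout_s),
        capture_output=True,
        text=True,
    )
    return result.returncode, time.monotonic() - start


if __name__ == "__main__":
    # "/path/to/replica" is a placeholder, not a path from the report.
    code, elapsed = time_promote("/path/to/replica", timeout_s=60)
    print(f"pg_ctl exit code {code} after {elapsed:.1f}s")
```

An elapsed time close to the timeout would match the blocking described in step 6.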

Other notes:

upon running pg_ctl promote, the "promote" file is written, but the
"server has received promote request" message is not written to the
logs.

A workaround:

Killing the restore_command, i.e. injecting a non-zero exit code, will
cause postgres to print the "has received promote request" message and
go through promotion.
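
The same effect might be had without killing anything by wrapping restore_command in a script that reports failure once the "promote" file exists. This is a sketch, not a tested recipe. It assumes the "promote" file is still in the data directory when the next restore_command runs, and that exit status 1 is handled like any other "segment not in archive" result. The script name, argument order and archive layout are hypothetical. It cannot interrupt a restore_command that is already running.

```python
import shutil
import sys
from pathlib import Path


def restore_segment(archive_dir, data_dir, wal_name, dest_path):
    """Return the exit status a restore_command wrapper would report.

    0 means the segment was copied into place; 1 means "not available",
    either because promotion was requested or the archive lacks the file.
    """
    # Assumption: pg_ctl promote has left a file named "promote" in data_dir.
    if (Path(data_dir) / "promote").exists():
        return 1
    source = Path(archive_dir) / wal_name
    if not source.is_file():
        return 1
    shutil.copyfile(source, dest_path)
    return 0


if __name__ == "__main__":
    # Hypothetical configuration line (paths are placeholders):
    #   restore_command = 'python3 restore_wrapper.py /archive /pgdata %f %p'
    sys.exit(restore_segment(*sys.argv[1:5]))
```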

Probable cause: something is not checking for pg_ctl promote having
been run as often as it should when WAL is being sourced from
restore_command, but it does get checked when postgres does its
expected actions when receiving a non-zero exit code,
e.g. checking whether it should switch to streaming.
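
To make the suspected cause concrete, here is a toy model of the loop described above. It is not PostgreSQL code and says nothing about the real implementation. The function name, parameters and iteration counts are invented for illustration. It only contrasts checking for a promote request on every iteration with checking only after restore_command fails.

```python
def segments_replayed_before_promotion(archive_segments, promote_at, check_every_iteration):
    """Toy model: count segments replayed before a promote request is noticed.

    archive_segments: how many segments restore_command can still deliver.
    promote_at: loop iteration at which the promote request arrives.
    check_every_iteration: True to check before every fetch, False to check
        only on the path taken after restore_command fails.
    """
    replayed = 0
    step = 0
    while True:
        promote_pending = step >= promote_at
        if check_every_iteration and promote_pending:
            return replayed
        if replayed < archive_segments:
            # restore_command succeeded: replay the segment and keep going.
            replayed += 1
        elif promote_pending:
            # restore_command failed: this is the only check in the other mode.
            return replayed
        step += 1


if __name__ == "__main__":
    # Promote requested at iteration 3 with 100 segments left in the archive.
    print(segments_replayed_before_promotion(100, 3, check_every_iteration=False))  # 100
    print(segments_replayed_before_promotion(100, 3, check_every_iteration=True))   # 3
```

In the model, a request made while the archive still has segments is only noticed once the archive runs dry, which is the symptom in step 6.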


Re: BUG #17577: pg_ctl promote is not preemptive in archive recovery

From
"David G. Johnston"
Date:
On Fri, Aug 5, 2022 at 12:10 PM PG Bug reporting form <noreply@postgresql.org> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      17577
> Logged by:          Daniel Farina
> Email address:      daniel@fdr.io
> PostgreSQL version: 14.4
> Operating system:   AlmaLinux 8.6
> Description:
>
>
> 5) try to run pg_ctl promote while the server is in archive restore
> 6) it will block until timeout, and not promote until restore_command exits
> abnormally


On what basis are you considering this a bug?  Or, IOW, what do you expect to happen?  It doesn't seem possible for the promotion to actually happen as the server knows additional WAL must exist that it hasn't yet restored since all attempts to restore WAL have succeeded.

David J.

Re: BUG #17577: pg_ctl promote is not preemptive in archive recovery

From
Daniel Farina
Date:
On Fri, Aug 5, 2022 at 12:21 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Fri, Aug 5, 2022 at 12:10 PM PG Bug reporting form <noreply@postgresql.org> wrote:
>>
>> The following bug has been logged on the website:
>>
>> Bug reference:      17577
>> Logged by:          Daniel Farina
>> Email address:      daniel@fdr.io
>> PostgreSQL version: 14.4
>> Operating system:   AlmaLinux 8.6
>> Description:
>>
>>
>> 5) try to run pg_ctl promote while the server is in archive restore
>> 6) it will block until timeout, and not promote until restore_command exits
>> abnormally
>>
>
> On what basis are you considering this a bug?  Or, IOW, what do you expect to happen?  It doesn't seem possible for the promotion to actually happen as the server knows additional WAL must exist that it hasn't yet restored since all attempts to restore WAL have succeeded.

pg_ctl promote should have consistent behavior regardless of WAL
transport. If I (or a computer program of mine) is issuing pg_ctl
promote, I mean for it to happen now, that's how it happens with
streaming, and in the case of streaming, the amount of WAL that can
eventually come into existence is practically unbounded.



Re: BUG #17577: pg_ctl promote is not preemptive in archive recovery

From
Kyotaro Horiguchi
Date:
At Fri, 5 Aug 2022 13:01:19 -0700, Daniel Farina <daniel@fdr.io> wrote in 
> On Fri, Aug 5, 2022 at 12:21 PM David G. Johnston
> <david.g.johnston@gmail.com> wrote:
> > On what basis are you considering this a bug?  Or, IOW, what do you expect to happen?  It doesn't seem possible for the promotion to actually happen as the server knows additional WAL must exist that it hasn't yet restored since all attempts to restore WAL have succeeded.
 
> 
> pg_ctl promote should have consistent behavior regardless of WAL
> transport. If I (or a computer program of mine) is issuing pg_ctl
> promote, I mean for it to happen now, that's how it happens with
> streaming, and in the case of streaming, the amount of WAL that can
> eventually come into existence is practically unbounded.

pg_ctl just commands or prompts server to do that.  The server
responds to the commands at its convenience.  It works the same way
for start/stop/restart and maybe some other subcommands.  If
something's going wrong on the server, there are cases where it cannot
fulfill the order.  For example, with streaming, if the walreceiver
process is hanging for some reason, pg_ctl promote waits for the
server to promote but will eventually time out while the server cannot
promote.

This kind of behavior is by design, and is not a bug.  Of course,
we're open to someone coming up with a good improvement to those
behaviors.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BUG #17577: pg_ctl promote is not preemptive in archive recovery

From
Daniel Farina
Date:
On Sun, Aug 7, 2022 at 7:07 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Fri, 5 Aug 2022 13:01:19 -0700, Daniel Farina <daniel@fdr.io> wrote in
> > On Fri, Aug 5, 2022 at 12:21 PM David G. Johnston
> > <david.g.johnston@gmail.com> wrote:
> > > On what basis are you considering this a bug?  Or, IOW, what do you expect to happen?  It doesn't seem possible for the promotion to actually happen as the server knows additional WAL must exist that it hasn't yet restored since all attempts to restore WAL have succeeded.
> >
> > pg_ctl promote should have consistent behavior regardless of WAL
> > transport. If I (or a computer program of mine) is issuing pg_ctl
> > promote, I mean for it to happen now, that's how it happens with
> > streaming, and in the case of streaming, the amount of WAL that can
> > eventually come into existence is practically unbounded.
>
> pg_ctl just commands or prompts server to do that.  The server
> responds to the commands at its convenience.  It works the same way
> for start/stop/restart and maybe some other subcommands.

I mean, sure, there's also CHECK_FOR_INTERRUPTS(), so yes, the server
does things at its convenience...but it's a rationale that borders on
the tautological.

Here are some questions:

1) How sure is the present company that this behavior was always the
case, going back to 8.4 or so?
2) What is the recommended method if I am satisfied with the current
recovery progress of a database and wish to promote?
3) Doesn't promote work promptly when the server is streaming? If it
does, why should the behavior be so dramatically different when it is
in archive recovery?