Thread: BUG #17577: pg_ctl promote is not preemptive in archive recovery
The following bug has been logged on the website:

Bug reference: 17577
Logged by: Daniel Farina
Email address: daniel@fdr.io
PostgreSQL version: 14.4
Operating system: AlmaLinux 8.6
Description:

Reproduction:

1) Back up a new, empty initdb server.
2) Run pgbench in mixed mode for a while to generate WAL, possibly for a long time.
3) Set up a replica. (Side note: in my case, though it may or may not be important, I also have a primary_conninfo defined and standby.signal. The primary_conninfo is not used, however, as the server never catches up enough to do that before I run pg_ctl promote.)
4) Wait for consistency. It should take a short while given the backup is of an empty database.
5) Try to run pg_ctl promote while the server is in archive restore.
6) It will block until timeout, and not promote until restore_command exits abnormally.

Other notes: upon running pg_ctl promote, the "promote" file is written, but the "server has received promote request" message is not written to the logs.

A workaround: killing the restore_command, i.e. injecting a non-zero exit code, will cause postgres to print the "has received promote request" message and go through promotion.

Probable cause: something is not checking for pg_ctl promote having been run as often as it should when WAL is being sourced from restore_command, but it does get checked when postgres does its expected actions when receiving a non-zero exit code, e.g. checking whether it should switch to streaming.
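The "probable cause" hypothesis above can be illustrated with a toy model. This is only a sketch of the reporter's guess, not PostgreSQL's actual recovery code: the function `replay`, its parameters, and the boolean list standing in for restore_command results are all hypothetical names invented for illustration.

```python
def replay(segments, promote_requested_at, check_every_iteration):
    """Toy recovery loop (hypothetical, not PostgreSQL source).

    segments: list of booleans; True means the simulated restore_command
        succeeded for that WAL segment, False means it exited non-zero.
    promote_requested_at: loop index at which the promote request appears.
    check_every_iteration: if True, look for the promote request before
        every segment; if False, look only after a failed restore.

    Returns the loop index at which the promote request is noticed,
    or None if it is never noticed.
    """
    for i, restored_ok in enumerate(segments):
        promote_pending = i >= promote_requested_at

        # Variant A: check for promotion on every pass through the loop.
        if check_every_iteration and promote_pending:
            return i

        # Variant B (the hypothesized behavior): the check is only reached
        # on the failure path, where other sources are also considered.
        if not restored_ok and promote_pending:
            return i

    return None


if __name__ == "__main__":
    # Five successful restores, one failure, then three more successes.
    segments = [True] * 5 + [False] + [True] * 3
    print(replay(segments, 2, check_every_iteration=False))
    print(replay(segments, 2, check_every_iteration=True))
```

In this model, a request made at index 2 is noticed only at the first failed restore when the check sits on the failure path, and it is never noticed if every restore succeeds. That matches the reported symptom that killing restore_command unblocks promotion.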
On Fri, Aug 5, 2022 at 12:10 PM PG Bug reporting form <noreply@postgresql.org> wrote:
> The following bug has been logged on the website:
>
> Bug reference: 17577
> Logged by: Daniel Farina
> Email address: daniel@fdr.io
> PostgreSQL version: 14.4
> Operating system: AlmaLinux 8.6
> Description:
>
> 5) try to run pg_ctl promote while the server is in archive restore
> 6) it will block until timeout, and not promote until restore_command exits
> abnormally

On what basis are you considering this a bug? Or, IOW, what do you expect to happen? It doesn't seem possible for the promotion to actually happen as the server knows additional WAL must exist that it hasn't yet restored since all attempts to restore WAL have succeeded.
David J.
On Fri, Aug 5, 2022 at 12:21 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
>
> On Fri, Aug 5, 2022 at 12:10 PM PG Bug reporting form <noreply@postgresql.org> wrote:
>>
>> The following bug has been logged on the website:
>>
>> Bug reference: 17577
>> Logged by: Daniel Farina
>> Email address: daniel@fdr.io
>> PostgreSQL version: 14.4
>> Operating system: AlmaLinux 8.6
>> Description:
>>
>> 5) try to run pg_ctl promote while the server is in archive restore
>> 6) it will block until timeout, and not promote until restore_command exits
>> abnormally
>>
>
> On what basis are you considering this a bug? Or, IOW, what do you expect to happen? It doesn't seem possible for the promotion to actually happen as the server knows additional WAL must exist that it hasn't yet restored since all attempts to restore WAL have succeeded.

pg_ctl promote should have consistent behavior regardless of WAL transport. If I (or a computer program of mine) issue pg_ctl promote, I mean for it to happen now. That is how it happens with streaming, and in the case of streaming, the amount of WAL that can eventually come into existence is practically unbounded.
At Fri, 5 Aug 2022 13:01:19 -0700, Daniel Farina <daniel@fdr.io> wrote in
> On Fri, Aug 5, 2022 at 12:21 PM David G. Johnston
> <david.g.johnston@gmail.com> wrote:
> > On what basis are you considering this a bug? Or, IOW, what do you expect to happen? It doesn't seem possible for the promotion to actually happen as the server knows additional WAL must exist that it hasn't yet restored since all attempts to restore WAL have succeeded.
>
> pg_ctl promote should have consistent behavior regardless of WAL
> transport. If I (or a computer program of mine) is issuing pg_ctl
> promote, I mean for it to happen now, that's how it happens with
> streaming, and in the case of streaming, the amount of WAL that can
> eventually come into existence is practically unbounded.

pg_ctl just commands or prompts the server to do that. The server responds to the commands at its convenience. It works the same way for start/stop/restart and maybe some other subcommands. If something is going wrong on the server, there are cases where it cannot fulfill the request. For example, with streaming, if the walreceiver process is hanging for some reason, pg_ctl promote waits for the server to promote but will eventually time out while the server cannot promote.

This behavior is by design and is not a bug. Of course, we are open to someone coming up with a good improvement to those behaviors.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
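For readers who script promotion, the wait-then-time-out behavior described above could be handled roughly as follows. This is a minimal sketch under assumptions, not a recommended procedure from the thread. The helper names `promote_command` and `try_promote` are hypothetical. The `-D`, `-w` and `-t` options are the documented pg_ctl data-directory, wait and timeout options. Treating a non-zero exit status as "did not promote in time" is an assumption about how the caller wants to interpret failure.

```python
import subprocess


def promote_command(datadir, timeout_s=60):
    """Build the argv for a waiting `pg_ctl promote` with an explicit timeout.

    datadir and timeout_s are caller-supplied; nothing here is specific to
    archive recovery or streaming.
    """
    return ["pg_ctl", "promote", "-D", datadir, "-w", "-t", str(timeout_s)]


def try_promote(datadir, timeout_s=60):
    """Run pg_ctl promote and report whether it exited successfully.

    Assumption: a non-zero exit status is treated as "promotion did not
    complete within the timeout" (or failed for some other reason); the
    caller decides what to do next.
    """
    result = subprocess.run(
        promote_command(datadir, timeout_s),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

A caller would invoke `try_promote("/path/to/datadir", timeout_s=30)` and fall back to its own handling if it returns False. The path shown is a placeholder.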
On Sun, Aug 7, 2022 at 7:07 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Fri, 5 Aug 2022 13:01:19 -0700, Daniel Farina <daniel@fdr.io> wrote in
> > On Fri, Aug 5, 2022 at 12:21 PM David G. Johnston
> > <david.g.johnston@gmail.com> wrote:
> > > On what basis are you considering this a bug? Or, IOW, what do you expect to happen? It doesn't seem possible for the promotion to actually happen as the server knows additional WAL must exist that it hasn't yet restored since all attempts to restore WAL have succeeded.
> >
> > pg_ctl promote should have consistent behavior regardless of WAL
> > transport. If I (or a computer program of mine) is issuing pg_ctl
> > promote, I mean for it to happen now, that's how it happens with
> > streaming, and in the case of streaming, the amount of WAL that can
> > eventually come into existence is practically unbounded.
>
> pg_ctl just commands or prompts server to do that. The server
> responds to the commands at its convenience. It works the same way
> for start/stop/restart and maybe some other subcommands.

I mean, sure, there's also CHECK_FOR_INTERRUPTS(), so yes, the server does things at its convenience... but it's a rationale that borders on the tautological.

Here are some questions:

1) How sure is the present company that this behavior was always the case, going back to 8.4 or so?

2) What is the recommended method if I am satisfied with the current recovery progress of a database and wish to promote?

3) Doesn't promote work promptly when the server is streaming? If it does, why should the behavior be so dramatically different when it is in archive recovery?