Re: Requiring recovery.signal or standby.signal when recovering with a backup_label - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
Msg-id ZUBM6BNQnEh7lzIM@paquier.xyz
In response to Re: Requiring recovery.signal or standby.signal when recovering with a backup_label  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
List pgsql-hackers
On Mon, Oct 30, 2023 at 01:55:13PM -0400, Robert Haas wrote:
> I would encourage some caution here.

Thanks for chiming in here.

> In a vacuum, I'm in favor of this, and for the same reasons as you,
> namely, that the huge pile of Booleans that we use to control recovery
> is confusing, and it's difficult to make sure that all the code paths
> are adequately tested, and I think some of the things that actually
> work here are not documented.

Yep, same feeling here.

> But in practice, I think there is a possibility of something like this
> backfiring very hard. Notice that the first two people who commented
> on the thread saw the error and immediately removed backup_label even
> though that's 100% wrong. It shows how utterly willing users are to
> remove backup_label for any reason or no reason at all. If we convert
> cases where things would have worked into cases where people nuke
> backup_label and then it appears to work, we're going to be worse off
> in the long run, no matter how crazy the idea of removing backup_label
> may seem to us.

As far as I know, there's one paragraph in the docs that implies this
mode without giving an actual hint that this may be OK or not, so
shrug:
https://www.postgresql.org/docs/devel/continuous-archiving.html#BACKUP-TIPS
"As with base backups, the easiest way to produce a standalone hot
backup is to use the pg_basebackup tool. If you include the -X
parameter when calling it, all the write-ahead log required to use the
backup will be included in the backup automatically, and no special
action is required to restore the backup."

And a few lines down the docs suggest using restore_command, something
that we check is set only if recovery.signal is set.  See additionally
validateRecoveryParameters(), where the comments imply that
InArchiveRecovery would be set only when there's a restore command.

As you're telling me, and I've considered that as an option as well,
perhaps we should just consider the presence of a backup_label file
with no .signal files as a synonym of crash recovery?  In the recovery
path, currently the essence of the problem is when we do
InArchiveRecovery=true, but ArchiveRecoveryRequested=false, meaning
that it should do archive recovery but we don't want it, and that does
not really make sense.  The rest of the code sort of implies that this
is not a supported combination.  So basically, my suggestion here is
to just replay WAL up to the end of what's in your local pg_wal/ and
hope for the best, without TLI jumps, except that we'd do nothing.
Doing a pg_basebackup -X stream followed by a restart would work fine
with that, because all the WAL is here.
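
To make the flag combinations concrete, here is a toy model in Python.
It is only a sketch of the behavior described above, not the actual
redo code: the function and class names are made up for illustration,
and the case without a backup_label is simplified to "archive recovery
happens only if a .signal file asked for it".

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RecoveryState:
    # Mirrors the two flags discussed above (names adapted to Python).
    archive_recovery_requested: bool  # a .signal file was found
    in_archive_recovery: bool         # redo behaves as archive recovery


def current_behavior(has_backup_label: bool, has_signal_file: bool) -> RecoveryState:
    """Model of today's behavior as described in this message:
    a backup_label forces archive recovery even without a .signal file."""
    requested = has_signal_file
    # Simplification: without a backup_label, assume archive recovery
    # only happens when it was requested.
    in_archive = has_backup_label or requested
    return RecoveryState(requested, in_archive)


def proposed_behavior(has_backup_label: bool, has_signal_file: bool) -> RecoveryState:
    """Model of the suggestion: backup_label with no .signal file is
    treated like crash recovery (replay what is in the local pg_wal/)."""
    requested = has_signal_file
    in_archive = requested
    return RecoveryState(requested, in_archive)


def is_contradictory(state: RecoveryState) -> bool:
    """True for the combination that should stop existing:
    InArchiveRecovery=true with ArchiveRecoveryRequested=false."""
    return state.in_archive_recovery and not state.archive_recovery_requested


if __name__ == "__main__":
    for label, signal in product((False, True), repeat=2):
        cur = current_behavior(label, signal)
        new = proposed_behavior(label, signal)
        print(
            f"backup_label={label} signal={signal} "
            f"current_contradictory={is_contradictory(cur)} "
            f"proposed_contradictory={is_contradictory(new)}"
        )
```

In this model the only contradictory input is a backup_label with no
.signal file under the current behavior, and the suggested behavior
never produces that combination.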

A point of contention is if we'd better be stricter about satisfying
backupEndPoint in such a case, but the redo code only wants to do
something here when ArchiveRecoveryRequested is set (aka there's a
.signal file set), and we would not want a TLI jump at the end of
recovery, so I don't see an argument for caring about backupEndPoint
in this case.

In the end, I'm OK as long as the combination of
ArchiveRecoveryRequested=false and InArchiveRecovery=true does not
exist anymore, because that makes it much easier to follow what's going
on in the redo path, IMHO.

(I have a patch at hand to show the idea, will post it with a reply to
Andres' message.)

> Also, Andres just recently mentioned to me that he uses this procedure
> of starting a server with a backup_label but no recovery.signal or
> standby.signal file regularly, and thinks other people do too. I was
> surprised, since I've never done that, except maybe when I was a noob
> and didn't have a clue. But Andres is far from a noob.

At this stage, that's at your own risk, as the code thinks it's OK to
force what's basically archive-recovery-without-being-it.  So it
works, but it can also easily backfire.
--
Michael
