Requiring recovery.signal or standby.signal when recovering with a backup_label - Mailing list pgsql-hackers

From Michael Paquier
Subject Requiring recovery.signal or standby.signal when recovering with a backup_label
Date
Msg-id ZArVOMifjzE7f8W7@paquier.xyz
Whole thread Raw
Responses Re: Requiring recovery.signal or standby.signal when recovering with a backup_label  (Michael Paquier <michael@paquier.xyz>)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
Hi all,

This is a follow-up of the point I have made a few weeks ago on this
thread of pgsql-bugs about $subject:
https://www.postgresql.org/message-id/Y/Q/17rpYS7YGbIt@paquier.xyz
https://www.postgresql.org/message-id/Y/v0c+3W89NBT/if@paquier.xyz

Here is a short summary of what I think is incorrect, and what I'd
like to do to improve things moving forward, this pointing to a
simple solution..

While looking at the so-said thread, I have dug into the recovery code
to see what looks like an incorrect assumption behind the two boolean
flags named ArchiveRecoveryRequested and InArchiveRecovery that we
have in xlogrecovery.c to control the behavior of archive recovery in
the startup process.  For information, as of HEAD, these two are
described as follows:
/*
 * When ArchiveRecoveryRequested is set, archive recovery was requested,
 * ie. signal files were present.  When InArchiveRecovery is set, we are
 * currently recovering using offline XLOG archives.  These variables are only
 * valid in the startup process.
 *
 * When ArchiveRecoveryRequested is true, but InArchiveRecovery is false, we're
 * currently performing crash recovery using only XLOG files in pg_wal, but
 * will switch to using offline XLOG archives as soon as we reach the end of
 * WAL in pg_wal.
 */
bool       ArchiveRecoveryRequested = false;
bool       InArchiveRecovery = false;

When you read this text alone, its assumptions are simple.  When the
startup process finds a recovery.signal or a standby.signal, we switch
ArchiveRecoveryRequested to true.  If there is a standby.signal,
InArchiveRecovery would be immediately set to true before beginning
the redo loop.  If we begin redo with a recovery.signal,
ArchiveRecoveryRequested = true and InArchiveRecovery = false, crash
recovery happens first, consuming all the WAL in pg_wal/, then we'd
move on with archive recovery.

Now comes the problem of the other thread, which is what happens when
you use a backup_label *without* a recovery.signal or a
standby.signal.  In this case, as currently coded, it is possible to
enforce ArchiveRecoveryRequested = false and later InArchiveRecovery =
true.  Not setting ArchiveRecoveryRequested has a couple of undesired
effect.  First, this skips some initialization steps that may be
needed at a later point in recovery.  The thread quoted above has
reported one aspect of that: we miss some hot-standby related
intialization that can reflect if replaying enough WAL that a restart
point could happen.  Depending on the amount of data copied into
pg_wal/ before starting a node with only a backup_label it may also be
possible that a consistent point has been reached, where restart
points would be legit.  A second Kiss Cool effect (understands who
can), is that we miss the initialization of the recoveryWakeupLatch.
A third effect is that some code paths can use GUC values related to
recovery without ArchiveRecoveryRequested being around, one example
seems to be hot_standby whose default is true.

It is worth noting the end of FinishWalRecovery(), that includes this
part:
    if (ArchiveRecoveryRequested)
    {
        /*
         * We are no longer in archive recovery state.
         *
         * We are now done reading the old WAL.  Turn off archive fetching if
         * it was active.
         */
        Assert(InArchiveRecovery);
        InArchiveRecovery = false;

I have been pondering for a few weeks now about what kind of
definition would suit to a cluster having a backup_label file without
a signal file added, which is documented as required by the docs in
the HA section as well as pg_rewind.  It is true that there could be a
point to allow such a configuration so as a node recovers without a
TLI jump, but I cannot find appealing this case, as well, as a node
could just happily overwrite WAL segments in the archives on an
existing timeline, potentially corruption other nodes writing on the
same TLI.  There are a few other recovery scenarios where one copies
directly WAL segments into pg_wal/ that can lead to a lot of weird
inconsistencies as well, one being the report of the thread of
pgsql-hackers.

At the end, I'd like to think that we should just require
a recovery.signal or a standby.signal if we have a backup_label file,
and even enforce this rule at the end of recovery for some sanity
checks.  I don't think that we can just enforce
ArchiveRecoveryRequested in this path, either, as a backup_label would
be renamed to .old once the control file knows up to which LSN it
needs to replay to reach consistency and if an end-of-backup record is
required.  That's not something that can be reasonably backpatched, as
it could disturb some recovery workflows, even if these are kind of in
a dangerous spot, IMO, so I would like to propose that only on HEAD
for 16~ because the recovery code has never really considered this
combination of ArchiveRecoveryRequested and InArchiveRecovery.

While digging into that, I have found one TAP test of pg_basebackup
that was doing recovery with just a backup_label file, with a
restore_command already set.  A second case was in pg_rewind, were we
have a node without standby.signal, still it uses a primary_conninfo.

Attached is a patch on the lines of what I am thinking about.  This
reworks a bit some of the actions at the beginning of the startup
process:
- Report the set of LOGs showing the state of the node after reading
the backup_label.
- Enforce a rule in ShutdownWalRecovery() and document the
restriction.
- Add a check with the signal files after finding a backup_label
file.
- The validation checks on the recovery parameters are applied (aka
restore_command required with recovery.signal, or a primary_conninfo
required on standby for streaming, etc.).

My apologies for the long message, but this deserves some attention,
IMHO.

So, any thoughts?
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Next
From: Peter Eisentraut
Date:
Subject: Re: Transparent column encryption