Re: pg_veryfybackup can fail with a valid backup for TLI > 1 - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: pg_veryfybackup can fail with a valid backup for TLI > 1
Date
Msg-id YR3/eijQC2cDYCtD@paquier.xyz
Whole thread Raw
In response to pg_veryfybackup can fail with a valid backup for TLI > 1  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: pg_veryfybackup can fail with a valid backup for TLI > 1  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
On Wed, Aug 18, 2021 at 02:30:31PM +0900, Kyotaro Horiguchi wrote:
> The "Start-LSN" above is the beginning of timeline 2, not the
> backup-start LSN. The segment had been removed by the checkpoint.

Good catch.  That's broken, and I find cleaner the logic to compare
the TLI numbers rather than the start or end position of the TLI to
check if a TLI should have a parent one.

> The comment for AddWALInfoToBackupManifest() says:
> > * Add information about the WAL that will need to be replayed when restoring
> > * this backup to the manifest.

Yes, backup manifests include one array with all the timelines that
require coverage.  This case would only impact backups taken from
standbys, which have timeline jumps while the backup is taken with
parent nodes promoted.  That's easy enough to test with a chain of
standbys, some pgbench and pg_basebackup --max-rate to slow down the
whole.

> So I concluded that it's a thinko.
>
> Please see the attached.  It needs to be back-patched to 13 but 13
> doesn't accept the patch as is due to wording chages in TAP tests.

+       if (starttli == entry->tli)
            tl_beginptr = startptr;
+       else
+       {
+           tl_beginptr = entry->begin;

I would add a comment here.  This does not need to be complicated,
say:
"If this timeline entry matches with the timeline on which the backup
started, WAL needs to be checked from the start LSN of the backup.  If
this entry refers to a newer timeline, WAL needs to be checked since
the beginning of this timeline, so use the LSN where the timeline
began."

The TAP test is fancy, I like it.  So let's keep it.

+# base-backup runs a checkpoint but just to make sure.
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();
CHECKPOINT;');
I don't think you need this part to get a failure as the checkpoint
from the base backup will be sufficient.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Skipping logical replication transactions on subscriber side
Next
From: Michael Paquier
Date:
Subject: Re: elog.c query_id support vs shutdown