pg_veryfybackup can fail with a valid backup for TLI > 1 - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject pg_veryfybackup can fail with a valid backup for TLI > 1
Date
Msg-id 20210818.143031.1867083699202617521.horikyota.ntt@gmail.com
Whole thread Raw
Responses Re: pg_veryfybackup can fail with a valid backup for TLI > 1  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Hello.

pg_veryfybackup can fail with a valid backup when the backup was taken
from TLI > 1.

=====
# initdb
$ pg_ctl start (just to make sure)
$ pg_ctl stop
$ touch $PGDATA/standby.signal
$ pg_ctl start
$ pg_ctl promote
$ psql
postgres=# select pg_switch_wal();
 pg_switch_wal 
---------------
 0/14FE340
(1 row)
postgres=# checkpoint;
postgres=# ^D
$ pg_basebackup -Fp -h /tmp -D tmpbk
$ pg_veryfybackup tmpbk
(thinking.. thiking.. zzz.. for several seconds)
pg_waldump: fatal: could not find file "000000020000000000000001": No such file or directory
pg_verifybackup: error: WAL parsing failed for timeline 2
=====

This is bacause the backup_manifest has the wrong values as below.

> "WAL-Ranges": [
> { "Timeline": 2, "Start-LSN": "0/14FE248", "End-LSN": "0/3000100" }
> ],

The "Start-LSN" above is the beginning of timeline 2, not the
backup-start LSN. The segment had been removed by the checkpoint.


The comment for AddWALInfoToBackupManifest() says:
> * Add information about the WAL that will need to be replayed when restoring
> * this backup to the manifest.

So I concluded that it's a thinko.

Please see the attached.  It needs to be back-patched to 13 but 13
doesn't accept the patch as is due to wording chages in TAP tests.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From c9c299091ccae694a9c1d58c616407e55be53f6f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 18 Aug 2021 11:43:53 +0900
Subject: [PATCH v1] Fix basebackup to generate correct WAL-Ranges info.

While WAL-Ranges in backup_manifest is supposed to have the WAL range
to apply at recovery. it actually starts from the beginning of the
oldest timeline if the starting TLI > 1.  As the result
pg_verifybackup may result in a false failure for a valid
backup. Adjust the logic to make WAL-Ranges start with the actual
start LSN.

Backpatch to PG13 where this feature has been introduced.
---
 src/backend/replication/backup_manifest.c | 10 +++++-----
 src/bin/pg_verifybackup/t/007_wal.pl      | 18 +++++++++++++++++-
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c
index 8882444025..d7a8584124 100644
--- a/src/backend/replication/backup_manifest.c
+++ b/src/backend/replication/backup_manifest.c
@@ -255,11 +255,11 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
                     errmsg("expected end timeline %u but found timeline %u",
                            starttli, entry->tli));
 
-        if (!XLogRecPtrIsInvalid(entry->begin))
-            tl_beginptr = entry->begin;
-        else
-        {
+        if (starttli == entry->tli)
             tl_beginptr = startptr;
+        else
+        {
+            tl_beginptr = entry->begin;
 
             /*
              * If we reach a TLI that has no valid beginning LSN, there can't
@@ -267,7 +267,7 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
              * better have arrived at the expected starting TLI. If not,
              * something's gone horribly wrong.
              */
-            if (starttli != entry->tli)
+            if (XLogRecPtrIsInvalid(entry->begin))
                 ereport(ERROR,
                         errmsg("expected start timeline %u but found timeline %u",
                                starttli, entry->tli));
diff --git a/src/bin/pg_verifybackup/t/007_wal.pl b/src/bin/pg_verifybackup/t/007_wal.pl
index adf60fb755..bc52f0d1d9 100644
--- a/src/bin/pg_verifybackup/t/007_wal.pl
+++ b/src/bin/pg_verifybackup/t/007_wal.pl
@@ -10,7 +10,7 @@ use Config;
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 9;
 
 # Start up the server and take a backup.
 my $primary = PostgresNode->new('primary');
@@ -59,3 +59,19 @@ command_fails_like(
     [ 'pg_verifybackup', $backup_path ],
     qr/WAL parsing failed for timeline 1/,
     'corrupt WAL file causes failure');
+
+#
+# Check that WAL-Ranges has the correct values when TLI > 1
+$primary->stop;
+$primary->append_conf('standby.signal');
+$primary->start;
+$primary->promote;
+# base-backup runs a checkpoint but just to make sure.
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal(); CHECKPOINT;');
+
+my $backup_path2 = $primary->backup_dir . '/test_tli';
+$primary->command_ok([ 'pg_basebackup', '-D', $backup_path2, '--no-sync' ],
+    "base backup 2 ok");
+# Should pass this case
+command_ok([ 'pg_verifybackup', $backup_path2 ],
+    'verifybackup should pass for a valid backup on timeline > 1');
-- 
2.27.0

From 59371c3a1fd6bb504362b00fc3511a791f7d1fb4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 18 Aug 2021 13:58:15 +0900
Subject: [PATCH v1] Fix basebackup to generate correct WAL-Ranges info.

While WAL-Ranges in backup_manifest is supposed to have the WAL range
to apply at recovery. it starts from the beginning of the oldest
timeline if the starting TLI > 1.  As the result pg_verifybackup may
result in a false failure for a valid backup. Adjust the logic to make
the WAL-Ranges starts with the actual start LSN.

Backpatch to PG13 where this feature has been introduced.
---
 src/backend/replication/backup_manifest.c | 10 +++++-----
 src/bin/pg_verifybackup/t/007_wal.pl      | 18 +++++++++++++++++-
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c
index 1e07d99e64..14edda5cf4 100644
--- a/src/backend/replication/backup_manifest.c
+++ b/src/backend/replication/backup_manifest.c
@@ -236,11 +236,11 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
                     errmsg("expected end timeline %u but found timeline %u",
                            starttli, entry->tli));
 
-        if (!XLogRecPtrIsInvalid(entry->begin))
-            tl_beginptr = entry->begin;
-        else
-        {
+        if (starttli == entry->tli)
             tl_beginptr = startptr;
+        else
+        {
+            tl_beginptr = entry->begin;
 
             /*
              * If we reach a TLI that has no valid beginning LSN, there can't
@@ -248,7 +248,7 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
              * better have arrived at the expected starting TLI. If not,
              * something's gone horribly wrong.
              */
-            if (starttli != entry->tli)
+            if (!XLogRecPtrIsInvalid(entry->begin))
                 ereport(ERROR,
                         errmsg("expected start timeline %u but found timeline %u",
                                starttli, entry->tli));
diff --git a/src/bin/pg_verifybackup/t/007_wal.pl b/src/bin/pg_verifybackup/t/007_wal.pl
index 56d536675c..b47ddf20e1 100644
--- a/src/bin/pg_verifybackup/t/007_wal.pl
+++ b/src/bin/pg_verifybackup/t/007_wal.pl
@@ -7,7 +7,7 @@ use Config;
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 9;
 
 # Start up the server and take a backup.
 my $master = get_new_node('master');
@@ -56,3 +56,19 @@ command_fails_like(
     [ 'pg_verifybackup', $backup_path ],
     qr/WAL parsing failed for timeline 1/,
     'corrupt WAL file causes failure');
+
+#
+# Check that WAL-Ranges has the correct values when TLI > 1
+$master->stop;
+$master->append_conf('standby.signal');
+$master->start;
+$master->promote;
+# base-backup runs a checkpoint but just to make sure.
+$master->safe_psql('postgres', 'SELECT pg_switch_wal(); CHECKPOINT;');
+
+my $backup_path2 = $master->backup_dir . '/test_tli';
+$master->command_ok([ 'pg_basebackup', '-D', $backup_path2, '--no-sync' ],
+    "base backup 2 ok");
+# Should pass this case
+command_ok([ 'pg_verifybackup', $backup_path2 ],
+    'verifybackup should pass for a valid backup on timeline > 1');
-- 
2.27.0


pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Skipping logical replication transactions on subscriber side
Next
From: Amit Kapila
Date:
Subject: Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o