Thread: pg_veryfybackup can fail with a valid backup for TLI > 1

pg_veryfybackup can fail with a valid backup for TLI > 1

From
Kyotaro Horiguchi
Date:
Hello.

pg_veryfybackup can fail with a valid backup when the backup was taken
from TLI > 1.

=====
# initdb
$ pg_ctl start (just to make sure)
$ pg_ctl stop
$ touch $PGDATA/standby.signal
$ pg_ctl start
$ pg_ctl promote
$ psql
postgres=# select pg_switch_wal();
 pg_switch_wal 
---------------
 0/14FE340
(1 row)
postgres=# checkpoint;
postgres=# ^D
$ pg_basebackup -Fp -h /tmp -D tmpbk
$ pg_veryfybackup tmpbk
(thinking.. thiking.. zzz.. for several seconds)
pg_waldump: fatal: could not find file "000000020000000000000001": No such file or directory
pg_verifybackup: error: WAL parsing failed for timeline 2
=====

This is bacause the backup_manifest has the wrong values as below.

> "WAL-Ranges": [
> { "Timeline": 2, "Start-LSN": "0/14FE248", "End-LSN": "0/3000100" }
> ],

The "Start-LSN" above is the beginning of timeline 2, not the
backup-start LSN. The segment had been removed by the checkpoint.


The comment for AddWALInfoToBackupManifest() says:
> * Add information about the WAL that will need to be replayed when restoring
> * this backup to the manifest.

So I concluded that it's a thinko.

Please see the attached.  It needs to be back-patched to 13 but 13
doesn't accept the patch as is due to wording chages in TAP tests.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From c9c299091ccae694a9c1d58c616407e55be53f6f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 18 Aug 2021 11:43:53 +0900
Subject: [PATCH v1] Fix basebackup to generate correct WAL-Ranges info.

While WAL-Ranges in backup_manifest is supposed to have the WAL range
to apply at recovery. it actually starts from the beginning of the
oldest timeline if the starting TLI > 1.  As the result
pg_verifybackup may result in a false failure for a valid
backup. Adjust the logic to make WAL-Ranges start with the actual
start LSN.

Backpatch to PG13 where this feature has been introduced.
---
 src/backend/replication/backup_manifest.c | 10 +++++-----
 src/bin/pg_verifybackup/t/007_wal.pl      | 18 +++++++++++++++++-
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c
index 8882444025..d7a8584124 100644
--- a/src/backend/replication/backup_manifest.c
+++ b/src/backend/replication/backup_manifest.c
@@ -255,11 +255,11 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
                     errmsg("expected end timeline %u but found timeline %u",
                            starttli, entry->tli));
 
-        if (!XLogRecPtrIsInvalid(entry->begin))
-            tl_beginptr = entry->begin;
-        else
-        {
+        if (starttli == entry->tli)
             tl_beginptr = startptr;
+        else
+        {
+            tl_beginptr = entry->begin;
 
             /*
              * If we reach a TLI that has no valid beginning LSN, there can't
@@ -267,7 +267,7 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
              * better have arrived at the expected starting TLI. If not,
              * something's gone horribly wrong.
              */
-            if (starttli != entry->tli)
+            if (XLogRecPtrIsInvalid(entry->begin))
                 ereport(ERROR,
                         errmsg("expected start timeline %u but found timeline %u",
                                starttli, entry->tli));
diff --git a/src/bin/pg_verifybackup/t/007_wal.pl b/src/bin/pg_verifybackup/t/007_wal.pl
index adf60fb755..bc52f0d1d9 100644
--- a/src/bin/pg_verifybackup/t/007_wal.pl
+++ b/src/bin/pg_verifybackup/t/007_wal.pl
@@ -10,7 +10,7 @@ use Config;
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 9;
 
 # Start up the server and take a backup.
 my $primary = PostgresNode->new('primary');
@@ -59,3 +59,19 @@ command_fails_like(
     [ 'pg_verifybackup', $backup_path ],
     qr/WAL parsing failed for timeline 1/,
     'corrupt WAL file causes failure');
+
+#
+# Check that WAL-Ranges has the correct values when TLI > 1
+$primary->stop;
+$primary->append_conf('standby.signal');
+$primary->start;
+$primary->promote;
+# base-backup runs a checkpoint but just to make sure.
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal(); CHECKPOINT;');
+
+my $backup_path2 = $primary->backup_dir . '/test_tli';
+$primary->command_ok([ 'pg_basebackup', '-D', $backup_path2, '--no-sync' ],
+    "base backup 2 ok");
+# Should pass this case
+command_ok([ 'pg_verifybackup', $backup_path2 ],
+    'verifybackup should pass for a valid backup on timeline > 1');
-- 
2.27.0

From 59371c3a1fd6bb504362b00fc3511a791f7d1fb4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 18 Aug 2021 13:58:15 +0900
Subject: [PATCH v1] Fix basebackup to generate correct WAL-Ranges info.

While WAL-Ranges in backup_manifest is supposed to have the WAL range
to apply at recovery. it starts from the beginning of the oldest
timeline if the starting TLI > 1.  As the result pg_verifybackup may
result in a false failure for a valid backup. Adjust the logic to make
the WAL-Ranges starts with the actual start LSN.

Backpatch to PG13 where this feature has been introduced.
---
 src/backend/replication/backup_manifest.c | 10 +++++-----
 src/bin/pg_verifybackup/t/007_wal.pl      | 18 +++++++++++++++++-
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c
index 1e07d99e64..14edda5cf4 100644
--- a/src/backend/replication/backup_manifest.c
+++ b/src/backend/replication/backup_manifest.c
@@ -236,11 +236,11 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
                     errmsg("expected end timeline %u but found timeline %u",
                            starttli, entry->tli));
 
-        if (!XLogRecPtrIsInvalid(entry->begin))
-            tl_beginptr = entry->begin;
-        else
-        {
+        if (starttli == entry->tli)
             tl_beginptr = startptr;
+        else
+        {
+            tl_beginptr = entry->begin;
 
             /*
              * If we reach a TLI that has no valid beginning LSN, there can't
@@ -248,7 +248,7 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
              * better have arrived at the expected starting TLI. If not,
              * something's gone horribly wrong.
              */
-            if (starttli != entry->tli)
+            if (!XLogRecPtrIsInvalid(entry->begin))
                 ereport(ERROR,
                         errmsg("expected start timeline %u but found timeline %u",
                                starttli, entry->tli));
diff --git a/src/bin/pg_verifybackup/t/007_wal.pl b/src/bin/pg_verifybackup/t/007_wal.pl
index 56d536675c..b47ddf20e1 100644
--- a/src/bin/pg_verifybackup/t/007_wal.pl
+++ b/src/bin/pg_verifybackup/t/007_wal.pl
@@ -7,7 +7,7 @@ use Config;
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 9;
 
 # Start up the server and take a backup.
 my $master = get_new_node('master');
@@ -56,3 +56,19 @@ command_fails_like(
     [ 'pg_verifybackup', $backup_path ],
     qr/WAL parsing failed for timeline 1/,
     'corrupt WAL file causes failure');
+
+#
+# Check that WAL-Ranges has the correct values when TLI > 1
+$master->stop;
+$master->append_conf('standby.signal');
+$master->start;
+$master->promote;
+# base-backup runs a checkpoint but just to make sure.
+$master->safe_psql('postgres', 'SELECT pg_switch_wal(); CHECKPOINT;');
+
+my $backup_path2 = $master->backup_dir . '/test_tli';
+$master->command_ok([ 'pg_basebackup', '-D', $backup_path2, '--no-sync' ],
+    "base backup 2 ok");
+# Should pass this case
+command_ok([ 'pg_verifybackup', $backup_path2 ],
+    'verifybackup should pass for a valid backup on timeline > 1');
-- 
2.27.0


Re: pg_veryfybackup can fail with a valid backup for TLI > 1

From
Michael Paquier
Date:
On Wed, Aug 18, 2021 at 02:30:31PM +0900, Kyotaro Horiguchi wrote:
> The "Start-LSN" above is the beginning of timeline 2, not the
> backup-start LSN. The segment had been removed by the checkpoint.

Good catch.  That's broken, and I find cleaner the logic to compare
the TLI numbers rather than the start or end position of the TLI to
check if a TLI should have a parent one.

> The comment for AddWALInfoToBackupManifest() says:
> > * Add information about the WAL that will need to be replayed when restoring
> > * this backup to the manifest.

Yes, backup manifests include one array with all the timelines that
require coverage.  This case would only impact backups taken from
standbys, which have timeline jumps while the backup is taken with
parent nodes promoted.  That's easy enough to test with a chain of
standbys, some pgbench and pg_basebackup --max-rate to slow down the
whole.

> So I concluded that it's a thinko.
>
> Please see the attached.  It needs to be back-patched to 13 but 13
> doesn't accept the patch as is due to wording chages in TAP tests.

+       if (starttli == entry->tli)
            tl_beginptr = startptr;
+       else
+       {
+           tl_beginptr = entry->begin;

I would add a comment here.  This does not need to be complicated,
say:
"If this timeline entry matches with the timeline on which the backup
started, WAL needs to be checked from the start LSN of the backup.  If
this entry refers to a newer timeline, WAL needs to be checked since
the beginning of this timeline, so use the LSN where the timeline
began."

The TAP test is fancy, I like it.  So let's keep it.

+# base-backup runs a checkpoint but just to make sure.
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();
CHECKPOINT;');
I don't think you need this part to get a failure as the checkpoint
from the base backup will be sufficient.
--
Michael

Attachment

Re: pg_veryfybackup can fail with a valid backup for TLI > 1

From
Kyotaro Horiguchi
Date:
Thank you for the comment.

At Thu, 19 Aug 2021 15:51:38 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Wed, Aug 18, 2021 at 02:30:31PM +0900, Kyotaro Horiguchi wrote:
> > The "Start-LSN" above is the beginning of timeline 2, not the
> > backup-start LSN. The segment had been removed by the checkpoint.
> 
> Good catch.  That's broken, and I find cleaner the logic to compare
> the TLI numbers rather than the start or end position of the TLI to
> check if a TLI should have a parent one.
> 
> > The comment for AddWALInfoToBackupManifest() says:
> > > * Add information about the WAL that will need to be replayed when restoring
> > > * this backup to the manifest.
> 
> Yes, backup manifests include one array with all the timelines that
> require coverage.  This case would only impact backups taken from
> standbys, which have timeline jumps while the backup is taken with
> parent nodes promoted.  That's easy enough to test with a chain of
> standbys, some pgbench and pg_basebackup --max-rate to slow down the
> whole. 

(I was very annoyed that pg_basbackup responds to --max-rate=1 only by
"out of range" without informing of the valid range..)

That's looks like a domino falling. I had the following result, which
looks fine.

"WAL-Ranges": [
{ "Timeline": 6, "Start-LSN": "0/630C7E8", "End-LSN": "0/630C850" },
{ "Timeline": 5, "Start-LSN": "0/630C780", "End-LSN": "0/630C7E8" },
{ "Timeline": 4, "Start-LSN": "0/630C718", "End-LSN": "0/630C780" },
{ "Timeline": 3, "Start-LSN": "0/630C6B0", "End-LSN": "0/630C718" },
{ "Timeline": 2, "Start-LSN": "0/5000028", "End-LSN": "0/630C6B0" }
],

00000006.history:
1    0/173F268    no recovery target specified
2    0/630C6B0    no recovery target specified
3    0/630C718    no recovery target specified
4    0/630C780    no recovery target specified
5    0/630C7E8    no recovery target specified

> > So I concluded that it's a thinko.
> > 
> > Please see the attached.  It needs to be back-patched to 13 but 13
> > doesn't accept the patch as is due to wording chages in TAP tests.
> 
> +       if (starttli == entry->tli)
>             tl_beginptr = startptr;
> +       else
> +       {
> +           tl_beginptr = entry->begin;
> 
> I would add a comment here.  This does not need to be complicated,
> say:
> "If this timeline entry matches with the timeline on which the backup
> started, WAL needs to be checked from the start LSN of the backup.  If
> this entry refers to a newer timeline, WAL needs to be checked since
> the beginning of this timeline, so use the LSN where the timeline
> began."

Thanks for your help! I wanted to write something like that
there. Added it as-is.

> The TAP test is fancy, I like it.  So let's keep it.
> 
> +# base-backup runs a checkpoint but just to make sure.
> +$primary->safe_psql('postgres', 'SELECT pg_switch_wal();
> CHECKPOINT;');
> I don't think you need this part to get a failure as the checkpoint
> from the base backup will be sufficient.

Yes, it is just redandunt as (I tried to) said in the comment. (I
meant by the "just to make sure" "just to let readers of the test code
notice that a checkpoint is crucial there", but it doesn't seem
working:() Instead, I added a comment just before pg_basebackup.

# base-backup below runs a checkpoint, which removes the first segment
# of the current timeline

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 8a75697715d21d84d16e9ad478d1f7b6dd389a49 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 18 Aug 2021 11:43:53 +0900
Subject: [PATCH v2] Fix basebackup to generate correct WAL-Ranges info.

While WAL-Ranges in backup_manifest is supposed to have the WAL range
to apply at recovery. it actually starts from the beginning of the
oldest timeline if the starting TLI > 1.  As the result
pg_verifybackup may result in a false failure for a valid
backup. Adjust the logic to make WAL-Ranges start with the actual
start LSN.

Backpatch to PG13 where this feature has been introduced.
---
 src/backend/replication/backup_manifest.c | 17 ++++++++++++-----
 src/bin/pg_verifybackup/t/007_wal.pl      | 18 +++++++++++++++++-
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c
index 921de47a6c..e9862a825f 100644
--- a/src/backend/replication/backup_manifest.c
+++ b/src/backend/replication/backup_manifest.c
@@ -251,11 +251,18 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
                     errmsg("expected end timeline %u but found timeline %u",
                            starttli, entry->tli));
 
-        if (!XLogRecPtrIsInvalid(entry->begin))
-            tl_beginptr = entry->begin;
-        else
-        {
+        /*
+         * If this timeline entry matches with the timeline on which the backup
+         * started, WAL needs to be checked from the start LSN of the backup.
+         * If this entry refers to a newer timeline, WAL needs to be checked
+         * since the beginning of this timeline, so use the LSN where the
+         * timeline began.
+         */
+        if (starttli == entry->tli)
             tl_beginptr = startptr;
+        else
+        {
+            tl_beginptr = entry->begin;
 
             /*
              * If we reach a TLI that has no valid beginning LSN, there can't
@@ -263,7 +270,7 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
              * better have arrived at the expected starting TLI. If not,
              * something's gone horribly wrong.
              */
-            if (starttli != entry->tli)
+            if (XLogRecPtrIsInvalid(entry->begin))
                 ereport(ERROR,
                         errmsg("expected start timeline %u but found timeline %u",
                                starttli, entry->tli));
diff --git a/src/bin/pg_verifybackup/t/007_wal.pl b/src/bin/pg_verifybackup/t/007_wal.pl
index adf60fb755..510982d4de 100644
--- a/src/bin/pg_verifybackup/t/007_wal.pl
+++ b/src/bin/pg_verifybackup/t/007_wal.pl
@@ -10,7 +10,7 @@ use Config;
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 9;
 
 # Start up the server and take a backup.
 my $primary = PostgresNode->new('primary');
@@ -59,3 +59,19 @@ command_fails_like(
     [ 'pg_verifybackup', $backup_path ],
     qr/WAL parsing failed for timeline 1/,
     'corrupt WAL file causes failure');
+
+#
+# Check that WAL-Ranges has the correct values when TLI > 1
+$primary->stop;
+$primary->append_conf('standby.signal');
+$primary->start;
+$primary->promote;
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+my $backup_path2 = $primary->backup_dir . '/test_tli';
+# base-backup below runs a checkpoint, which removes the first segment
+# of the current timeline
+$primary->command_ok([ 'pg_basebackup', '-D', $backup_path2, '--no-sync' ],
+    "base backup 2 ok");
+# Should pass this case
+command_ok([ 'pg_verifybackup', $backup_path2 ],
+    'verifybackup should pass for a valid backup on timeline > 1');
-- 
2.27.0

From a309bff820b9c3a1cf178388eef4bf229888cb1e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 18 Aug 2021 13:58:15 +0900
Subject: [PATCH v2] Fix basebackup to generate correct WAL-Ranges info.

While WAL-Ranges in backup_manifest is supposed to have the WAL range
to apply at recovery. it starts from the beginning of the oldest
timeline if the starting TLI > 1.  As the result pg_verifybackup may
result in a false failure for a valid backup. Adjust the logic to make
the WAL-Ranges starts with the actual start LSN.

Backpatch to PG13 where this feature has been introduced.
---
 src/backend/replication/backup_manifest.c | 17 ++++++++++++-----
 src/bin/pg_verifybackup/t/007_wal.pl      | 18 +++++++++++++++++-
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c
index 1e07d99e64..a672690c7d 100644
--- a/src/backend/replication/backup_manifest.c
+++ b/src/backend/replication/backup_manifest.c
@@ -236,11 +236,18 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
                     errmsg("expected end timeline %u but found timeline %u",
                            starttli, entry->tli));
 
-        if (!XLogRecPtrIsInvalid(entry->begin))
-            tl_beginptr = entry->begin;
-        else
-        {
+        /*
+         * If this timeline entry matches with the timeline on which the backup
+         * started, WAL needs to be checked from the start LSN of the backup.
+         * If this entry refers to a newer timeline, WAL needs to be checked
+         * since the beginning of this timeline, so use the LSN where the
+         * timeline began.
+         */
+        if (starttli == entry->tli)
             tl_beginptr = startptr;
+        else
+        {
+            tl_beginptr = entry->begin;
 
             /*
              * If we reach a TLI that has no valid beginning LSN, there can't
@@ -248,7 +255,7 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
              * better have arrived at the expected starting TLI. If not,
              * something's gone horribly wrong.
              */
-            if (starttli != entry->tli)
+            if (!XLogRecPtrIsInvalid(entry->begin))
                 ereport(ERROR,
                         errmsg("expected start timeline %u but found timeline %u",
                                starttli, entry->tli));
diff --git a/src/bin/pg_verifybackup/t/007_wal.pl b/src/bin/pg_verifybackup/t/007_wal.pl
index 56d536675c..ae35e035eb 100644
--- a/src/bin/pg_verifybackup/t/007_wal.pl
+++ b/src/bin/pg_verifybackup/t/007_wal.pl
@@ -7,7 +7,7 @@ use Config;
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 9;
 
 # Start up the server and take a backup.
 my $master = get_new_node('master');
@@ -56,3 +56,19 @@ command_fails_like(
     [ 'pg_verifybackup', $backup_path ],
     qr/WAL parsing failed for timeline 1/,
     'corrupt WAL file causes failure');
+
+#
+# Check that WAL-Ranges has the correct values when TLI > 1
+$master->stop;
+$master->append_conf('standby.signal');
+$master->start;
+$master->promote;
+$master->safe_psql('postgres', 'SELECT pg_switch_wal()');
+my $backup_path2 = $master->backup_dir . '/test_tli';
+# base-backup below runs a checkpoint, which removes the first segment
+# of the current timeline
+$master->command_ok([ 'pg_basebackup', '-D', $backup_path2, '--no-sync' ],
+    "base backup 2 ok");
+# Should pass this case
+command_ok([ 'pg_verifybackup', $backup_path2 ],
+    'verifybackup should pass for a valid backup on timeline > 1');
-- 
2.27.0


Re: pg_veryfybackup can fail with a valid backup for TLI > 1

From
Michael Paquier
Date:
On Fri, Aug 20, 2021 at 03:33:37PM +0900, Kyotaro Horiguchi wrote:
> That's looks like a domino falling. I had the following result, which
> looks fine.
>
> "WAL-Ranges": [
> { "Timeline": 6, "Start-LSN": "0/630C7E8", "End-LSN": "0/630C850" },
> { "Timeline": 5, "Start-LSN": "0/630C780", "End-LSN": "0/630C7E8" },
> { "Timeline": 4, "Start-LSN": "0/630C718", "End-LSN": "0/630C780" },
> { "Timeline": 3, "Start-LSN": "0/630C6B0", "End-LSN": "0/630C718" },
> { "Timeline": 2, "Start-LSN": "0/5000028", "End-LSN": "0/630C6B0" }
> ],
>
> 00000006.history:
> 1    0/173F268    no recovery target specified
> 2    0/630C6B0    no recovery target specified
> 3    0/630C718    no recovery target specified
> 4    0/630C780    no recovery target specified
> 5    0/630C7E8    no recovery target specified

And your backup_label shows 0/5000028 as start LSN for the backup,
right?  I see the same result.
--
Michael

Attachment

Re: pg_veryfybackup can fail with a valid backup for TLI > 1

From
Kyotaro Horiguchi
Date:
At Fri, 20 Aug 2021 16:23:56 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Fri, Aug 20, 2021 at 03:33:37PM +0900, Kyotaro Horiguchi wrote:
> > That's looks like a domino falling. I had the following result, which
> > looks fine.
> > 
> > "WAL-Ranges": [
> > { "Timeline": 6, "Start-LSN": "0/630C7E8", "End-LSN": "0/630C850" },
> > { "Timeline": 5, "Start-LSN": "0/630C780", "End-LSN": "0/630C7E8" },
> > { "Timeline": 4, "Start-LSN": "0/630C718", "End-LSN": "0/630C780" },
> > { "Timeline": 3, "Start-LSN": "0/630C6B0", "End-LSN": "0/630C718" },
> > { "Timeline": 2, "Start-LSN": "0/5000028", "End-LSN": "0/630C6B0" }
> > ],
> > 
> > 00000006.history:
> > 1    0/173F268    no recovery target specified
> > 2    0/630C6B0    no recovery target specified
> > 3    0/630C718    no recovery target specified
> > 4    0/630C780    no recovery target specified
> > 5    0/630C7E8    no recovery target specified
> 
> And your backup_label shows 0/5000028 as start LSN for the backup,
> right?  I see the same result.

Yes, backup_label looks correct.

backup_label (extract):
START WAL LOCATION: 0/5000028 (file 000000020000000000000005)
CHECKPOINT LOCATION: 0/5000060
START TIMELINE: 2

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg_veryfybackup can fail with a valid backup for TLI > 1

From
Michael Paquier
Date:
On Fri, Aug 20, 2021 at 04:47:15PM +0900, Kyotaro Horiguchi wrote:
> Yes, backup_label looks correct.
>
> backup_label (extract):
> START WAL LOCATION: 0/5000028 (file 000000020000000000000005)
> CHECKPOINT LOCATION: 0/5000060
> START TIMELINE: 2

Okay.  I have worked on that today, did more manual tests, and applied
this fix down to 13.  The cherry-pick from 14 to 13 only required a
s/$master/$primary/ in the test, which was straight-forward.  Your
patch for 13 did that though:
-                       if (starttli != entry->tli)
+                       if (!XLogRecPtrIsInvalid(entry->begin))

So it would have caused a failure with parent TLIs that have a correct
begin location, but we expect the opposite.  The patch for 14/HEAD had
that right.
--
Michael

Attachment

Re: pg_veryfybackup can fail with a valid backup for TLI > 1

From
Kyotaro Horiguchi
Date:
At Mon, 23 Aug 2021 15:46:37 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Fri, Aug 20, 2021 at 04:47:15PM +0900, Kyotaro Horiguchi wrote:
> > Yes, backup_label looks correct.
> > 
> > backup_label (extract):
> > START WAL LOCATION: 0/5000028 (file 000000020000000000000005)
> > CHECKPOINT LOCATION: 0/5000060
> > START TIMELINE: 2
> 
> Okay.  I have worked on that today, did more manual tests, and applied
> this fix down to 13.  The cherry-pick from 14 to 13 only required a
> s/$master/$primary/ in the test, which was straight-forward.  Your
> patch for 13 did that though:
> -                       if (starttli != entry->tli)
> +                       if (!XLogRecPtrIsInvalid(entry->begin))
> 
> So it would have caused a failure with parent TLIs that have a correct
> begin location, but we expect the opposite.  The patch for 14/HEAD had
> that right.

Mmm. Sorry for the silly mistake, and thanks for commiting this!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center