pg_veryfybackup can fail with a valid backup for TLI > 1 - Mailing list pgsql-hackers
From | Kyotaro Horiguchi |
---|---|
Subject | pg_veryfybackup can fail with a valid backup for TLI > 1 |
Date | |
Msg-id | 20210818.143031.1867083699202617521.horikyota.ntt@gmail.com Whole thread Raw |
Responses |
Re: pg_veryfybackup can fail with a valid backup for TLI > 1
|
List | pgsql-hackers |
Hello. pg_veryfybackup can fail with a valid backup when the backup was taken from TLI > 1. ===== # initdb $ pg_ctl start (just to make sure) $ pg_ctl stop $ touch $PGDATA/standby.signal $ pg_ctl start $ pg_ctl promote $ psql postgres=# select pg_switch_wal(); pg_switch_wal --------------- 0/14FE340 (1 row) postgres=# checkpoint; postgres=# ^D $ pg_basebackup -Fp -h /tmp -D tmpbk $ pg_veryfybackup tmpbk (thinking.. thiking.. zzz.. for several seconds) pg_waldump: fatal: could not find file "000000020000000000000001": No such file or directory pg_verifybackup: error: WAL parsing failed for timeline 2 ===== This is bacause the backup_manifest has the wrong values as below. > "WAL-Ranges": [ > { "Timeline": 2, "Start-LSN": "0/14FE248", "End-LSN": "0/3000100" } > ], The "Start-LSN" above is the beginning of timeline 2, not the backup-start LSN. The segment had been removed by the checkpoint. The comment for AddWALInfoToBackupManifest() says: > * Add information about the WAL that will need to be replayed when restoring > * this backup to the manifest. So I concluded that it's a thinko. Please see the attached. It needs to be back-patched to 13 but 13 doesn't accept the patch as is due to wording chages in TAP tests. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From c9c299091ccae694a9c1d58c616407e55be53f6f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Wed, 18 Aug 2021 11:43:53 +0900 Subject: [PATCH v1] Fix basebackup to generate correct WAL-Ranges info. While WAL-Ranges in backup_manifest is supposed to have the WAL range to apply at recovery. it actually starts from the beginning of the oldest timeline if the starting TLI > 1. As the result pg_verifybackup may result in a false failure for a valid backup. Adjust the logic to make WAL-Ranges start with the actual start LSN. Backpatch to PG13 where this feature has been introduced. --- src/backend/replication/backup_manifest.c | 10 +++++----- src/bin/pg_verifybackup/t/007_wal.pl | 18 +++++++++++++++++- 2 files changed, 22 insertions(+), 6 deletions(-) diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c index 8882444025..d7a8584124 100644 --- a/src/backend/replication/backup_manifest.c +++ b/src/backend/replication/backup_manifest.c @@ -255,11 +255,11 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr, errmsg("expected end timeline %u but found timeline %u", starttli, entry->tli)); - if (!XLogRecPtrIsInvalid(entry->begin)) - tl_beginptr = entry->begin; - else - { + if (starttli == entry->tli) tl_beginptr = startptr; + else + { + tl_beginptr = entry->begin; /* * If we reach a TLI that has no valid beginning LSN, there can't @@ -267,7 +267,7 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr, * better have arrived at the expected starting TLI. If not, * something's gone horribly wrong. */ - if (starttli != entry->tli) + if (XLogRecPtrIsInvalid(entry->begin)) ereport(ERROR, errmsg("expected start timeline %u but found timeline %u", starttli, entry->tli)); diff --git a/src/bin/pg_verifybackup/t/007_wal.pl b/src/bin/pg_verifybackup/t/007_wal.pl index adf60fb755..bc52f0d1d9 100644 --- a/src/bin/pg_verifybackup/t/007_wal.pl +++ b/src/bin/pg_verifybackup/t/007_wal.pl @@ -10,7 +10,7 @@ use Config; use File::Path qw(rmtree); use PostgresNode; use TestLib; -use Test::More tests => 7; +use Test::More tests => 9; # Start up the server and take a backup. my $primary = PostgresNode->new('primary'); @@ -59,3 +59,19 @@ command_fails_like( [ 'pg_verifybackup', $backup_path ], qr/WAL parsing failed for timeline 1/, 'corrupt WAL file causes failure'); + +# +# Check that WAL-Ranges has the correct values when TLI > 1 +$primary->stop; +$primary->append_conf('standby.signal'); +$primary->start; +$primary->promote; +# base-backup runs a checkpoint but just to make sure. +$primary->safe_psql('postgres', 'SELECT pg_switch_wal(); CHECKPOINT;'); + +my $backup_path2 = $primary->backup_dir . '/test_tli'; +$primary->command_ok([ 'pg_basebackup', '-D', $backup_path2, '--no-sync' ], + "base backup 2 ok"); +# Should pass this case +command_ok([ 'pg_verifybackup', $backup_path2 ], + 'verifybackup should pass for a valid backup on timeline > 1'); -- 2.27.0 From 59371c3a1fd6bb504362b00fc3511a791f7d1fb4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Wed, 18 Aug 2021 13:58:15 +0900 Subject: [PATCH v1] Fix basebackup to generate correct WAL-Ranges info. While WAL-Ranges in backup_manifest is supposed to have the WAL range to apply at recovery. it starts from the beginning of the oldest timeline if the starting TLI > 1. As the result pg_verifybackup may result in a false failure for a valid backup. Adjust the logic to make the WAL-Ranges starts with the actual start LSN. Backpatch to PG13 where this feature has been introduced. --- src/backend/replication/backup_manifest.c | 10 +++++----- src/bin/pg_verifybackup/t/007_wal.pl | 18 +++++++++++++++++- 2 files changed, 22 insertions(+), 6 deletions(-) diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c index 1e07d99e64..14edda5cf4 100644 --- a/src/backend/replication/backup_manifest.c +++ b/src/backend/replication/backup_manifest.c @@ -236,11 +236,11 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr, errmsg("expected end timeline %u but found timeline %u", starttli, entry->tli)); - if (!XLogRecPtrIsInvalid(entry->begin)) - tl_beginptr = entry->begin; - else - { + if (starttli == entry->tli) tl_beginptr = startptr; + else + { + tl_beginptr = entry->begin; /* * If we reach a TLI that has no valid beginning LSN, there can't @@ -248,7 +248,7 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr, * better have arrived at the expected starting TLI. If not, * something's gone horribly wrong. */ - if (starttli != entry->tli) + if (!XLogRecPtrIsInvalid(entry->begin)) ereport(ERROR, errmsg("expected start timeline %u but found timeline %u", starttli, entry->tli)); diff --git a/src/bin/pg_verifybackup/t/007_wal.pl b/src/bin/pg_verifybackup/t/007_wal.pl index 56d536675c..b47ddf20e1 100644 --- a/src/bin/pg_verifybackup/t/007_wal.pl +++ b/src/bin/pg_verifybackup/t/007_wal.pl @@ -7,7 +7,7 @@ use Config; use File::Path qw(rmtree); use PostgresNode; use TestLib; -use Test::More tests => 7; +use Test::More tests => 9; # Start up the server and take a backup. my $master = get_new_node('master'); @@ -56,3 +56,19 @@ command_fails_like( [ 'pg_verifybackup', $backup_path ], qr/WAL parsing failed for timeline 1/, 'corrupt WAL file causes failure'); + +# +# Check that WAL-Ranges has the correct values when TLI > 1 +$master->stop; +$master->append_conf('standby.signal'); +$master->start; +$master->promote; +# base-backup runs a checkpoint but just to make sure. +$master->safe_psql('postgres', 'SELECT pg_switch_wal(); CHECKPOINT;'); + +my $backup_path2 = $master->backup_dir . '/test_tli'; +$master->command_ok([ 'pg_basebackup', '-D', $backup_path2, '--no-sync' ], + "base backup 2 ok"); +# Should pass this case +command_ok([ 'pg_verifybackup', $backup_path2 ], + 'verifybackup should pass for a valid backup on timeline > 1'); -- 2.27.0
pgsql-hackers by date: