Re: Crash by targetted recovery - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Crash by targetted recovery
Date
Msg-id 20200310.145900.21675815166709370.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: Crash by targetted recovery  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: Crash by targetted recovery
List pgsql-hackers
At Tue, 10 Mar 2020 10:50:52 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> Pushed the v5-0001-Tidy-up-XLogSource-usage.patch!

Thanks!

> Regarding the remaining patch adding the regression test,

I didn't seriously inteneded it to be in the tree.

> +$result =
> + $node_standby->safe_psql('postgres', "SELECT
> pg_last_wal_replay_lsn()");
> +my ($seg, $off) = split('/', $result);
> +my $target = sprintf("$seg/%08X", (hex($off) / $segsize + 1) *
> $segsize);
> 
> What happens if "off" part gets the upper limit and "seg" part needs
> to be incremented? What happens if pg_last_wal_replay_lsn() advances
> very much (e.g., because of autovacuum) beyond the segment boundary
> until the standby restarts? Of course, these situations very rarely
> happen,
> but I'd like to avoid adding such not stable test if possible.

In the first place the "seg" is "fileno". Honestly I don't think the
test doesn't reach to fileno boundary but I did in the attached. Since
perl complains over-32bit integer arithmetic as incomptible, the
calculation gets a bit odd shape to avoid over-32bit arithmetic.

For the second point, which seems more likely to happen, I added the
VACUUM/pg_switch_wal() sequence then wait standby for catch up, before
doing the test.

Does it make sense?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 89621911e09df73f0f3a63501bb41ffb37edef05 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 10 Mar 2020 14:43:29 +0900
Subject: [PATCH v5] TAP test for a crash bug

Add a test for targetted promotion happend at segment boundary.
---
 src/test/recovery/t/003_recovery_targets.pl | 50 ++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index fd14bab208..6b319bc891 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 9;
+use Test::More tests => 10;
 use Time::HiRes qw(usleep);
 
 # Create and test a standby from given backup, with a certain recovery target.
@@ -167,3 +167,51 @@ foreach my $i (0..100)
 $logfile = slurp_file($node_standby->logfile());
 ok($logfile =~ qr/FATAL:  recovery ended before configured recovery target was reached/,
     'recovery end before target reached is a fatal error');
+
+# Corner case where targetted promotion happens on segment boundary
+$node_standby = get_new_node('standby_9');
+$node_standby->init_from_backup($node_master, 'my_backup',
+                                has_restoring => 1, has_streaming => 1);
+$node_standby->start;
+## make sure replication stays at the beginning of a segment
+$node_master->safe_psql('postgres', "VACUUM; SELECT pg_switch_wal();");
+my $catch_up_lsn =
+  $node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+my $caughtup_query =
+  "SELECT '$catch_up_lsn'::pg_lsn <= pg_last_wal_replay_lsn()";
+$node_standby->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for standby to catch up";
+## calculate the next segment boundary
+my $result =
+  $node_standby->safe_psql('postgres', "SHOW wal_segment_size");
+die "unknown format of wal_segment_size: $result\n"
+  if ($result !~ /^([0-9]+)MB$/);
+my $segsize = $1 * 1024 * 1024;
+$result =
+  $node_standby->safe_psql('postgres', "SELECT pg_last_wal_replay_lsn()");
+my ($fileno, $off) = split('/', $result);
+my ($newoff, $newfileno) = (hex($off), hex($fileno));
+my $newsegnum = int($newoff / $segsize) + 1;
+## treat segment-overflow, avoiding over-32bit arithmetic, segsize is
+## a power of 2 larger than 1MB.
+if ($newsegnum * ($segsize >> 1) >= 0x80000000)
+{
+    $newfileno += ($newsegnum * ($segsize >> 1)) >> 31;
+    $newsegnum %= (0x80000000 / ($segsize >> 1));
+}
+my $target = sprintf("%X/%08X", $newfileno, $newsegnum * $segsize);
+## the standby stops just after the next segment boundary
+$node_standby->stop;
+$node_standby->append_conf('postgresql.conf', qq(
+recovery_target_inclusive=no
+recovery_target_lsn='$target'
+recovery_target_action='promote'
+));
+$node_standby->start;
+## do targetted promote
+$node_master->safe_psql('postgres', "CREATE TABLE t(); DROP TABLE t;");
+$node_master->safe_psql('postgres', "SELECT pg_switch_wal(); CHECKPOINT;");
+$caughtup_query = "SELECT NOT pg_is_in_recovery()";
+$node_standby->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for standby to promote";
+pass("targetted promotion on segment bounary");
-- 
2.18.2


pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: logical copy_replication_slot issues
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: shared-memory based stats collector