Crash by targetted recovery - Mailing list pgsql-hackers
From | Kyotaro Horiguchi |
---|---|
Subject | Crash by targetted recovery |
Date | |
Msg-id | 20200227.124830.2197604521555566121.horikyota.ntt@gmail.com |
Responses |
Re: Crash by targetted recovery
|
List | pgsql-hackers |
Hello.

We found that targeted promotion can cause an assertion failure. The
attached TAP test reproduces it.

> TRAP: FailedAssertion("StandbyMode", File: "xlog.c", Line: 12078)

After the recovery target is reached, StartupXLOG turns off standby
mode and then refetches the last record. If the last record starts in
the previous WAL segment, the assertion failure is triggered.

The problem is that StartupXLOG does a random-access fetch while
WaitForWALToBecomeAvailable still believes it is streaming. I think
that when it is called in random access mode,
WaitForWALToBecomeAvailable should move to XLOG_FROM_ARCHIVE even if
it believes it is still reading from the stream.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 8cb817a43226a0d60dd62c6205219a2f0c807b9e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 26 Feb 2020 20:41:11 +0900
Subject: [PATCH 1/2] TAP test for a crash bug

---
 src/test/recovery/t/003_recovery_targets.pl | 32 +++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index fd14bab208..8e71788981 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -167,3 +167,35 @@ foreach my $i (0..100)
 $logfile = slurp_file($node_standby->logfile());
 ok($logfile =~ qr/FATAL: recovery ended before configured recovery target was reached/,
     'recovery end before target reached is a fatal error');
+
+############
+# Edge case where targetted promotion happens on segment boundary
+$node_standby = get_new_node('standby_9');
+$node_standby->init_from_backup($node_master, 'my_backup',
+    has_restoring => 1, has_streaming => 1);
+$node_standby->start;
+## read wal_segment_size
+my $result =
+  $node_standby->safe_psql('postgres', "SHOW wal_segment_size");
+die "unknown format of wal_segment_size: $result\n"
+  if ($result !~ /^([0-9]+)MB$/);
+my $segsize = $1 * 1024 * 1024;
+## stop just before the next segment boundary
+$result =
+  $node_standby->safe_psql('postgres', "SELECT pg_last_wal_replay_lsn()");
+my ($seg, $off) = split('/', $result);
+my $target = sprintf("$seg/%08X", (hex($off) / $segsize + 1) * $segsize);
+
+$node_standby->stop;
+$node_standby->append_conf('postgresql.conf', qq(
+recovery_target_inclusive=no
+recovery_target_lsn='$target'
+recovery_target_action='promote'
+));
+$node_standby->start;
+## do targetted promote
+$node_master->safe_psql('postgres', "CREATE TABLE t(); DROP TABLE t;");
+$node_master->safe_psql('postgres', "SELECT pg_switch_wal(); CHECKPOINT;");
+my $caughtup_query = "SELECT NOT pg_is_in_recovery()";
+$node_standby->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for standby to promote";
-- 
2.18.2

From 424729eafeeab8c96074f4c8d5b87d3a23cbf73a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 26 Feb 2020 20:41:37 +0900
Subject: [PATCH 2/2] Fix a crash bug of targetted promotion.

After recovery target is reached, StartupXLOG turns off standby mode
then refetches the last record. If the last record starts from the
previous segment at the time, WaitForWALToBecomeAvailable crashes with
assertion failure. WaitForWALToBecomeAvailable should move back to
XLOG_FROM_ARCHIVE if random access is specified while streaming.
---
 src/backend/access/transam/xlog.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d19408b3be..41ed916342 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11831,7 +11831,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
      * 1. Read from either archive or pg_wal (XLOG_FROM_ARCHIVE), or just
      *    pg_wal (XLOG_FROM_PG_WAL)
      * 2. Check trigger file
-     * 3. Read from primary server via walreceiver (XLOG_FROM_STREAM)
+     * 3. Read from primary server via walreceiver (XLOG_FROM_STREAM).
+     *    Random access mode rewinds the state machine to 1.
      * 4. Rescan timelines
      * 5. Sleep wal_retrieve_retry_interval milliseconds, and loop back to 1.
      *
@@ -11846,7 +11847,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
      */
     if (!InArchiveRecovery)
         currentSource = XLOG_FROM_PG_WAL;
-    else if (currentSource == 0)
+    else if (currentSource == 0 || randAccess)
         currentSource = XLOG_FROM_ARCHIVE;
 
     for (;;)
-- 
2.18.2
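
The effect of the one-line change in PATCH 2/2 can be sketched with a small standalone model. This is only an illustration, not PostgreSQL code: the enum `WalSource` and the functions `initial_source` and `stream_assert_would_fail` are hypothetical names introduced here. It also assumes that the failing `Assert(StandbyMode)` sits in the streaming branch of the wait loop, and that archive recovery is still flagged as active when the last record is refetched, as the placement of the patched condition suggests.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the WAL source values named in the patch. */
typedef enum
{
    SRC_NONE = 0,   /* models "currentSource == 0": nothing chosen yet */
    SRC_ARCHIVE,    /* models XLOG_FROM_ARCHIVE */
    SRC_PG_WAL,     /* models XLOG_FROM_PG_WAL */
    SRC_STREAM      /* models XLOG_FROM_STREAM */
} WalSource;

/*
 * Hypothetical model of the source selection done before the wait loop.
 * patched == false models the old condition (currentSource == 0);
 * patched == true models the new one (currentSource == 0 || randAccess).
 */
WalSource
initial_source(WalSource current, bool in_archive_recovery,
               bool rand_access, bool patched)
{
    /* Sanity check on the model's own input. */
    assert(current >= SRC_NONE && current <= SRC_STREAM);

    if (!in_archive_recovery)
        return SRC_PG_WAL;
    if (current == SRC_NONE || (patched && rand_access))
        return SRC_ARCHIVE;

    /* Otherwise the loop resumes from wherever it left off. */
    return current;
}

/*
 * Hypothetical model of the reported failure: the streaming branch is
 * assumed to assert that standby mode is on, so entering it with
 * standby mode already turned off would trip the assertion.
 */
bool
stream_assert_would_fail(WalSource src, bool standby_mode)
{
    return src == SRC_STREAM && !standby_mode;
}
```

In this model, a random-access refetch that arrives with the source left at `SRC_STREAM` and standby mode off stays on `SRC_STREAM` under the old condition, which is the state the TRAP reports. Under the patched condition the same call returns `SRC_ARCHIVE`, so the streaming branch is not entered first.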