Re: A failure of standby to follow timeline switch - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: A failure of standby to follow timeline switch
Date
Msg-id 20210104.120610.767727791822348683.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: A failure of standby to follow timeline switch  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: A failure of standby to follow timeline switch  (Fujii Masao <masao.fujii@oss.nttdata.com>)
List pgsql-hackers
At Sat, 26 Dec 2020 02:15:06 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2020/12/25 12:03, Kyotaro Horiguchi wrote:
> > Thank you for looking this.
> > At Thu, 24 Dec 2020 15:33:04 +0900, Fujii Masao
> > <masao.fujii@oss.nttdata.com> wrote in
> >> When I applied two patches in the master branch and
> >> ran "make check-world", I got the following error.
> >>
> >> ============== creating database "contrib_regression" ==============
> >> # Looks like you planned 37 tests but ran 36.
> >> # Looks like your test exited with 255 just after 36.
> >> t/001_stream_rep.pl ..................
> >> Dubious, test returned 255 (wstat 65280, 0xff00)
> >> Failed 1/37 subtests
> >> ...
> >> Test Summary Report
> >> -------------------
> >> t/001_stream_rep.pl                (Wstat: 65280 Tests: 36 Failed: 0)
> >>    Non-zero exit status: 255
> >>    Parse errors: Bad plan.  You planned 37 tests but ran 36.
> >> Files=21, Tests=239, 302 wallclock secs ( 0.10 usr 0.05 sys + 41.69
> >> cusr 39.84 csys = 81.68 CPU)
> >> Result: FAIL
> >> make[2]: *** [check] Error 1
> >> make[1]: *** [check-recovery-recurse] Error 2
> >> make[1]: *** Waiting for unfinished jobs....
> >> t/070_dropuser.pl ......... ok
> > Mmm. I retried that and saw it succeed (with 0002 applied).
> > If I modified "user Test::More tests => 37" to 38 in the perl file, I
> > got a similar result.
> 
> What happens if you run make check-world with -j 4? When I ran that,
> the test failed. But with -j 1, the test finished with success. I'm
> not sure
> why this happened, though..

Maybe this is it.

+    usleep(100_000);

If the script doesn't find the expected log line, it reaches the
usleep and bark that "Undefined subroutine &main::usleep called...". I
thought I tested that path but perhaps I overlooked the error. "use
Time::HiRes" is needed.

The attached is the fixed version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 9853edb792f81953567b68da0b79baa57a8c518c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 9 Dec 2020 17:21:49 +0900
Subject: [PATCH v2 1/2] Add a new test to detect a replication bug

---
 src/test/recovery/t/001_stream_rep.pl | 65 ++++++++++++++++++++++++++-
 1 file changed, 64 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 9e31a53de7..13b6e00da4 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -2,8 +2,9 @@
 use strict;
 use warnings;
 use PostgresNode;
+use Time::HiRes qw(usleep);
 use TestLib;
-use Test::More tests => 36;
+use Test::More tests => 37;
 
 # Initialize primary node
 my $node_primary = get_new_node('primary');
@@ -409,3 +410,65 @@ ok( ($phys_restart_lsn_pre cmp $phys_restart_lsn_post) == 0,
 my $primary_data = $node_primary->data_dir;
 ok(!-f "$primary_data/pg_wal/$segment_removed",
     "WAL segment $segment_removed recycled after physical slot advancing");
+
+#
+# Check if timeline-increment works while reading a historic timeline.
+my $node_primary_2 = get_new_node('primary_2');
+$node_primary_2->init(allows_streaming => 1);
+$node_primary_2->enable_archiving; # needed to make .paritial segment
+$node_primary_2->start;
+$node_primary_2->backup($backup_name);
+my $node_standby_3 = get_new_node('standby_3');
+$node_standby_3->init_from_backup($node_primary_2, $backup_name,
+                                  has_streaming => 1);
+$node_primary_2->stop;
+$node_primary_2->set_standby_mode; # increment primary timeline
+$node_primary_2->start;
+$node_primary_2->promote;
+my $logstart = get_log_size($node_standby_3);
+$node_standby_3->start;
+
+my $success = 0;
+for (my $i = 0 ; $i < 1000; $i++)
+{
+    if (find_in_log(
+            $node_standby_3,
+            "requested WAL segment [0-9A-F]+ has already been removed",
+            $logstart))
+    {
+        last;
+    }
+    elsif (find_in_log(
+            $node_standby_3,
+            "End of WAL reached on timeline",
+               $logstart))
+    {
+        $success = 1;
+        last;
+    }
+    usleep(100_000);
+}
+
+ok($success, 'Timeline increment while reading a historic timeline');
+
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+    my ($node) = @_;
+
+    return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+    my ($node, $pat, $off) = @_;
+
+    $off = 0 unless defined $off;
+    my $log = TestLib::slurp_file($node->logfile);
+    return 0 if (length($log) <= $off);
+
+    $log = substr($log, $off);
+
+    return $log =~ m/$pat/;
+}
-- 
2.27.0

From 6c70a07de75b78830c296819824f2bc4ecfb4677 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 9 Dec 2020 17:22:41 +0900
Subject: [PATCH v2 2/2] Fix a bug of timeline-tracking while sending a
 historic timeline

Since PG13, we should track a timeline switch while sending a historic
timeline running physical replication, but WalSndSegmentOpen fails to
do that. It is a thinko of 709d003fbd.
---
 src/backend/replication/walsender.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fe0d368a35..8545c6c423 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2491,7 +2491,7 @@ WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
         XLogSegNo    endSegNo;
 
         XLByteToSeg(sendTimeLineValidUpto, endSegNo, state->segcxt.ws_segsize);
-        if (state->seg.ws_segno == endSegNo)
+        if (nextSegNo == endSegNo)
             *tli_p = sendTimeLineNextTLI;
     }
 
-- 
2.27.0


pgsql-hackers by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: zstd compression for pg_dump
Next
From: "Tang, Haiying"
Date:
Subject: RE: [Patch] Optimize dropping of relation buffers using dlist