Re: A failure of standby to follow timeline switch - Mailing list pgsql-hackers
| From | Kyotaro Horiguchi |
|---|---|
| Subject | Re: A failure of standby to follow timeline switch |
| Date | |
| Msg-id | 20210104.120610.767727791822348683.horikyota.ntt@gmail.com Whole thread Raw |
| In response to | Re: A failure of standby to follow timeline switch (Fujii Masao <masao.fujii@oss.nttdata.com>) |
| Responses |
Re: A failure of standby to follow timeline switch
|
| List | pgsql-hackers |
At Sat, 26 Dec 2020 02:15:06 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
>
>
> On 2020/12/25 12:03, Kyotaro Horiguchi wrote:
> > Thank you for looking this.
> > At Thu, 24 Dec 2020 15:33:04 +0900, Fujii Masao
> > <masao.fujii@oss.nttdata.com> wrote in
> >> When I applied two patches in the master branch and
> >> ran "make check-world", I got the following error.
> >>
> >> ============== creating database "contrib_regression" ==============
> >> # Looks like you planned 37 tests but ran 36.
> >> # Looks like your test exited with 255 just after 36.
> >> t/001_stream_rep.pl ..................
> >> Dubious, test returned 255 (wstat 65280, 0xff00)
> >> Failed 1/37 subtests
> >> ...
> >> Test Summary Report
> >> -------------------
> >> t/001_stream_rep.pl (Wstat: 65280 Tests: 36 Failed: 0)
> >> Non-zero exit status: 255
> >> Parse errors: Bad plan. You planned 37 tests but ran 36.
> >> Files=21, Tests=239, 302 wallclock secs ( 0.10 usr 0.05 sys + 41.69
> >> cusr 39.84 csys = 81.68 CPU)
> >> Result: FAIL
> >> make[2]: *** [check] Error 1
> >> make[1]: *** [check-recovery-recurse] Error 2
> >> make[1]: *** Waiting for unfinished jobs....
> >> t/070_dropuser.pl ......... ok
> > Mmm. I retried that and saw it succeed (with 0002 applied).
> > If I modified "user Test::More tests => 37" to 38 in the perl file, I
> > got a similar result.
>
> What happens if you run make check-world with -j 4? When I ran that,
> the test failed. But with -j 1, the test finished with success. I'm
> not sure
> why this happened, though..
Maybe this is it.
+ usleep(100_000);
If the script doesn't find the expected log line, it reaches the
usleep and bark that "Undefined subroutine &main::usleep called...". I
thought I tested that path but perhaps I overlooked the error. "use
Time::HiRes" is needed.
The attached is the fixed version.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From 9853edb792f81953567b68da0b79baa57a8c518c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 9 Dec 2020 17:21:49 +0900
Subject: [PATCH v2 1/2] Add a new test to detect a replication bug
---
src/test/recovery/t/001_stream_rep.pl | 65 ++++++++++++++++++++++++++-
1 file changed, 64 insertions(+), 1 deletion(-)
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 9e31a53de7..13b6e00da4 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -2,8 +2,9 @@
use strict;
use warnings;
use PostgresNode;
+use Time::HiRes qw(usleep);
use TestLib;
-use Test::More tests => 36;
+use Test::More tests => 37;
# Initialize primary node
my $node_primary = get_new_node('primary');
@@ -409,3 +410,65 @@ ok( ($phys_restart_lsn_pre cmp $phys_restart_lsn_post) == 0,
my $primary_data = $node_primary->data_dir;
ok(!-f "$primary_data/pg_wal/$segment_removed",
"WAL segment $segment_removed recycled after physical slot advancing");
+
+#
+# Check if timeline-increment works while reading a historic timeline.
+my $node_primary_2 = get_new_node('primary_2');
+$node_primary_2->init(allows_streaming => 1);
+$node_primary_2->enable_archiving; # needed to make .paritial segment
+$node_primary_2->start;
+$node_primary_2->backup($backup_name);
+my $node_standby_3 = get_new_node('standby_3');
+$node_standby_3->init_from_backup($node_primary_2, $backup_name,
+ has_streaming => 1);
+$node_primary_2->stop;
+$node_primary_2->set_standby_mode; # increment primary timeline
+$node_primary_2->start;
+$node_primary_2->promote;
+my $logstart = get_log_size($node_standby_3);
+$node_standby_3->start;
+
+my $success = 0;
+for (my $i = 0 ; $i < 1000; $i++)
+{
+ if (find_in_log(
+ $node_standby_3,
+ "requested WAL segment [0-9A-F]+ has already been removed",
+ $logstart))
+ {
+ last;
+ }
+ elsif (find_in_log(
+ $node_standby_3,
+ "End of WAL reached on timeline",
+ $logstart))
+ {
+ $success = 1;
+ last;
+ }
+ usleep(100_000);
+}
+
+ok($success, 'Timeline increment while reading a historic timeline');
+
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+ my ($node) = @_;
+
+ return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+ my ($node, $pat, $off) = @_;
+
+ $off = 0 unless defined $off;
+ my $log = TestLib::slurp_file($node->logfile);
+ return 0 if (length($log) <= $off);
+
+ $log = substr($log, $off);
+
+ return $log =~ m/$pat/;
+}
--
2.27.0
From 6c70a07de75b78830c296819824f2bc4ecfb4677 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 9 Dec 2020 17:22:41 +0900
Subject: [PATCH v2 2/2] Fix a bug of timeline-tracking while sending a
historic timeline
Since PG13, we should track a timeline switch while sending a historic
timeline running physical replication, but WalSndSegmentOpen fails to
do that. It is a thinko of 709d003fbd.
---
src/backend/replication/walsender.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fe0d368a35..8545c6c423 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2491,7 +2491,7 @@ WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
XLogSegNo endSegNo;
XLByteToSeg(sendTimeLineValidUpto, endSegNo, state->segcxt.ws_segsize);
- if (state->seg.ws_segno == endSegNo)
+ if (nextSegNo == endSegNo)
*tli_p = sendTimeLineNextTLI;
}
--
2.27.0
pgsql-hackers by date: