Re: A failure of standby to follow timeline switch - Mailing list pgsql-hackers
From | Kyotaro Horiguchi |
---|---|
Subject | Re: A failure of standby to follow timeline switch |
Date | |
Msg-id | 20210104.120610.767727791822348683.horikyota.ntt@gmail.com Whole thread Raw |
In response to | Re: A failure of standby to follow timeline switch (Fujii Masao <masao.fujii@oss.nttdata.com>) |
Responses |
Re: A failure of standby to follow timeline switch
|
List | pgsql-hackers |
At Sat, 26 Dec 2020 02:15:06 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2020/12/25 12:03, Kyotaro Horiguchi wrote: > > Thank you for looking this. > > At Thu, 24 Dec 2020 15:33:04 +0900, Fujii Masao > > <masao.fujii@oss.nttdata.com> wrote in > >> When I applied two patches in the master branch and > >> ran "make check-world", I got the following error. > >> > >> ============== creating database "contrib_regression" ============== > >> # Looks like you planned 37 tests but ran 36. > >> # Looks like your test exited with 255 just after 36. > >> t/001_stream_rep.pl .................. > >> Dubious, test returned 255 (wstat 65280, 0xff00) > >> Failed 1/37 subtests > >> ... > >> Test Summary Report > >> ------------------- > >> t/001_stream_rep.pl (Wstat: 65280 Tests: 36 Failed: 0) > >> Non-zero exit status: 255 > >> Parse errors: Bad plan. You planned 37 tests but ran 36. > >> Files=21, Tests=239, 302 wallclock secs ( 0.10 usr 0.05 sys + 41.69 > >> cusr 39.84 csys = 81.68 CPU) > >> Result: FAIL > >> make[2]: *** [check] Error 1 > >> make[1]: *** [check-recovery-recurse] Error 2 > >> make[1]: *** Waiting for unfinished jobs.... > >> t/070_dropuser.pl ......... ok > > Mmm. I retried that and saw it succeed (with 0002 applied). > > If I modified "user Test::More tests => 37" to 38 in the perl file, I > > got a similar result. > > What happens if you run make check-world with -j 4? When I ran that, > the test failed. But with -j 1, the test finished with success. I'm > not sure > why this happened, though.. Maybe this is it. + usleep(100_000); If the script doesn't find the expected log line, it reaches the usleep and bark that "Undefined subroutine &main::usleep called...". I thought I tested that path but perhaps I overlooked the error. "use Time::HiRes" is needed. The attached is the fixed version. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 9853edb792f81953567b68da0b79baa57a8c518c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 9 Dec 2020 17:21:49 +0900 Subject: [PATCH v2 1/2] Add a new test to detect a replication bug --- src/test/recovery/t/001_stream_rep.pl | 65 ++++++++++++++++++++++++++- 1 file changed, 64 insertions(+), 1 deletion(-) diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl index 9e31a53de7..13b6e00da4 100644 --- a/src/test/recovery/t/001_stream_rep.pl +++ b/src/test/recovery/t/001_stream_rep.pl @@ -2,8 +2,9 @@ use strict; use warnings; use PostgresNode; +use Time::HiRes qw(usleep); use TestLib; -use Test::More tests => 36; +use Test::More tests => 37; # Initialize primary node my $node_primary = get_new_node('primary'); @@ -409,3 +410,65 @@ ok( ($phys_restart_lsn_pre cmp $phys_restart_lsn_post) == 0, my $primary_data = $node_primary->data_dir; ok(!-f "$primary_data/pg_wal/$segment_removed", "WAL segment $segment_removed recycled after physical slot advancing"); + +# +# Check if timeline-increment works while reading a historic timeline. +my $node_primary_2 = get_new_node('primary_2'); +$node_primary_2->init(allows_streaming => 1); +$node_primary_2->enable_archiving; # needed to make .paritial segment +$node_primary_2->start; +$node_primary_2->backup($backup_name); +my $node_standby_3 = get_new_node('standby_3'); +$node_standby_3->init_from_backup($node_primary_2, $backup_name, + has_streaming => 1); +$node_primary_2->stop; +$node_primary_2->set_standby_mode; # increment primary timeline +$node_primary_2->start; +$node_primary_2->promote; +my $logstart = get_log_size($node_standby_3); +$node_standby_3->start; + +my $success = 0; +for (my $i = 0 ; $i < 1000; $i++) +{ + if (find_in_log( + $node_standby_3, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + last; + } + elsif (find_in_log( + $node_standby_3, + "End of WAL reached on timeline", + $logstart)) + { + $success = 1; + last; + } + usleep(100_000); +} + +ok($success, 'Timeline increment while reading a historic timeline'); + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.27.0 From 6c70a07de75b78830c296819824f2bc4ecfb4677 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 9 Dec 2020 17:22:41 +0900 Subject: [PATCH v2 2/2] Fix a bug of timeline-tracking while sending a historic timeline Since PG13, we should track a timeline switch while sending a historic timeline running physical replication, but WalSndSegmentOpen fails to do that. It is a thinko of 709d003fbd. --- src/backend/replication/walsender.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index fe0d368a35..8545c6c423 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -2491,7 +2491,7 @@ WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo, XLogSegNo endSegNo; XLByteToSeg(sendTimeLineValidUpto, endSegNo, state->segcxt.ws_segsize); - if (state->seg.ws_segno == endSegNo) + if (nextSegNo == endSegNo) *tli_p = sendTimeLineNextTLI; } -- 2.27.0
pgsql-hackers by date: