Re: A failure of standby to follow timeline switch - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: A failure of standby to follow timeline switch
Date
Msg-id 20210107.163236.776031941315368203.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: A failure of standby to follow timeline switch  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: A failure of standby to follow timeline switch  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers
At Thu, 7 Jan 2021 11:55:33 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2021/01/05 17:26, Kyotaro Horiguchi wrote:
> > At Mon, 4 Jan 2021 19:00:21 +0900, Fujii Masao <masao.fujii@oss.nttdata.com>
> > wrote in
> >>
> >>
> >> On 2021/01/04 12:06, Kyotaro Horiguchi wrote:
> >>> At Sat, 26 Dec 2020 02:15:06 +0900, Fujii Masao
> >>> <masao.fujii@oss.nttdata.com> wrote in
> >>>>
> >>>> On 2020/12/25 12:03, Kyotaro Horiguchi wrote:
> >>> The attached is the fixed version.
> >>
> >> Thanks for updating the patches!
> >>
> >>> In the first patch, the test added to 001_stream_rep.pl involves two
> >>> copied functions related to server-log investigation from
> >>> 019_repslot_limit.pl.
> >>
> >> So you're planning to define them commonly in TestLib.pm or elsewhere?
> > Yeah.. That's correct. Newly added as the first patch.
> > While making that change, I extended the interface of slurp_file to
> > allow reading from arbitrary position.
> 
> Is this extension really helpful for current use case?
> At least I'd like to avoid back-patching this since it's an exntesion...

Yeah, I felt a hesitattion about it a bit. It's less useful assuming
that log files won't get so large. Removed in this version.

>          OsFHandleOpen(my $fh = IO::Handle->new(), $fHandle, 'r')
>            or croak "could not read \"$filename\": $^E\n";
> +        seek($fh, $from, 0)
> +          or croak "could not seek \"$filename\" to $from: $^E\n";
> 
> I'm not familiar with this area, but SetFilePointer() is more suitable
> rather than seek()?

SetFilePointer() works for a native handle, IO::Handle->new()
here. seek() works on $fh, a perl handle.  If ReadFile is used later
SetFilePointer() might be needed separately.

Anyway, it is removed.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 39f284c2109b63e444eb8d437fe51ae3268d305b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 5 Jan 2021 13:34:36 +0900
Subject: [PATCH v4 1/3] Move TAP log-searching feature to common modules

Some private functions in 019_repslot_limit.pl will be used in other
tests. Move them to common modules so that they are available to the
new tests.
---
 src/test/perl/PostgresNode.pm             | 36 ++++++++++++++++++++++
 src/test/recovery/t/019_replslot_limit.pl | 37 ++++-------------------
 2 files changed, 42 insertions(+), 31 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9667f7667e..d78e9f93fb 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2223,6 +2223,42 @@ sub pg_recvlogical_upto
 
 =pod
 
+=item $node->current_log_position()
+
+Return the current position of server log.
+
+=cut
+
+sub current_log_position
+{
+    my $self = shift;
+
+    return (stat $self->logfile)[7];
+}
+
+=pod
+
+=item $node->find_in_log($pattern, $startpos)
+
+Returns whether the $pattern occurs after $startpos in the server log.
+
+=cut
+
+sub find_in_log
+{
+    my ($self, $pattern, $startpos) = @_;
+
+    $startpos = 0 unless defined $startpos;
+    my $log = TestLib::slurp_file($self->logfile);
+    return 0 if (length($log) <= $startpos);
+
+    $log = substr($log, $startpos);
+
+    return $log =~ m/$pattern/;
+}
+
+=pod
+
 =back
 
 =cut
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 20f4a1bbc3..b298d9992f 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -165,19 +165,17 @@ $node_primary->wait_for_catchup($node_standby, 'replay', $start_lsn);
 
 $node_standby->stop;
 
-ok( !find_in_log(
-        $node_standby,
-        "requested WAL segment [0-9A-F]+ has already been removed"),
+ok( !$node_standby->find_in_log(
+         "requested WAL segment [0-9A-F]+ has already been removed"),
     'check that required WAL segments are still available');
 
 # Advance WAL again, the slot loses the oldest segment.
-my $logstart = get_log_size($node_primary);
+my $logstart = $node_primary->current_log_position();
 advance_wal($node_primary, 7);
 $node_primary->safe_psql('postgres', "CHECKPOINT;");
 
 # WARNING should be issued
-ok( find_in_log(
-        $node_primary,
+ok( $node_primary->find_in_log(
         "invalidating slot \"rep1\" because its restart_lsn [0-9A-F/]+ exceeds max_slot_wal_keep_size",
         $logstart),
     'check that the warning is logged');
@@ -190,14 +188,13 @@ is($result, "rep1|f|t|lost|",
     'check that the slot became inactive and the state "lost" persists');
 
 # The standby no longer can connect to the primary
-$logstart = get_log_size($node_standby);
+$logstart = $node_standby->current_log_position();
 $node_standby->start;
 
 my $failed = 0;
 for (my $i = 0; $i < 10000; $i++)
 {
-    if (find_in_log(
-            $node_standby,
+    if ($node_standby->find_in_log(
             "requested WAL segment [0-9A-F]+ has already been removed",
             $logstart))
     {
@@ -264,25 +261,3 @@ sub advance_wal
     }
     return;
 }
-
-# return the size of logfile of $node in bytes
-sub get_log_size
-{
-    my ($node) = @_;
-
-    return (stat $node->logfile)[7];
-}
-
-# find $pat in logfile of $node after $off-th byte
-sub find_in_log
-{
-    my ($node, $pat, $off) = @_;
-
-    $off = 0 unless defined $off;
-    my $log = TestLib::slurp_file($node->logfile);
-    return 0 if (length($log) <= $off);
-
-    $log = substr($log, $off);
-
-    return $log =~ m/$pat/;
-}
-- 
2.27.0

From 97d3cfbad959ca889f26fe0d666566340161bd39 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 9 Dec 2020 17:21:49 +0900
Subject: [PATCH v4 2/3] New test for timeline-tracking of walsender

Walsender should track timeline changes while sending a historic
timeline. Add a test for it.
---
 src/test/recovery/t/001_stream_rep.pl | 41 ++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 9e31a53de7..e78e98fb43 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -2,8 +2,9 @@
 use strict;
 use warnings;
 use PostgresNode;
+use Time::HiRes qw(usleep);
 use TestLib;
-use Test::More tests => 36;
+use Test::More tests => 37;
 
 # Initialize primary node
 my $node_primary = get_new_node('primary');
@@ -409,3 +410,41 @@ ok( ($phys_restart_lsn_pre cmp $phys_restart_lsn_post) == 0,
 my $primary_data = $node_primary->data_dir;
 ok(!-f "$primary_data/pg_wal/$segment_removed",
     "WAL segment $segment_removed recycled after physical slot advancing");
+
+#
+# Check if timeline-increment works while reading a historic timeline.
+my $node_primary_2 = get_new_node('primary_2');
+# archiving is needed to create .paritial segment
+$node_primary_2->init(allows_streaming => 1, has_archiving => 1);
+$node_primary_2->start;
+$node_primary_2->backup($backup_name);
+my $node_standby_3 = get_new_node('standby_3');
+$node_standby_3->init_from_backup($node_primary_2, $backup_name,
+                                  has_streaming => 1);
+$node_primary_2->stop;
+$node_primary_2->set_standby_mode; # increment primary timeline
+$node_primary_2->start;
+$node_primary_2->promote;
+my $logstart = $node_standby_3->current_log_position();
+$node_standby_3->start;
+
+my $success = 0;
+for (my $i = 0 ; $i < 1000; $i++)
+{
+    if ($node_standby_3->find_in_log(
+            "requested WAL segment [0-9A-F]+ has already been removed",
+            $logstart))
+    {
+        last;
+    }
+    elsif ($node_standby_3->find_in_log(
+            "End of WAL reached on timeline",
+               $logstart))
+    {
+        $success = 1;
+        last;
+    }
+    usleep(100_000);
+}
+
+ok($success, 'Timeline increment while reading a historic timeline');
-- 
2.27.0

From 29decc6dc2ccea5ef4e2c8b92f9c2b65a3898fcb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 9 Dec 2020 17:22:41 +0900
Subject: [PATCH v4 3/3] Fix timeline-tracking failure while sending a historic
 timeline

Walsender should track timeline switches while sending a historic
timeline. Regain that behavior, which was broken in PG13, by a thinko
of 709d003fbd.
---
 src/backend/replication/walsender.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fe0d368a35..8545c6c423 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2491,7 +2491,7 @@ WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
         XLogSegNo    endSegNo;
 
         XLByteToSeg(sendTimeLineValidUpto, endSegNo, state->segcxt.ws_segsize);
-        if (state->seg.ws_segno == endSegNo)
+        if (nextSegNo == endSegNo)
             *tli_p = sendTimeLineNextTLI;
     }
 
-- 
2.27.0

From 933dd946ca547b7de2dfed84807cd9d871c12f6f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 5 Jan 2021 13:34:36 +0900
Subject: [PATCH] Fix timeline-tracking failure while sending a historic
 timeline

Walsender should track timeline switches while sending a historic
timeline. Regain that behavior, which was broken in PG13, by a thinko
of 709d003fbd. Backpatch to PG13.
---
 src/backend/replication/walsender.c       |  2 +-
 src/test/perl/PostgresNode.pm             | 36 ++++++++++++++++++++
 src/test/recovery/t/001_stream_rep.pl     | 41 ++++++++++++++++++++++-
 src/test/recovery/t/019_replslot_limit.pl | 37 ++++----------------
 4 files changed, 83 insertions(+), 33 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7f87eb7f19..04f6c3ebb4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2478,7 +2478,7 @@ WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
         XLogSegNo    endSegNo;
 
         XLByteToSeg(sendTimeLineValidUpto, endSegNo, state->segcxt.ws_segsize);
-        if (state->seg.ws_segno == endSegNo)
+        if (nextSegNo == endSegNo)
             *tli_p = sendTimeLineNextTLI;
     }
 
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 980f1f1533..a08c71b549 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2138,6 +2138,42 @@ sub pg_recvlogical_upto
 
 =pod
 
+=item $node->current_log_position()
+
+Return the current position of server log.
+
+=cut
+
+sub current_log_position
+{
+    my $self = shift;
+
+    return (stat $self->logfile)[7];
+}
+
+=pod
+
+=item $node->find_in_log($pattern, $startpos)
+
+Returns whether the $pattern occurs after $startpos in the server log.
+
+=cut
+
+sub find_in_log
+{
+    my ($self, $pattern, $startpos) = @_;
+
+    $startpos = 0 unless defined $startpos;
+    my $log = TestLib::slurp_file($self->logfile);
+    return 0 if (length($log) <= $startpos);
+
+    $log = substr($log, $startpos);
+
+    return $log =~ m/$pattern/;
+}
+
+=pod
+
 =back
 
 =cut
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 778f11b28b..8d2b24fe55 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -2,8 +2,9 @@
 use strict;
 use warnings;
 use PostgresNode;
+use Time::HiRes qw(usleep);
 use TestLib;
-use Test::More tests => 36;
+use Test::More tests => 37;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -409,3 +410,41 @@ ok( ($phys_restart_lsn_pre cmp $phys_restart_lsn_post) == 0,
 my $master_data = $node_master->data_dir;
 ok(!-f "$master_data/pg_wal/$segment_removed",
     "WAL segment $segment_removed recycled after physical slot advancing");
+
+#
+# Check if timeline-increment works while reading a historic timeline.
+my $node_primary_2 = get_new_node('primary_2');
+# archiving is needed to create .paritial segment
+$node_primary_2->init(allows_streaming => 1, has_archiving => 1);
+$node_primary_2->start;
+$node_primary_2->backup($backup_name);
+my $node_standby_3 = get_new_node('standby_3');
+$node_standby_3->init_from_backup($node_primary_2, $backup_name,
+                                  has_streaming => 1);
+$node_primary_2->stop;
+$node_primary_2->set_standby_mode; # increment primary timeline
+$node_primary_2->start;
+$node_primary_2->promote;
+my $logstart = $node_standby_3->current_log_position();
+$node_standby_3->start;
+
+my $success = 0;
+for (my $i = 0 ; $i < 1000; $i++)
+{
+    if ($node_standby_3->find_in_log(
+            "requested WAL segment [0-9A-F]+ has already been removed",
+            $logstart))
+    {
+        last;
+    }
+    elsif ($node_standby_3->find_in_log(
+            "End of WAL reached on timeline",
+               $logstart))
+    {
+        $success = 1;
+        last;
+    }
+    usleep(100_000);
+}
+
+ok($success, 'Timeline increment while reading a historic timeline');
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index a7231dcd47..8b3c5de057 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -165,19 +165,17 @@ $node_master->wait_for_catchup($node_standby, 'replay', $start_lsn);
 
 $node_standby->stop;
 
-ok( !find_in_log(
-        $node_standby,
-        "requested WAL segment [0-9A-F]+ has already been removed"),
+ok( !$node_standby->find_in_log(
+         "requested WAL segment [0-9A-F]+ has already been removed"),
     'check that required WAL segments are still available');
 
 # Advance WAL again, the slot loses the oldest segment.
-my $logstart = get_log_size($node_master);
+my $logstart = $node_master->current_log_position();
 advance_wal($node_master, 7);
 $node_master->safe_psql('postgres', "CHECKPOINT;");
 
 # WARNING should be issued
-ok( find_in_log(
-        $node_master,
+ok( $node_master->find_in_log(
         "invalidating slot \"rep1\" because its restart_lsn [0-9A-F/]+ exceeds max_slot_wal_keep_size",
         $logstart),
     'check that the warning is logged');
@@ -190,14 +188,13 @@ is($result, "rep1|f|t|lost|",
     'check that the slot became inactive and the state "lost" persists');
 
 # The standby no longer can connect to the master
-$logstart = get_log_size($node_standby);
+$logstart = $node_standby->current_log_position();
 $node_standby->start;
 
 my $failed = 0;
 for (my $i = 0; $i < 10000; $i++)
 {
-    if (find_in_log(
-            $node_standby,
+    if ($node_standby->find_in_log(
             "requested WAL segment [0-9A-F]+ has already been removed",
             $logstart))
     {
@@ -264,25 +261,3 @@ sub advance_wal
     }
     return;
 }
-
-# return the size of logfile of $node in bytes
-sub get_log_size
-{
-    my ($node) = @_;
-
-    return (stat $node->logfile)[7];
-}
-
-# find $pat in logfile of $node after $off-th byte
-sub find_in_log
-{
-    my ($node, $pat, $off) = @_;
-
-    $off = 0 unless defined $off;
-    my $log = TestLib::slurp_file($node->logfile);
-    return 0 if (length($log) <= $off);
-
-    $log = substr($log, $off);
-
-    return $log =~ m/$pat/;
-}
-- 
2.27.0


pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: CheckpointLock needed in CreateCheckPoint()?
Next
From: Dilip Kumar
Date:
Subject: Re: CheckpointLock needed in CreateCheckPoint()?