Re: A failure of standby to follow timeline switch - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: A failure of standby to follow timeline switch
Date
Msg-id 20210105.172602.1507192651696938626.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: A failure of standby to follow timeline switch  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: A failure of standby to follow timeline switch  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: A failure of standby to follow timeline switch  (Fujii Masao <masao.fujii@oss.nttdata.com>)
List pgsql-hackers
At Mon, 4 Jan 2021 19:00:21 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2021/01/04 12:06, Kyotaro Horiguchi wrote:
> > At Sat, 26 Dec 2020 02:15:06 +0900, Fujii Masao
> > <masao.fujii@oss.nttdata.com> wrote in
> >>
> >> On 2020/12/25 12:03, Kyotaro Horiguchi wrote:
> > The attached is the fixed version.
> 
> Thanks for updating the patches!
> 
> > In the first patch, the test added to 001_stream_rep.pl involves two
> > copied functions related to server-log investigation from
> > 019_repslot_limit.pl.
> 
> So you're planning to define them commonly in TestLib.pm or elsewhere?

Yeah.. That's correct. Newly added as the first patch.

While making that change, I extended the interface of slurp_file to
allow reading from arbitrary position. I attached this as a separate
patch just for clarifying the changeset.

The existing messages for open() and OSHandleOpen() look somewhat
strange after patching since they are not really "read" errors, but
they're harmless. (It successfully ran also on Windows10)

The first hunk below is a fix for a forgotten line-feed.

         my $fHandle = createFile($filename, "r", "rwd")
-          or croak "could not open \"$filename\": $^E";
+          or croak "could not open \"$filename\": $^E\n";
         OsFHandleOpen(my $fh = IO::Handle->new(), $fHandle, 'r')
           or croak "could not read \"$filename\": $^E\n";
+        seek($fh, $from, 0)
+          or croak "could not seek \"$filename\" to $from: $^E\n";


> +$node_primary_2->init(allows_streaming => 1);
> +$node_primary_2->enable_archiving; # needed to make .paritial segment
> 
> Isn't it better to use has_archiving flag in init() instead of doing
> enable_archiving, like other tests do?

Agreed. Fixed 0002 (formerly 0001).

> 0002 looks good to me.

Thanks. The attached is the revised patchset. 

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 41169fdadeecc04c20e537cc4e8f68da1a6b3f94 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 5 Jan 2021 13:34:36 +0900
Subject: [PATCH v3 1/3] Move TAP log-searching feature to common modules

Some private functions in 019_repslot_limit.pl will be used in other
tests. Move them to common modules so that they are available to the
new tests.
---
 src/test/perl/PostgresNode.pm             | 33 ++++++++++++++++++++
 src/test/perl/TestLib.pm                  | 16 +++++++---
 src/test/recovery/t/019_replslot_limit.pl | 37 ++++-------------------
 3 files changed, 51 insertions(+), 35 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9667f7667e..b09273efe5 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2223,6 +2223,39 @@ sub pg_recvlogical_upto
 
 =pod
 
+=item $node->current_log_position()
+
+Return the current position of server log.
+
+=cut
+
+sub current_log_position
+{
+    my $self = shift;
+
+    return (stat $self->logfile)[7];
+}
+
+=pod
+
+=item $node->find_in_log($pattern, $startpos)
+
+Returns whether the $pattern occurs after $startpos in the server log.
+
+=cut
+
+sub find_in_log
+{
+    my ($self, $pattern, $startpos) = @_;
+
+    $startpos = 0 unless defined $startpos;
+    my $log = TestLib::slurp_file($self->logfile, $startpos);
+
+    return $log =~ m/$pattern/;
+}
+
+=pod
+
 =back
 
 =cut
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 1baf6bd001..8751f48dfe 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -430,30 +430,38 @@ sub slurp_dir
 
 =pod
 
-=item slurp_file(filename)
+=item slurp_file(filename, pos)
 
-Return the full contents of the specified file.
+Return the contents after pos of the specified file.
+Reutrns the full contents if pos is omitted.
 
 =cut
 
 sub slurp_file
 {
-    my ($filename) = @_;
+    my ($filename, $from) = @_;
     local $/;
     my $contents;
+
+    $from = 0 unless defined $from;
+
     if ($Config{osname} ne 'MSWin32')
     {
         open(my $in, '<', $filename)
           or croak "could not read \"$filename\": $!";
+        seek($in, $from, 0)
+          or croak "could not seek \"$filename\" to $from: $!";
         $contents = <$in>;
         close $in;
     }
     else
     {
         my $fHandle = createFile($filename, "r", "rwd")
-          or croak "could not open \"$filename\": $^E";
+          or croak "could not open \"$filename\": $^E\n";
         OsFHandleOpen(my $fh = IO::Handle->new(), $fHandle, 'r')
           or croak "could not read \"$filename\": $^E\n";
+        seek($fh, $from, 0)
+          or croak "could not seek \"$filename\" to $from: $^E\n";
         $contents = <$fh>;
         CloseHandle($fHandle)
           or croak "could not close \"$filename\": $^E\n";
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 20f4a1bbc3..b298d9992f 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -165,19 +165,17 @@ $node_primary->wait_for_catchup($node_standby, 'replay', $start_lsn);
 
 $node_standby->stop;
 
-ok( !find_in_log(
-        $node_standby,
-        "requested WAL segment [0-9A-F]+ has already been removed"),
+ok( !$node_standby->find_in_log(
+         "requested WAL segment [0-9A-F]+ has already been removed"),
     'check that required WAL segments are still available');
 
 # Advance WAL again, the slot loses the oldest segment.
-my $logstart = get_log_size($node_primary);
+my $logstart = $node_primary->current_log_position();
 advance_wal($node_primary, 7);
 $node_primary->safe_psql('postgres', "CHECKPOINT;");
 
 # WARNING should be issued
-ok( find_in_log(
-        $node_primary,
+ok( $node_primary->find_in_log(
         "invalidating slot \"rep1\" because its restart_lsn [0-9A-F/]+ exceeds max_slot_wal_keep_size",
         $logstart),
     'check that the warning is logged');
@@ -190,14 +188,13 @@ is($result, "rep1|f|t|lost|",
     'check that the slot became inactive and the state "lost" persists');
 
 # The standby no longer can connect to the primary
-$logstart = get_log_size($node_standby);
+$logstart = $node_standby->current_log_position();
 $node_standby->start;
 
 my $failed = 0;
 for (my $i = 0; $i < 10000; $i++)
 {
-    if (find_in_log(
-            $node_standby,
+    if ($node_standby->find_in_log(
             "requested WAL segment [0-9A-F]+ has already been removed",
             $logstart))
     {
@@ -264,25 +261,3 @@ sub advance_wal
     }
     return;
 }
-
-# return the size of logfile of $node in bytes
-sub get_log_size
-{
-    my ($node) = @_;
-
-    return (stat $node->logfile)[7];
-}
-
-# find $pat in logfile of $node after $off-th byte
-sub find_in_log
-{
-    my ($node, $pat, $off) = @_;
-
-    $off = 0 unless defined $off;
-    my $log = TestLib::slurp_file($node->logfile);
-    return 0 if (length($log) <= $off);
-
-    $log = substr($log, $off);
-
-    return $log =~ m/$pat/;
-}
-- 
2.27.0

From 918d75aa4064cda0fe6050b80b3aa914ab7eef47 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 9 Dec 2020 17:21:49 +0900
Subject: [PATCH v3 2/3] New test for timeline-tracking of walsender

Walsender should track timeline changes while sending a historic
timeline. Add a test for it.
---
 src/test/recovery/t/001_stream_rep.pl | 41 ++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 9e31a53de7..e78e98fb43 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -2,8 +2,9 @@
 use strict;
 use warnings;
 use PostgresNode;
+use Time::HiRes qw(usleep);
 use TestLib;
-use Test::More tests => 36;
+use Test::More tests => 37;
 
 # Initialize primary node
 my $node_primary = get_new_node('primary');
@@ -409,3 +410,41 @@ ok( ($phys_restart_lsn_pre cmp $phys_restart_lsn_post) == 0,
 my $primary_data = $node_primary->data_dir;
 ok(!-f "$primary_data/pg_wal/$segment_removed",
     "WAL segment $segment_removed recycled after physical slot advancing");
+
+#
+# Check if timeline-increment works while reading a historic timeline.
+my $node_primary_2 = get_new_node('primary_2');
+# archiving is needed to create .paritial segment
+$node_primary_2->init(allows_streaming => 1, has_archiving => 1);
+$node_primary_2->start;
+$node_primary_2->backup($backup_name);
+my $node_standby_3 = get_new_node('standby_3');
+$node_standby_3->init_from_backup($node_primary_2, $backup_name,
+                                  has_streaming => 1);
+$node_primary_2->stop;
+$node_primary_2->set_standby_mode; # increment primary timeline
+$node_primary_2->start;
+$node_primary_2->promote;
+my $logstart = $node_standby_3->current_log_position();
+$node_standby_3->start;
+
+my $success = 0;
+for (my $i = 0 ; $i < 1000; $i++)
+{
+    if ($node_standby_3->find_in_log(
+            "requested WAL segment [0-9A-F]+ has already been removed",
+            $logstart))
+    {
+        last;
+    }
+    elsif ($node_standby_3->find_in_log(
+            "End of WAL reached on timeline",
+               $logstart))
+    {
+        $success = 1;
+        last;
+    }
+    usleep(100_000);
+}
+
+ok($success, 'Timeline increment while reading a historic timeline');
-- 
2.27.0

From 5212aa7733aecd7c4ae8cc615584f26caa8a2ee4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 9 Dec 2020 17:22:41 +0900
Subject: [PATCH v3 3/3] Fix timeline-tracking failure while sending a historic
 timeline

Walsender should track timeline switches while sending a historic
timeline. Regain that behavior, which was broken in PG13, by a thinko
of 709d003fbd.
---
 src/backend/replication/walsender.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fe0d368a35..8545c6c423 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2491,7 +2491,7 @@ WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
         XLogSegNo    endSegNo;
 
         XLByteToSeg(sendTimeLineValidUpto, endSegNo, state->segcxt.ws_segsize);
-        if (state->seg.ws_segno == endSegNo)
+        if (nextSegNo == endSegNo)
             *tli_p = sendTimeLineNextTLI;
     }
 
-- 
2.27.0


pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: Cirrus CI (Windows help wanted)