Re: Timeout failure in 019_replslot_limit.pl - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Timeout failure in 019_replslot_limit.pl
Date
Msg-id 20210907.120106.1483239898065111540.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: Timeout failure in 019_replslot_limit.pl  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
At Tue, 7 Sep 2021 09:37:10 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:
> > I scraped the buildfarm logs looking for similar failures, and didn't
> > find any.  (019_replslot_limit.pl hasn't failed at all in the farm
> > since the last fix it received, in late July.)
> 
> The interesting bits are in 019_replslot_limit_primary3.log.  In a
> failed run, I can see that we get immediately a process termination,
> as follows:
> 2021-09-07 07:52:53.402 JST [22890] LOG:  terminating process 23082 to release replication slot "rep3"
> 2021-09-07 07:52:53.442 JST [23082] standby_3 FATAL:  terminating connection due to administrator command
> 2021-09-07 07:52:53.442 JST [23082] standby_3 STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
> 2021-09-07 07:52:53.452 JST [23133] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM
pg_replication_slotsWHERE slot_name = 'rep3'
 
> 
> In a successful run, the pattern is different:
> 2021-09-07 09:27:39.832 JST [57114] standby_3 FATAL:  terminating connection due to administrator command
> 2021-09-07 09:27:39.832 JST [57114] standby_3 STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
> 2021-09-07 09:27:39.832 JST [57092] LOG:  invalidating slot "rep3" because its restart_lsn 0/7000D8 exceeds
max_slot_wal_keep_size
> 2021-09-07 09:27:39.833 JST [57092] LOG:  checkpoint complete: wrote
> 19 buffers (14.8%); 0 WAL file(s) added, 1 removed, 0 recycled;
> write=0.025 s, sync=0.001 s, total=0.030 s; sync files=0,
> longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
> 2021-09-07 09:27:39.833 JST [57092] LOG:  checkpoints are occurring too frequently (0 seconds apart)
> 2021-09-07 09:27:39.833 JST [57092] HINT:  Consider increasing the configuration parameter "max_wal_size".
> 2021-09-07 09:27:39.851 JST [57126] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM
pg_replication_slotsWHERE slot_name = 'rep3'
 
> 
> The slot invalidation is forgotten because we don't complete a
> checkpoint that does the work it should do, no?  There is a completed
> checkpoint before we query pg_replication_slots, and the buildfarm
> shows the same thing.

It seems like the "kill 'STOP'" in the script didn't suspend the
processes before advancing WAL. The attached uses 'ps' command to
check that since I didn't come up with the way to do the same in Perl.

I'm still not sure it works as expected, though.  (Imagining the case
where the state changes before the process actually stops..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index e065c5c008..7388900008 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -346,6 +346,8 @@ $logstart = get_log_size($node_primary3);
 # freeze walsender and walreceiver. Slot will still be active, but walreceiver
 # won't get anything anymore.
 kill 'STOP', $senderpid, $receiverpid;
+wait_to_stop($senderpid, $receiverpid);
+
 advance_wal($node_primary3, 2);
 
 my $max_attempts = 180;
@@ -426,3 +428,30 @@ sub find_in_log
 
     return $log =~ m/$pat/;
 }
+
+sub wait_to_stop
+{
+    my (@pids) = @_;
+    my $max_attempts = 180;
+
+    # Haven't found the means to do the same on Windows
+    return if $TestLib::windows_os;
+
+    while ($max_attempts-- >= 0)
+    {
+        my ($all_stopped) = 1;
+
+        foreach my $pid (@pids)
+        {
+            if (`ps -p $pid -o pid,state | tail -1` !~ /^ *\d+ +T/)
+            {
+                $all_stopped = 0;
+                last;
+            }
+        }
+
+        last if ($all_stopped);
+
+        sleep 1;
+    }
+}

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Postgres perl module namespace
Next
From: Amit Kapila
Date:
Subject: Re: [BUG] Failed Assertion in ReorderBufferChangeMemoryUpdate()