Re: Fixing WAL instability in various TAP tests - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Fixing WAL instability in various TAP tests
Date
Msg-id 2870209.1632855613@sss.pgh.pa.us
In response to Re: Fixing WAL instability in various TAP tests  (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses Re: Fixing WAL instability in various TAP tests
List pgsql-hackers
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> Perhaps having the bloom index messed up answers that, though.  I think it
> should be easy enough to get the path to the heap main table fork and the
> bloom main index fork for both the primary and standby and do a filesystem
> comparison as part of the wal test.  That would tell us if they differ, and
> also if the differences are limited to just one or the other.

I think that's probably overkill, and definitely out-of-scope for
contrib/bloom.  If we fear that WAL replay is not reproducing the data
accurately, we should be testing for that in some more centralized place.

Anyway, I confirmed my diagnosis by adding a delay in WAL apply
(0001 below); that makes this test fall over spectacularly.
And 0002 fixes it.  So I propose to push 0002 as soon as the
v14 release freeze ends.
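
For a sense of scale, here is a back-of-the-envelope sketch of what the
delay in 0001 amounts to.  It is not part of either patch.  The two
constants are copied from the 0001 hunk; the function names are made up
for illustration, and the arithmetic assumes a 32-bit int and a random()
that is uniform over [0, 2^31 - 1].  Under those assumptions each replayed
record has about a 1% chance of sleeping for 100 ms, so replaying 10,000
records would add roughly 10 seconds on average.

```c
#define _XOPEN_SOURCE 600       /* exposes random() and srandom() */
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Values copied from the 0001 debugging patch. */
#define DELAY_THRESHOLD (INT_MAX / 100) /* random() below this sleeps */
#define DELAY_USEC      100000L         /* argument given to pg_usleep() */

/*
 * Chance that one replayed record is delayed, assuming random() is
 * uniform over [0, 2^31 - 1] and int is 32 bits wide.
 */
double
delay_probability(void)
{
    return (double) DELAY_THRESHOLD / ((double) INT_MAX + 1.0);
}

/* Expected total added delay, in seconds, over nrecords records. */
double
expected_delay_seconds(long nrecords)
{
    assert(nrecords >= 0);
    return nrecords * delay_probability() * (DELAY_USEC / 1e6);
}

/*
 * Count how many of nrecords iterations would have slept, using the
 * same comparison as the patch but without actually sleeping.
 */
long
count_delayed_records(long nrecords, unsigned int seed)
{
    long        hits = 0;

    srandom(seed);
    for (long i = 0; i < nrecords; i++)
    {
        if (random() < DELAY_THRESHOLD)
            hits++;
    }
    return hits;
}
```

Calling expected_delay_seconds() with the number of WAL records a test
generates gives a rough idea of how far replay would be held back, which
is plenty for a test that does not wait for replay to notice.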

Should we back-patch 0002?  I'm inclined to think so.  Should
we then also back-patch enablement of the bloom test?  Less
sure about that, but I'd lean to doing so.  A test that appears
to be there but isn't actually invoked is pretty misleading.

            regards, tom lane

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..eecbe57aee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7370,6 +7370,9 @@ StartupXLOG(void)
             {
                 bool        switchedTLI = false;

+                if (random() < INT_MAX/100)
+                    pg_usleep(100000);
+
 #ifdef WAL_DEBUG
                 if (XLOG_DEBUG ||
                     (rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||
diff --git a/contrib/bloom/t/001_wal.pl b/contrib/bloom/t/001_wal.pl
index 55ad35926f..be8916a8eb 100644
--- a/contrib/bloom/t/001_wal.pl
+++ b/contrib/bloom/t/001_wal.pl
@@ -16,12 +16,10 @@ sub test_index_replay
 {
     my ($test_name) = @_;

+    local $Test::Builder::Level = $Test::Builder::Level + 1;
+
     # Wait for standby to catch up
-    my $applname = $node_standby->name;
-    my $caughtup_query =
-      "SELECT pg_current_wal_lsn() <= write_lsn FROM pg_stat_replication WHERE application_name = '$applname';";
-    $node_primary->poll_query_until('postgres', $caughtup_query)
-      or die "Timed out while waiting for standby 1 to catch up";
+    $node_primary->wait_for_catchup($node_standby);

     my $queries = qq(SET enable_seqscan=off;
 SET enable_bitmapscan=on;
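
The difference between the removed check and its replacement can be shown
with a toy model.  This is a sketch under stated assumptions, not
PostgreSQL code.  The removed query compared the primary's current LSN
against write_lsn in pg_stat_replication, while wait_for_catchup waits on
replay_lsn when no mode is passed.  The struct and function names below
are hypothetical, and LSNs are reduced to plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* toy stand-in for a WAL position */

/* Hypothetical snapshot of one standby's row in pg_stat_replication. */
typedef struct StandbyProgress
{
    Lsn         write_lsn;      /* WAL the standby has written out */
    Lsn         replay_lsn;     /* WAL the standby has applied */
} StandbyProgress;

/* Condition used by the removed polling query: written is enough. */
bool
caught_up_by_write(Lsn primary_lsn, const StandbyProgress *s)
{
    assert(s != NULL);
    return primary_lsn <= s->write_lsn;
}

/* Condition that matters for queries on the standby: WAL is applied. */
bool
caught_up_by_replay(Lsn primary_lsn, const StandbyProgress *s)
{
    assert(s != NULL);
    return primary_lsn <= s->replay_lsn;
}
```

With write_lsn at 1000 and replay_lsn at 400, the first check passes for a
primary LSN of 1000 and the second does not.  That is the window in which
a query on the standby can still see stale data.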
