Thread: Fixing WAL instability in various TAP tests
Hackers,

A few TAP tests in the project appear to be sensitive to reductions of the PostgresNode's max_wal_size setting, resulting in tests failing due to wal files having been removed too soon. The failures in the logs typically are of the "requested WAL segment %s has already been removed" variety. I would expect tests which fail under legal alternate GUC settings to be hardened to explicitly set the GUCs as they need, rather than implicitly relying on the defaults. As far as missing WAL files go, I would expect the TAP test to prevent this with the use of replication slots or some other mechanism, and not simply to rely on checkpoints not happening too soon. I'm curious if others on this list disagree with that point of view.

Failures in src/test/recovery/t/015_promotion_pages.pl can be fixed by creating a physical replication slot on node "alpha" and using it from node "beta", a technique already used in other TAP tests and apparently merely overlooked in this one.

The first two tests in src/bin/pg_basebackup/t fail, and it's not clear that physical replication slots are the appropriate solution, since no replication is happening. It's not immediately obvious that the tests are at fault anyway. On casual inspection, it seems they might be detecting a live bug which simply doesn't manifest under larger values of max_wal_size. Test 010 appears to show a bug with `pg_basebackup -X`, and test 020 with `pg_receivewal`.

The test in contrib/bloom/t/ is deliberately disabled in contrib/bloom/Makefile with a comment that the test is unstable in the buildfarm, but I didn't find anything to explain what exactly those buildfarm failures might have been when I chased down the email thread that gave rise to the related commit. That test happens to be stable on my laptop until I change GUC settings to both reduce max_wal_size=32MB and to set wal_consistency_checking=all.

Thoughts?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
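For illustration only, the slot-based hardening described above might look roughly like the following in a TAP test; the node handles and the slot name are hypothetical placeholders, not the actual patch:

    # Create a physical replication slot on the primary ("alpha") so that WAL
    # the standby still needs is retained rather than recycled at checkpoints.
    $node_alpha->safe_psql('postgres',
        "SELECT pg_create_physical_replication_slot('slot_beta');");

    # Have the standby ("beta") stream from that slot.
    $node_beta->append_conf('postgresql.conf',
        "primary_slot_name = 'slot_beta'");
    $node_beta->start;

With a slot in place, the primary holds on to WAL segments the standby has not yet received, independently of how small max_wal_size is.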
On Fri, Sep 24, 2021 at 05:33:13PM -0700, Mark Dilger wrote:
> A few TAP tests in the project appear to be sensitive to reductions of the
> PostgresNode's max_wal_size setting, resulting in tests failing due to wal
> files having been removed too soon. The failures in the logs typically are
> of the "requested WAL segment %s has already been removed" variety. I would
> expect tests which fail under legal alternate GUC settings to be hardened to
> explicitly set the GUCs as they need, rather than implicitly relying on the
> defaults.

That is not the general practice in PostgreSQL tests today. The buildfarm exercises some settings, so we keep the tests clean for those. Coping with max_wal_size=2 that way sounds reasonable. I'm undecided about the value of hardening tests against all possible settings. On the plus side, it would let us run one buildfarm member that sets every setting to its min_val or enumvals[1] and another member that elects enumvals[cardinality(enumvals)] or max_val. We'd catch some real bugs that way. On the minus side, every nontrivial test suite addition would need to try those two cases before commit or risk buildfarm wrath. I don't know whether the bugs found would pay for that trouble.

(There's also a less-important loss around the ability to exercise a setting and manually inspect the results. For example, I sometimes test parallel_tuple_cost=0 parallel_setup_cost=0 and confirm a lack of crashes. After hardening, that would require temporary source code edits to remove the hardening. That's fine, though.)
> On Sep 24, 2021, at 10:21 PM, Noah Misch <noah@leadboat.com> wrote:
>
>> I would
>> expect tests which fail under legal alternate GUC settings to be hardened to
>> explicitly set the GUCs as they need, rather than implicitly relying on the
>> defaults.
>
> That is not the general practice in PostgreSQL tests today. The buildfarm
> exercises some settings, so we keep the tests clean for those. Coping with
> max_wal_size=2 that way sounds reasonable. I'm undecided about the value of
> hardening tests against all possible settings.

Leaving the tests brittle wastes developer time. I ran into this problem when I changed the storage underlying bloom indexes and ran the contrib/bloom/t/001_wal.pl test with wal_consistency_checking=all. That caused the test to fail with errors about missing wal files, and it took time to backtrack and see that the test fails under this setting even before applying my storage layer changes. Ordinarily, failures about missing wal files would have led me to suspect the TAP test sooner, but since I had mucked around with storage and wal it initially seemed plausible that my code changes were the problem. The real problem is that a replication slot is not used in the test.

The failure in src/test/recovery/t/015_promotion_pages.pl is also that a replication slot should be used but is not.

The failure in src/bin/pg_basebackup/t/010_pg_basebackup.pl stems from not heeding the documented requirement for pg_basebackup -X fetch that the wal_keep_size "be set high enough that the required log data is not removed before the end of the backup". It's just assuming that it will be, because that tends to be true under default GUC settings. I think this can be fixed by setting wal_keep_size=<SOMETHING_BIG_ENOUGH>, but (a) you say this is not the general practice in PostgreSQL tests today, and (b) there doesn't seem to be any principled way to decide what value would be big enough. Sure, we can use something that is big enough in practice, and we'll probably have to go with that, but it feels like we're just papering over the problem.

I'm inclined to guess that the problem in src/bin/pg_basebackup/t/020_pg_receivewal.pl is similar.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
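As a sketch of what the wal_keep_size approach might look like if some concrete value were picked (the node handle and the 1GB figure are illustrative guesses, not taken from a patch):

    # Keep "enough" WAL around for pg_basebackup -X fetch to find at the end
    # of the backup. The value is an arbitrary guess; there is no principled
    # way to know how big is big enough, which is the weakness noted above.
    $node->append_conf('postgresql.conf', 'wal_keep_size = 1GB');
    $node->reload;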
Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> On Sep 24, 2021, at 10:21 PM, Noah Misch <noah@leadboat.com> wrote:
>>> I would
>>> expect tests which fail under legal alternate GUC settings to be hardened to
>>> explicitly set the GUCs as they need, rather than implicitly relying on the
>>> defaults.

>> That is not the general practice in PostgreSQL tests today. The buildfarm
>> exercises some settings, so we keep the tests clean for those. Coping with
>> max_wal_size=2 that way sounds reasonable. I'm undecided about the value of
>> hardening tests against all possible settings.

> Leaving the tests brittle wastes developer time.

Trying to make them proof against all possible settings would waste a lot more time, though.

			regards, tom lane
> On Sep 25, 2021, at 7:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Leaving the tests brittle wastes developer time.
>
> Trying to make them proof against all possible settings would waste
> a lot more time, though.

You may be right, but the conversation about "all possible settings" was started by Noah. I was really just talking about tests that depend on wal files not being removed, but taking no action to guarantee that, merely trusting that under default settings they won't be. I can't square that design against other TAP tests that do take measures to prevent wal files being removed. Why is the precaution taken in some tests but not others? If this is intentional, shouldn't some comment in the tests without such precautions explain that choice? Are they intentionally testing that the default GUC wal size settings and wal verbosity won't break the test?

This isn't a rhetorical question: In src/test/recovery/t/015_promotion_pages.pl, the comments talk about how checkpoints impact what happens on the standby. The test issues an explicit checkpoint on the primary, and again later on the standby, so it is unclear if that's what the comments refer to, or if they also refer to implicit expectations about when/if other checkpoints will happen.

The test breaks when I change the GUC settings, but I can fix that breakage by adding a replication slot to the test. Have I broken the purpose of the test by doing so, though? Does using a replication slot to force the wal to not be removed early break what the test is designed to check?

The other tests raise similar questions. Is the brittleness intentional?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> You may be right, but the conversation about "all possible settings" was started by Noah. I was really just talking about tests that depend on wal files not being removed, but taking no action to guarantee that, merely trusting that under default settings they won't be. I can't square that design against other TAP tests that do take measures to prevent wal files being removed. Why is the precaution taken in some tests but not others?

If we are doing something about that in some test cases, I'd agree with doing the same thing in others that need it. It seems more likely to be an oversight than an intentional test condition.

			regards, tom lane
On Sat, Sep 25, 2021 at 08:20:06AM -0700, Mark Dilger wrote:
> > On Sep 25, 2021, at 7:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Leaving the tests brittle wastes developer time.
> >
> > Trying to make them proof against all possible settings would waste
> > a lot more time, though.
>
> You may be right, but the conversation about "all possible settings" was
> started by Noah.

You wrote, "I would expect tests which fail under legal alternate GUC settings to be hardened to explicitly set the GUCs as they need, rather than implicitly relying on the defaults." I read that as raising the general principle, not just a narrow argument about max_wal_size. We can discontinue talking about the general principle and focus on max_wal_size.

> I was really just talking about tests that depend on wal
> files not being removed, but taking no action to guarantee that, merely
> trusting that under default settings they won't be.

As I said, +1 for making tests pass under the min_val of max_wal_size. If you want to introduce a max_wal_size=2 buildfarm member so it stays that way, +1 for that as well.
Noah Misch <noah@leadboat.com> writes:
> On Sat, Sep 25, 2021 at 08:20:06AM -0700, Mark Dilger wrote:
>> You may be right, but the conversation about "all possible settings" was
>> started by Noah.

> You wrote, "I would expect tests which fail under legal alternate GUC settings
> to be hardened to explicitly set the GUCs as they need, rather than implicitly
> relying on the defaults." I read that as raising the general principle, not
> just a narrow argument about max_wal_size.

As did I.

> We can discontinue talking about
> the general principle and focus on max_wal_size.

It is worth stopping to think about whether there are adjacent settings that need similar treatment. In general, it seems like "premature discarding of WAL segments" is something akin to "premature timeout" errors, and we've got a pretty aggressive policy about preventing those. There are a lot of settings that I'd *not* be in favor of trying to be bulletproof about, because it doesn't seem worth the trouble; but perhaps this one is.

			regards, tom lane
> On Sep 25, 2021, at 9:00 AM, Noah Misch <noah@leadboat.com> wrote:
>
>> You may be right, but the conversation about "all possible settings" was
>> started by Noah.
>
> You wrote, "I would expect tests which fail under legal alternate GUC settings
> to be hardened to explicitly set the GUCs as they need, rather than implicitly
> relying on the defaults." I read that as raising the general principle, not
> just a narrow argument about max_wal_size.

In the first draft of my email to Tom, I had language about my inartful crafting of my original post that led Noah to respond as he did.... I couldn't quite figure out how to phrase that without distracting from the main point. I don't think you were (much) offended, but my apologies for any perceived fingerpointing.

I also don't have a problem with your idea of testing in the build farm with some animals having the gucs set to minimum values and some to maximum and so forth. I like that idea generally, though don't feel competent to predict how much work that would be to maintain, so I'm just deferring to Tom's and your judgement about that.

My inartful first post was really meant to say, "here is a problem that I perceive about tap tests vis-a-vis wal files, do people disagree with me that this is a problem, and would patches to address the problem be welcome?" I took Tom's response to be, "yeah, go ahead", and am mostly waiting for the weekend to be over to see if anybody else has anything to say about it.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On Sep 25, 2021, at 11:04 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> I took Tom's response to be, "yeah, go ahead", and am mostly waiting for the weekend to be over to see if anybody else has anything to say about it.

Here is a patch set, one patch per test. The third patch enables its test in the Makefile, which is commented as having been disabled due to the test being unstable in the build farm. Re-enabling the test might be wrong, since the instability might not have been due to WAL being recycled early. I didn't find enough history about why that test was disabled, so trying it in the build farm again is the best I can suggest. Maybe somebody else on this list knows more?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> Here is a patch set, one patch per test. The third patch enables its test in the Makefile, which is commented as having been disabled due to the test being unstable in the build farm. Re-enabling the test might be wrong, since the instability might not have been due to WAL being recycled early. I didn't find enough history about why that test was disabled, so trying it in the build farm again is the best I can suggest. Maybe somebody else on this list knows more?

Digging in the archives where the commit points, I find

https://www.postgresql.org/message-id/flat/20181126025125.GH1776%40paquier.xyz

which says there was an unexpected result on my animal longfin.

I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop just now, and it passed fine. The laptop is not exactly the same as longfin was in 2018, but it ought to be close enough. Not sure what to make of that --- maybe the failure is only intermittent, or else we fixed the underlying issue since then.

I'm a little inclined to re-enable the test without your other changes, just to see what happens.

			regards, tom lane
> On Sep 27, 2021, at 1:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I'm a little inclined to re-enable the test without your other
> changes, just to see what happens.

That sounds like a good idea. Even if it passes at first, I'd prefer to leave it for a week or more to have a better sense of how stable it is. Applying my patches too soon would just add more variables to a not-so-well understood situation.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> On Sep 27, 2021, at 1:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm a little inclined to re-enable the test without your other
>> changes, just to see what happens.

> That sounds like a good idea. Even if it passes at first, I'd prefer to leave it for a week or more to have a better sense of how stable it is. Applying my patches too soon would just add more variables to a not-so-well understood situation.

Done. I shall retire to a safe viewing distance and observe the buildfarm.

			regards, tom lane
On Mon, Sep 27, 2021 at 04:19:27PM -0400, Tom Lane wrote:
> I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
> just now, and it passed fine. The laptop is not exactly the same
> as longfin was in 2018, but it ought to be close enough. Not sure
> what to make of that --- maybe the failure is only intermittent,
> or else we fixed the underlying issue since then.

Honestly, I have no idea what change in the backend matters here. And it is not like bloom has changed in any significant way since d3c09b9.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Sep 27, 2021 at 04:19:27PM -0400, Tom Lane wrote:
>> I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
>> just now, and it passed fine. The laptop is not exactly the same
>> as longfin was in 2018, but it ought to be close enough. Not sure
>> what to make of that --- maybe the failure is only intermittent,
>> or else we fixed the underlying issue since then.

> Honestly, I have no idea what change in the backend matters here. And
> it is not like bloom has changed in any significant way since d3c09b9.

I went so far as to check out 03faa4a8dd on longfin's host, and I find that I cannot reproduce the failure shown at

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25+23%3A59%3A03

So that's the same hardware, and identical PG source tree, and different results. This seems to leave only two theories standing:

1. It was a since-fixed macOS bug. (Unlikely, especially if we also saw it on other platforms.)

2. The failure manifested only in the buildfarm, not under manual "make check". This is somewhat more plausible, especially since subsequent buildfarm script changes might then explain why it went away. But I have no idea what the "subsequent script changes" might've been.

			regards, tom lane
On 9/27/21 10:20 PM, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> On Mon, Sep 27, 2021 at 04:19:27PM -0400, Tom Lane wrote:
>>> I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
>>> just now, and it passed fine. The laptop is not exactly the same
>>> as longfin was in 2018, but it ought to be close enough. Not sure
>>> what to make of that --- maybe the failure is only intermittent,
>>> or else we fixed the underlying issue since then.
>> Honestly, I have no idea what change in the backend matters here. And
>> it is not like bloom has changed in any significant way since d3c09b9.
> I went so far as to check out 03faa4a8dd on longfin's host, and I find
> that I cannot reproduce the failure shown at
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25+23%3A59%3A03
>
> So that's the same hardware, and identical PG source tree, and different
> results. This seems to leave only two theories standing:
>
> 1. It was a since-fixed macOS bug. (Unlikely, especially if we also saw
> it on other platforms.)
>
> 2. The failure manifested only in the buildfarm, not under manual "make
> check". This is somewhat more plausible, especially since subsequent
> buildfarm script changes might then explain why it went away. But I have
> no idea what the "subsequent script changes" might've been.

Nothing I can think of.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
I wrote:
> So that's the same hardware, and identical PG source tree, and different
> results. This seems to leave only two theories standing:

I forgot theory 3: it's intermittent. Apparently the probability has dropped a lot since 2018, but behold:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2021-09-28%2014%3A20%3A41

(with successful runs just before and after this one, on the same animal)

Note that the delta is not exactly like the previous result, either. So there's more than one symptom, but in any case it seems like we have an issue in WAL replay. I wonder whether it's bloom's fault or a core bug.

			regards, tom lane
> On Sep 28, 2021, at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wonder whether it's bloom's fault
> or a core bug.

Looking closer at the TAP test, it's not ORDERing the result set from the SELECTs on either node, but it is comparing the sets for stringwise equality, which is certainly order dependent.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I wrote:
> So there's more than one symptom, but in any case it seems like
> we have an issue in WAL replay. I wonder whether it's bloom's fault
> or a core bug.

Actually ... I bet it's just the test script's fault. It waits for the standby to catch up like this:

    my $caughtup_query =
      "SELECT pg_current_wal_lsn() <= write_lsn FROM pg_stat_replication WHERE application_name = '$applname';";
    $node_primary->poll_query_until('postgres', $caughtup_query)
      or die "Timed out while waiting for standby 1 to catch up";

which seems like completely the wrong condition. Don't we need the standby to have *replayed* the WAL, not merely written it to disk?

I'm also wondering why this doesn't use wait_for_catchup, instead of reinventing the query to use.

			regards, tom lane
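A minimal sketch of a replay-based condition, assuming one wanted to keep a hand-rolled query rather than switch to the helper, might be:

    # Wait until the standby has *replayed*, not merely received and written,
    # all WAL up to the primary's current WAL position.
    my $caughtup_query =
      "SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication"
      . " WHERE application_name = '$applname';";
    $node_primary->poll_query_until('postgres', $caughtup_query)
      or die "Timed out while waiting for standby 1 to catch up";

In practice, calling wait_for_catchup is simpler, since its default mode waits on replay.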
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> Looking closer at the TAP test, it's not ORDERing the result set from the SELECTs on either node, but it is comparing the sets for stringwise equality, which is certainly order dependent.

Well, it's forcing a bitmap scan, so what we're getting is the native ordering of a bitmapscan result. That should match, given that what we're doing is physical replication. I think adding ORDER BY would be more likely to obscure real issues than fix test instability.

			regards, tom lane
> On Sep 28, 2021, at 11:07 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> Looking closer at the TAP test, it's not ORDERing the result set from the SELECTs on either node, but it is comparing the sets for stringwise equality, which is certainly order dependent.

Taking the output from the buildfarm page, parsing out the first test's results and comparing got vs. expected for this test:

    is($primary_result, $standby_result, "$test_name: query result matches");

the primary result had all the same rows as the standby, along with additional rows. Comparing the results, they match other than rows missing from the standby that are present on the primary. That seems consistent with the view that the query on the standby is running before all the data has replicated across.

However, the missing rows all have column i either 0 or 3, though the test round-robins i=0..9 as it performs the inserts. I would expect the wal for the inserts to not cluster around any particular value of i. The DELETE and VACUUM commands do operate on a single value of i, so that would make sense if the data failed to be deleted on the standby after successfully being deleted on the primary, but then I'd expect the standby to have more rows, not fewer.

Perhaps having the bloom index messed up answers that, though. I think it should be easy enough to get the path to the heap main table fork and the bloom main index fork for both the primary and standby and do a filesystem comparison as part of the wal test. That would tell us if they differ, and also if the differences are limited to just one or the other. I'll go write that up....

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
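A rough sketch of that filesystem comparison, assuming the relation names used by the bloom test and a checkpoint on both nodes to flush the buffers (this is a hypothetical outline, not the eventual write-up):

    # Compare the main-fork files of the heap and the bloom index byte for
    # byte between primary and standby. Relies on a physical standby using
    # the same relfilenode paths as the primary. (Hint-bit differences could
    # still cause false mismatches unless wal_log_hints or checksums are on.)
    use File::Compare;

    $node_primary->safe_psql('postgres', 'CHECKPOINT;');
    $node_standby->safe_psql('postgres', 'CHECKPOINT;');

    for my $rel ('tst', 'bloomidx')
    {
        my $path = $node_primary->safe_psql('postgres',
            "SELECT pg_relation_filepath('$rel');");
        is(File::Compare::compare(
                $node_primary->data_dir . "/$path",
                $node_standby->data_dir . "/$path"),
            0, "main fork of $rel matches on primary and standby");
    }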
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> Perhaps having the bloom index messed up answers that, though. I think it should be easy enough to get the path to the heap main table fork and the bloom main index fork for both the primary and standby and do a filesystem comparison as part of the wal test. That would tell us if they differ, and also if the differences are limited to just one or the other.

I think that's probably overkill, and definitely out-of-scope for contrib/bloom. If we fear that WAL replay is not reproducing the data accurately, we should be testing for that in some more centralized place.

Anyway, I confirmed my diagnosis by adding a delay in WAL apply (0001 below); that makes this test fall over spectacularly. And 0002 fixes it. So I propose to push 0002 as soon as the v14 release freeze ends.

Should we back-patch 0002? I'm inclined to think so. Should we then also back-patch enablement of the bloom test? Less sure about that, but I'd lean to doing so. A test that appears to be there but isn't actually invoked is pretty misleading.

			regards, tom lane

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..eecbe57aee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7370,6 +7370,9 @@ StartupXLOG(void)
 			{
 				bool		switchedTLI = false;
 
+				if (random() < INT_MAX/100)
+					pg_usleep(100000);
+
 #ifdef WAL_DEBUG
 				if (XLOG_DEBUG ||
 					(rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||

diff --git a/contrib/bloom/t/001_wal.pl b/contrib/bloom/t/001_wal.pl
index 55ad35926f..be8916a8eb 100644
--- a/contrib/bloom/t/001_wal.pl
+++ b/contrib/bloom/t/001_wal.pl
@@ -16,12 +16,10 @@ sub test_index_replay
 {
 	my ($test_name) = @_;
 
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
 	# Wait for standby to catch up
-	my $applname = $node_standby->name;
-	my $caughtup_query =
-	  "SELECT pg_current_wal_lsn() <= write_lsn FROM pg_stat_replication WHERE application_name = '$applname';";
-	$node_primary->poll_query_until('postgres', $caughtup_query)
-	  or die "Timed out while waiting for standby 1 to catch up";
+	$node_primary->wait_for_catchup($node_standby);
 
 	my $queries = qq(SET enable_seqscan=off;
 	SET enable_bitmapscan=on;
On Tue, Sep 28, 2021 at 03:00:13PM -0400, Tom Lane wrote:
> Should we back-patch 0002? I'm inclined to think so. Should
> we then also back-patch enablement of the bloom test? Less
> sure about that, but I'd lean to doing so. A test that appears
> to be there but isn't actually invoked is pretty misleading.

A backpatch sounds adapted to me for both patches. The only risk that I could see here is somebody implementing a new test by copy-pasting this one if we were to keep things as they are on stable branches.
--
Michael
On 9/28/21, 8:17 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> On Tue, Sep 28, 2021 at 03:00:13PM -0400, Tom Lane wrote:
>> Should we back-patch 0002? I'm inclined to think so. Should
>> we then also back-patch enablement of the bloom test? Less
>> sure about that, but I'd lean to doing so. A test that appears
>> to be there but isn't actually invoked is pretty misleading.
>
> A backpatch sounds adapted to me for both patches. The only risk that
> I could see here is somebody implementing a new test by copy-pasting
> this one if we were to keep things as they are on stable branches.

I found this thread via the Commitfest entry (https://commitfest.postgresql.org/35/3333/), and I also see that the following patches have been committed:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7d1aa6b
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6bc6bd4

However, it looks like there are a couple of other patches upthread [0] that attempt to ensure the tests pass for different settings of max_wal_size. Do we intend to proceed with those, or should we just close out the Commitfest entry?

Nathan

[0] https://postgr.es/m/C1D227C2-C271-4310-8C85-C5368C298622%40enterprisedb.com
> On Oct 21, 2021, at 3:23 PM, Bossart, Nathan <bossartn@amazon.com> wrote:
>
> Do we intend to proceed with those, or should we just
> close out the Commitfest entry?

I have withdrawn the patch. The issues were intermittent on the buildfarm, and committing other changes along with what Tom already committed would seem to confuse matters if any new issues were to arise. We can come back to this sometime in the future, if need be.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company