Thread: Fixing WAL instability in various TAP tests
Hackers,

A few TAP tests in the project appear to be sensitive to reductions of the PostgresNode's max_wal_size setting, resulting in tests failing due to wal files having been removed too soon. The failures in the logs typically are of the "requested WAL segment %s has already been removed" variety. I would expect tests which fail under legal alternate GUC settings to be hardened to explicitly set the GUCs as they need, rather than implicitly relying on the defaults. As far as missing WAL files go, I would expect the TAP test to prevent this with the use of replication slots or some other mechanism, and not simply to rely on checkpoints not happening too soon. I'm curious if others on this list disagree with that point of view.

Failures in src/test/recovery/t/015_promotion_pages.pl can be fixed by creating a physical replication slot on node "alpha" and using it from node "beta", a technique already used in other TAP tests and apparently merely overlooked in this one.

The first two tests in src/bin/pg_basebackup/t fail, and it's not clear that physical replication slots are the appropriate solution, since no replication is happening. It's not immediately obvious that the tests are at fault anyway. On casual inspection, it seems they might be detecting a live bug which simply doesn't manifest under larger values of max_wal_size. Test 010 appears to show a bug with `pg_basebackup -X`, and test 020 with `pg_receivewal`.

The test in contrib/bloom/t/ is deliberately disabled in contrib/bloom/Makefile with a comment that the test is unstable in the buildfarm, but I didn't find anything to explain what exactly those buildfarm failures might have been when I chased down the email thread that gave rise to the related commit. That test happens to be stable on my laptop until I change GUC settings to both reduce max_wal_size=32MB and to set wal_consistency_checking=all.

Thoughts?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
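For illustration only, the slot-based hardening described above might look roughly like the following in a TAP test; the node handles and the slot name are hypothetical placeholders, not the actual patch:

    # Create a physical replication slot on the primary ("alpha") so that WAL
    # the standby still needs is retained rather than recycled at checkpoints.
    $node_alpha->safe_psql('postgres',
        "SELECT pg_create_physical_replication_slot('slot_beta');");

    # Have the standby ("beta") stream from that slot.
    $node_beta->append_conf('postgresql.conf',
        "primary_slot_name = 'slot_beta'");
    $node_beta->start;

With a slot in place, the primary holds on to WAL segments the standby has not yet received, independently of how small max_wal_size is.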
On Fri, Sep 24, 2021 at 05:33:13PM -0700, Mark Dilger wrote:
> A few TAP tests in the project appear to be sensitive to reductions of the
> PostgresNode's max_wal_size setting, resulting in tests failing due to wal
> files having been removed too soon. The failures in the logs typically are
> of the "requested WAL segment %s has already been removed" variety. I would
> expect tests which fail under legal alternate GUC settings to be hardened to
> explicitly set the GUCs as they need, rather than implicitly relying on the
> defaults.

That is not the general practice in PostgreSQL tests today. The buildfarm exercises some settings, so we keep the tests clean for those. Coping with max_wal_size=2 that way sounds reasonable. I'm undecided about the value of hardening tests against all possible settings. On the plus side, it would let us run one buildfarm member that sets every setting to its min_val or enumvals[1] and another member that elects enumvals[cardinality(enumvals)] or max_val. We'd catch some real bugs that way. On the minus side, every nontrivial test suite addition would need to try those two cases before commit or risk buildfarm wrath. I don't know whether the bugs found would pay for that trouble.

(There's also a less-important loss around the ability to exercise a setting and manually inspect the results. For example, I sometimes test parallel_tuple_cost=0 parallel_setup_cost=0 and confirm a lack of crashes. After hardening, that would require temporary source code edits to remove the hardening. That's fine, though.)
> On Sep 24, 2021, at 10:21 PM, Noah Misch <noah@leadboat.com> wrote:
>
>> I would
>> expect tests which fail under legal alternate GUC settings to be hardened to
>> explicitly set the GUCs as they need, rather than implicitly relying on the
>> defaults.
>
> That is not the general practice in PostgreSQL tests today. The buildfarm
> exercises some settings, so we keep the tests clean for those. Coping with
> max_wal_size=2 that way sounds reasonable. I'm undecided about the value of
> hardening tests against all possible settings.

Leaving the tests brittle wastes developer time. I ran into this problem when I changed the storage underlying bloom indexes and ran the contrib/bloom/t/001_wal.pl test with wal_consistency_checking=all. That caused the test to fail with errors about missing wal files, and it took time to backtrack and see that the test fails under this setting even before applying my storage layer changes. Ordinarily, failures about missing wal files would have led me to suspect the TAP test sooner, but since I had mucked around with storage and wal it initially seemed plausible that my code changes were the problem. The real problem is that a replication slot is not used in the test.

The failure in src/test/recovery/t/015_promotion_pages.pl is also that a replication slot should be used but is not.

The failure in src/bin/pg_basebackup/t/010_pg_basebackup.pl stems from not heeding the documented requirement for pg_basebackup -X fetch that the wal_keep_size "be set high enough that the required log data is not removed before the end of the backup". It's just assuming that it will be, because that tends to be true under default GUC settings. I think this can be fixed by setting wal_keep_size=<SOMETHING_BIG_ENOUGH>, but (a) you say this is not the general practice in PostgreSQL tests today, and (b) there doesn't seem to be any principled way to decide what value would be big enough. Sure, we can use something that is big enough in practice, and we'll probably have to go with that, but it feels like we're just papering over the problem.

I'm inclined to guess that the problem in src/bin/pg_basebackup/t/020_pg_receivewal.pl is similar.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
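As a sketch of what the wal_keep_size approach might look like if some concrete value were picked (the node handle and the 1GB figure are illustrative guesses, not taken from a patch):

    # Keep "enough" WAL around for pg_basebackup -X fetch to find at the end
    # of the backup. The value is an arbitrary guess; there is no principled
    # way to know how big is big enough, which is the weakness noted above.
    $node->append_conf('postgresql.conf', 'wal_keep_size = 1GB');
    $node->reload;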
Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> On Sep 24, 2021, at 10:21 PM, Noah Misch <noah@leadboat.com> wrote:
>>> I would
>>> expect tests which fail under legal alternate GUC settings to be hardened to
>>> explicitly set the GUCs as they need, rather than implicitly relying on the
>>> defaults.

>> That is not the general practice in PostgreSQL tests today. The buildfarm
>> exercises some settings, so we keep the tests clean for those. Coping with
>> max_wal_size=2 that way sounds reasonable. I'm undecided about the value of
>> hardening tests against all possible settings.

> Leaving the tests brittle wastes developer time.

Trying to make them proof against all possible settings would waste a lot more time, though.

			regards, tom lane
> On Sep 25, 2021, at 7:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Leaving the tests brittle wastes developer time.
>
> Trying to make them proof against all possible settings would waste
> a lot more time, though.

You may be right, but the conversation about "all possible settings" was started by Noah. I was really just talking about tests that depend on wal files not being removed, but taking no action to guarantee that, merely trusting that under default settings they won't be. I can't square that design against other TAP tests that do take measures to prevent wal files being removed. Why is the precaution taken in some tests but not others? If this is intentional, shouldn't some comment in the tests without such precautions explain that choice? Are they intentionally testing that the default GUC wal size settings and wal verbosity won't break the test?

This isn't a rhetorical question: In src/test/recovery/t/015_promotion_pages.pl, the comments talk about how checkpoints impact what happens on the standby. The test issues an explicit checkpoint on the primary, and again later on the standby, so it is unclear if that's what the comments refer to, or if they also refer to implicit expectations about when/if other checkpoints will happen.

The test breaks when I change the GUC settings, but I can fix that breakage by adding a replication slot to the test. Have I broken the purpose of the test by doing so, though? Does using a replication slot to force the wal to not be removed early break what the test is designed to check?

The other tests raise similar questions. Is the brittleness intentional?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> You may be right, but the conversation about "all possible settings" was started by Noah. I was really just talking about tests that depend on wal files not being removed, but taking no action to guarantee that, merely trusting that under default settings they won't be. I can't square that design against other TAP tests that do take measures to prevent wal files being removed. Why is the precaution taken in some tests but not others?

If we are doing something about that in some test cases, I'd agree with doing the same thing in others that need it. It seems more likely to be an oversight than an intentional test condition.

			regards, tom lane
On Sat, Sep 25, 2021 at 08:20:06AM -0700, Mark Dilger wrote:
> > On Sep 25, 2021, at 7:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Leaving the tests brittle wastes developer time.
> >
> > Trying to make them proof against all possible settings would waste
> > a lot more time, though.
>
> You may be right, but the conversation about "all possible settings" was
> started by Noah.

You wrote, "I would expect tests which fail under legal alternate GUC settings to be hardened to explicitly set the GUCs as they need, rather than implicitly relying on the defaults." I read that as raising the general principle, not just a narrow argument about max_wal_size. We can discontinue talking about the general principle and focus on max_wal_size.

> I was really just talking about tests that depend on wal
> files not being removed, but taking no action to guarantee that, merely
> trusting that under default settings they won't be.

As I said, +1 for making tests pass under the min_val of max_wal_size. If you want to introduce a max_wal_size=2 buildfarm member so it stays that way, +1 for that as well.
Noah Misch <noah@leadboat.com> writes:
> On Sat, Sep 25, 2021 at 08:20:06AM -0700, Mark Dilger wrote:
>> You may be right, but the conversation about "all possible settings" was
>> started by Noah.

> You wrote, "I would expect tests which fail under legal alternate GUC settings
> to be hardened to explicitly set the GUCs as they need, rather than implicitly
> relying on the defaults." I read that as raising the general principle, not
> just a narrow argument about max_wal_size.

As did I.

> We can discontinue talking about
> the general principle and focus on max_wal_size.

It is worth stopping to think about whether there are adjacent settings that need similar treatment. In general, it seems like "premature discarding of WAL segments" is something akin to "premature timeout" errors, and we've got a pretty aggressive policy about preventing those. There are a lot of settings that I'd *not* be in favor of trying to be bulletproof about, because it doesn't seem worth the trouble; but perhaps this one is.

			regards, tom lane
> On Sep 25, 2021, at 9:00 AM, Noah Misch <noah@leadboat.com> wrote:
>
>> You may be right, but the conversation about "all possible settings" was
>> started by Noah.
>
> You wrote, "I would expect tests which fail under legal alternate GUC settings
> to be hardened to explicitly set the GUCs as they need, rather than implicitly
> relying on the defaults." I read that as raising the general principle, not
> just a narrow argument about max_wal_size.

In the first draft of my email to Tom, I had language about my inartful crafting of my original post that led Noah to respond as he did.... I couldn't quite figure out how to phrase that without distracting from the main point. I don't think you were (much) offended, but my apologies for any perceived fingerpointing.

I also don't have a problem with your idea of testing in the build farm with some animals having the gucs set to minimum values and some to maximum and so forth. I like that idea generally, though don't feel competent to predict how much work that would be to maintain, so I'm just deferring to Tom's and your judgement about that.

My inartful first post was really meant to say, "here is a problem that I perceive about tap tests vis-a-vis wal files, do people disagree with me that this is a problem, and would patches to address the problem be welcome?" I took Tom's response to be, "yeah, go ahead", and am mostly waiting for the weekend to be over to see if anybody else has anything to say about it.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On Sep 25, 2021, at 11:04 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> I took Tom's response to be, "yeah, go ahead", and am mostly waiting for the weekend to be over to see if anybody else has anything to say about it.

Here is a patch set, one patch per test. The third patch enables its test in the Makefile, which is commented as having been disabled due to the test being unstable in the build farm. Re-enabling the test might be wrong, since the instability might not have been due to WAL being recycled early. I didn't find enough history about why that test was disabled, so trying it in the build farm again is the best I can suggest. Maybe somebody else on this list knows more?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> Here is a patch set, one patch per test. The third patch enables its test in the Makefile, which is commented as having been disabled due to the test being unstable in the build farm. Re-enabling the test might be wrong, since the instability might not have been due to WAL being recycled early. I didn't find enough history about why that test was disabled, so trying it in the build farm again is the best I can suggest. Maybe somebody else on this list knows more?

Digging in the archives where the commit points, I find

https://www.postgresql.org/message-id/flat/20181126025125.GH1776%40paquier.xyz

which says there was an unexpected result on my animal longfin.

I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop just now, and it passed fine. The laptop is not exactly the same as longfin was in 2018, but it ought to be close enough. Not sure what to make of that --- maybe the failure is only intermittent, or else we fixed the underlying issue since then.

I'm a little inclined to re-enable the test without your other changes, just to see what happens.

			regards, tom lane
> On Sep 27, 2021, at 1:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I'm a little inclined to re-enable the test without your other
> changes, just to see what happens.

That sounds like a good idea. Even if it passes at first, I'd prefer to leave it for a week or more to have a better sense of how stable it is. Applying my patches too soon would just add more variables to a not-so-well understood situation.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> On Sep 27, 2021, at 1:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm a little inclined to re-enable the test without your other
>> changes, just to see what happens.

> That sounds like a good idea. Even if it passes at first, I'd prefer to leave it for a week or more to have a better sense of how stable it is. Applying my patches too soon would just add more variables to a not-so-well understood situation.

Done. I shall retire to a safe viewing distance and observe the buildfarm.

			regards, tom lane
On Mon, Sep 27, 2021 at 04:19:27PM -0400, Tom Lane wrote:
> I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
> just now, and it passed fine. The laptop is not exactly the same
> as longfin was in 2018, but it ought to be close enough. Not sure
> what to make of that --- maybe the failure is only intermittent,
> or else we fixed the underlying issue since then.

Honestly, I have no idea what change in the backend matters here. And it is not like bloom has changed in any significant way since d3c09b9.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Sep 27, 2021 at 04:19:27PM -0400, Tom Lane wrote:
>> I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
>> just now, and it passed fine. The laptop is not exactly the same
>> as longfin was in 2018, but it ought to be close enough. Not sure
>> what to make of that --- maybe the failure is only intermittent,
>> or else we fixed the underlying issue since then.

> Honestly, I have no idea what change in the backend matters here. And
> it is not like bloom has changed in any significant way since d3c09b9.

I went so far as to check out 03faa4a8dd on longfin's host, and I find that I cannot reproduce the failure shown at

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25+23%3A59%3A03

So that's the same hardware, and identical PG source tree, and different results. This seems to leave only two theories standing:

1. It was a since-fixed macOS bug. (Unlikely, especially if we also saw it on other platforms.)

2. The failure manifested only in the buildfarm, not under manual "make check". This is somewhat more plausible, especially since subsequent buildfarm script changes might then explain why it went away. But I have no idea what the "subsequent script changes" might've been.

			regards, tom lane
On 9/27/21 10:20 PM, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> On Mon, Sep 27, 2021 at 04:19:27PM -0400, Tom Lane wrote:
>>> I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
>>> just now, and it passed fine. The laptop is not exactly the same
>>> as longfin was in 2018, but it ought to be close enough. Not sure
>>> what to make of that --- maybe the failure is only intermittent,
>>> or else we fixed the underlying issue since then.
>> Honestly, I have no idea what change in the backend matters here. And
>> it is not like bloom has changed in any significant way since d3c09b9.
> I went so far as to check out 03faa4a8dd on longfin's host, and I find
> that I cannot reproduce the failure shown at
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25+23%3A59%3A03
>
> So that's the same hardware, and identical PG source tree, and different
> results. This seems to leave only two theories standing:
>
> 1. It was a since-fixed macOS bug. (Unlikely, especially if we also saw
> it on other platforms.)
>
> 2. The failure manifested only in the buildfarm, not under manual "make
> check". This is somewhat more plausible, especially since subsequent
> buildfarm script changes might then explain why it went away. But I have
> no idea what the "subsequent script changes" might've been.

Nothing I can think of.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
I wrote:
> So that's the same hardware, and identical PG source tree, and different
> results. This seems to leave only two theories standing:

I forgot theory 3: it's intermittent. Apparently the probability has dropped a lot since 2018, but behold:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2021-09-28%2014%3A20%3A41

(with successful runs just before and after this one, on the same animal)

Note that the delta is not exactly like the previous result, either. So there's more than one symptom, but in any case it seems like we have an issue in WAL replay. I wonder whether it's bloom's fault or a core bug.

			regards, tom lane
> On Sep 28, 2021, at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wonder whether it's bloom's fault
> or a core bug.

Looking closer at the TAP test, it's not ORDERing the result set from the SELECTs on either node, but it is comparing the sets for stringwise equality, which is certainly order dependent.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I wrote:
> So there's more than one symptom, but in any case it seems like
> we have an issue in WAL replay. I wonder whether it's bloom's fault
> or a core bug.

Actually ... I bet it's just the test script's fault. It waits for the standby to catch up like this:

    my $caughtup_query =
      "SELECT pg_current_wal_lsn() <= write_lsn FROM pg_stat_replication WHERE application_name = '$applname';";
    $node_primary->poll_query_until('postgres', $caughtup_query)
      or die "Timed out while waiting for standby 1 to catch up";

which seems like completely the wrong condition. Don't we need the standby to have *replayed* the WAL, not merely written it to disk?

I'm also wondering why this doesn't use wait_for_catchup, instead of reinventing the query to use.

			regards, tom lane
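A minimal sketch of a replay-based condition, assuming one wanted to keep a hand-rolled query rather than switch to the helper, might be:

    # Wait until the standby has *replayed*, not merely received and written,
    # all WAL up to the primary's current WAL position.
    my $caughtup_query =
      "SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication"
      . " WHERE application_name = '$applname';";
    $node_primary->poll_query_until('postgres', $caughtup_query)
      or die "Timed out while waiting for standby 1 to catch up";

In practice, calling wait_for_catchup is simpler, since its default mode waits on replay.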
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> Looking closer at the TAP test, it's not ORDERing the result set from the SELECTs on either node, but it is comparing the sets for stringwise equality, which is certainly order dependent.

Well, it's forcing a bitmap scan, so what we're getting is the native ordering of a bitmapscan result. That should match, given that what we're doing is physical replication. I think adding ORDER BY would be more likely to obscure real issues than fix test instability.

			regards, tom lane
> On Sep 28, 2021, at 11:07 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> Looking closer at the TAP test, it's not ORDERing the result set from the SELECTs on either node, but it is comparing the sets for stringwise equality, which is certainly order dependent.

Taking the output from the buildfarm page, parsing out the first test's results and comparing got vs. expected for this test:

    is($primary_result, $standby_result, "$test_name: query result matches");

the primary result had all the same rows as the standby, along with additional rows. Comparing the results, they match other than rows missing from the standby that are present on the primary. That seems consistent with the view that the query on the standby is running before all the data has replicated across.

However, the missing rows all have column i either 0 or 3, though the test round-robins i=0..9 as it performs the inserts. I would expect the wal for the inserts to not cluster around any particular value of i. The DELETE and VACUUM commands do operate on a single value of i, so that would make sense if the data failed to be deleted on the standby after successfully being deleted on the primary, but then I'd expect the standby to have more rows, not fewer.

Perhaps having the bloom index messed up answers that, though. I think it should be easy enough to get the path to the heap main table fork and the bloom main index fork for both the primary and standby and do a filesystem comparison as part of the wal test. That would tell us if they differ, and also if the differences are limited to just one or the other. I'll go write that up....

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
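A rough sketch of that filesystem comparison, assuming the relation names used by the bloom test and a checkpoint on both nodes to flush the buffers (this is a hypothetical outline, not the eventual write-up):

    # Compare the main-fork files of the heap and the bloom index byte for
    # byte between primary and standby. Relies on a physical standby using
    # the same relfilenode paths as the primary. (Hint-bit differences could
    # still cause false mismatches unless wal_log_hints or checksums are on.)
    use File::Compare;

    $node_primary->safe_psql('postgres', 'CHECKPOINT;');
    $node_standby->safe_psql('postgres', 'CHECKPOINT;');

    for my $rel ('tst', 'bloomidx')
    {
        my $path = $node_primary->safe_psql('postgres',
            "SELECT pg_relation_filepath('$rel');");
        is(File::Compare::compare(
                $node_primary->data_dir . "/$path",
                $node_standby->data_dir . "/$path"),
            0, "main fork of $rel matches on primary and standby");
    }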
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> Perhaps having the bloom index messed up answers that, though. I think it should be easy enough to get the path to the heap main table fork and the bloom main index fork for both the primary and standby and do a filesystem comparison as part of the wal test. That would tell us if they differ, and also if the differences are limited to just one or the other.

I think that's probably overkill, and definitely out-of-scope for contrib/bloom. If we fear that WAL replay is not reproducing the data accurately, we should be testing for that in some more centralized place.

Anyway, I confirmed my diagnosis by adding a delay in WAL apply (0001 below); that makes this test fall over spectacularly. And 0002 fixes it. So I propose to push 0002 as soon as the v14 release freeze ends.

Should we back-patch 0002? I'm inclined to think so. Should we then also back-patch enablement of the bloom test? Less sure about that, but I'd lean to doing so. A test that appears to be there but isn't actually invoked is pretty misleading.

			regards, tom lane

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..eecbe57aee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7370,6 +7370,9 @@ StartupXLOG(void)
 			{
 				bool		switchedTLI = false;
 
+				if (random() < INT_MAX/100)
+					pg_usleep(100000);
+
 #ifdef WAL_DEBUG
 				if (XLOG_DEBUG ||
 					(rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||

diff --git a/contrib/bloom/t/001_wal.pl b/contrib/bloom/t/001_wal.pl
index 55ad35926f..be8916a8eb 100644
--- a/contrib/bloom/t/001_wal.pl
+++ b/contrib/bloom/t/001_wal.pl
@@ -16,12 +16,10 @@ sub test_index_replay
 {
 	my ($test_name) = @_;
 
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
 	# Wait for standby to catch up
-	my $applname = $node_standby->name;
-	my $caughtup_query =
-	  "SELECT pg_current_wal_lsn() <= write_lsn FROM pg_stat_replication WHERE application_name = '$applname';";
-	$node_primary->poll_query_until('postgres', $caughtup_query)
-	  or die "Timed out while waiting for standby 1 to catch up";
+	$node_primary->wait_for_catchup($node_standby);
 
 	my $queries = qq(SET enable_seqscan=off;
 	SET enable_bitmapscan=on;
On Tue, Sep 28, 2021 at 03:00:13PM -0400, Tom Lane wrote:
> Should we back-patch 0002? I'm inclined to think so. Should
> we then also back-patch enablement of the bloom test? Less
> sure about that, but I'd lean to doing so. A test that appears
> to be there but isn't actually invoked is pretty misleading.

A backpatch sounds adapted to me for both patches. The only risk that I could see here is somebody implementing a new test by copy-pasting this one if we were to keep things as they are on stable branches.
--
Michael
On 9/28/21, 8:17 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> On Tue, Sep 28, 2021 at 03:00:13PM -0400, Tom Lane wrote:
>> Should we back-patch 0002? I'm inclined to think so. Should
>> we then also back-patch enablement of the bloom test? Less
>> sure about that, but I'd lean to doing so. A test that appears
>> to be there but isn't actually invoked is pretty misleading.
>
> A backpatch sounds adapted to me for both patches. The only risk that
> I could see here is somebody implementing a new test by copy-pasting
> this one if we were to keep things as they are on stable branches.

I found this thread via the Commitfest entry (https://commitfest.postgresql.org/35/3333/), and I also see that the following patches have been committed:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7d1aa6b
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6bc6bd4

However, it looks like there are a couple of other patches upthread [0] that attempt to ensure the tests pass for different settings of max_wal_size. Do we intend to proceed with those, or should we just close out the Commitfest entry?

Nathan

[0] https://postgr.es/m/C1D227C2-C271-4310-8C85-C5368C298622%40enterprisedb.com
> On Oct 21, 2021, at 3:23 PM, Bossart, Nathan <bossartn@amazon.com> wrote:
>
> Do we intend to proceed with those, or should we just
> close out the Commitfest entry?

I have withdrawn the patch. The issues were intermittent on the buildfarm, and committing other changes along with what Tom already committed would seem to confuse matters if any new issues were to arise. We can come back to this sometime in the future, if need be.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company