Thread: pgsql: Test replay of regression tests, attempt II.
Test replay of regression tests, attempt II.

See commit message for 123828a7fa563025d0ceee10cf1b2a253cd05319.  The only
change this time is the order of the arguments passed to pg_regress.  The
previous version broke in the build farm environment due to the contents of
EXTRA_REGRESS_OPTS (see also commit 8cade04c, which had to do something
similar).

Discussion: https://postgr.es/m/CA%2BhUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk%2BHeskxg%40mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/f47ed79cc8a0cfa154dc7f01faaf59822552363f

Modified Files
--------------
doc/src/sgml/regress.sgml                 | 11 +++++
src/test/recovery/Makefile                |  6 ++-
src/test/recovery/t/027_stream_regress.pl | 79 +++++++++++++++++++++++++++++++
src/tools/msvc/vcregress.pl               |  2 +
4 files changed, 97 insertions(+), 1 deletion(-)
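For context, the kind of invocation the commit is describing is sketched
below.  This is not the committed 027_stream_regress.pl code: the option
names are real pg_regress options, but the paths, the PG_REGRESS environment
variable and, in particular, where the EXTRA_REGRESS_OPTS words end up in the
command line are assumptions; that placement is exactly what the commit had
to adjust for the buildfarm.

    # Rough sketch only (not the committed code): drive pg_regress from a TAP
    # test against a node that is already running.  PG_REGRESS is assumed to
    # be exported by the Makefile; where @extra_opts is spliced in is the
    # order-of-arguments question the commit is about.
    use PostgreSQL::Test::Cluster;

    my $node_primary = PostgreSQL::Test::Cluster->new('primary');
    $node_primary->init(allows_streaming => 1);
    $node_primary->start;

    my @extra_opts = split ' ', ($ENV{EXTRA_REGRESS_OPTS} || '');
    my $rc = system($ENV{PG_REGRESS},
        @extra_opts,
        '--port', $node_primary->port,
        '--schedule', '../regress/parallel_schedule',
        '--use-existing',
        '--outputdir', $node_primary->basedir);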
On Mon, Jan 17, 2022 at 03:35:48AM +0000, Thomas Munro wrote:
> Test replay of regression tests, attempt II.
>
> See commit message for 123828a7fa563025d0ceee10cf1b2a253cd05319.  The
> only change this time is the order of the arguments passed to
> pg_regress.  The previous version broke in the build farm environment
> due to the contents of EXTRA_REGRESS_OPTS (see also commit 8cade04c
> which had to do something similar).
>
> Discussion: https://postgr.es/m/CA%2BhUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk%2BHeskxg%40mail.gmail.com

Butterflyfish has just failed in this new test:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=butterflyfish&br=HEAD

And the pg_regress command has failed one of the tests:
    reloptions                   ... FAILED     1377 ms

No idea what's going on here, but I guess that we'd better show the
contents of regression.diffs in the TAP logs if it exists.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Jan 17, 2022 at 03:35:48AM +0000, Thomas Munro wrote:
>> Test replay of regression tests, attempt II.

> Butterflyfish has just failed in this new test:
> https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=butterflyfish&br=HEAD

> And the pg_regress command has failed one of the tests:
>     reloptions                   ... FAILED     1377 ms

> No idea what's going on here, but I guess that we'd better show the
> contents of regression.diffs in the TAP logs if it exists.

I'm quite appalled that this patch has apparently added an entire new run
of the core regression tests to the standard check-world/buildfarm cycle.
Was any thought taken for the developer-time and buildfarm-load impact
that will have?  Neither of those resources is free.

                        regards, tom lane
I wrote:
> I'm quite appalled that this patch has apparently added an entire new run
> of the core regression tests to the standard check-world/buildfarm cycle.

Not only that, but it leaves junk that is neither cleaned by "make clean"
nor .gitignored:

$ git status
On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        primary.dump
        results/
        standby.dump

$ make clean
rm -rf tmp_check

These files should have been put into the tmp_check directory...
that's what it's for.

                        regards, tom lane
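One way to do that, sketched here on the assumption that the test uses the
Cluster.pm/Utils TAP API mentioned later in the thread and that
$node_primary/$node_standby are created earlier in the script: write the
dump files under the node's per-test directory, which already lives inside
tmp_check and is covered by the existing clean rules.  File names and the
comparison step are illustrative only.

    # Sketch only: keep primary.dump/standby.dump (and anything else the test
    # writes) inside tmp_check by using the node's own directory rather than
    # the test's current working directory.
    use PostgreSQL::Test::Utils;
    use Test::More;

    my $outputdir    = $node_primary->basedir;   # lives under tmp_check
    my $primary_dump = "$outputdir/primary.dump";
    my $standby_dump = "$outputdir/standby.dump";

    command_ok(
        [ 'pg_dumpall', '--no-sync', '-f', $primary_dump,
          '-p', $node_primary->port ],
        'dump primary contents');
    command_ok(
        [ 'pg_dumpall', '--no-sync', '-f', $standby_dump,
          '-p', $node_standby->port ],
        'dump standby contents');

    # A zero exit status from diff means the two clusters ended up identical.
    is(system('diff', $primary_dump, $standby_dump), 0, 'dumps match');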
On Tue, Jan 18, 2022 at 6:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > I'm quite appalled that this patch has apparently added an entire new run
> > of the core regression tests to the standard check-world/buildfarm cycle.

I wonder if there is a good way to share the resulting data directory
with the pg_upgrade test.

> These files should have been put into the tmp_check directory...
> that's what it's for.

Yeah, I mentioned this problem on the thread and was about to update
the clean target to remove them, but this sounds like a better idea.
Will fix.
On Mon, Jan 17, 2022 at 8:02 PM Michael Paquier <michael@paquier.xyz> wrote:
> Butterflyfish has just failed in this new test:
> https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=butterflyfish&br=HEAD
>
> And the pg_regress command has failed one of the tests:
>     reloptions                   ... FAILED     1377 ms

Also spurfowl and curculio.

> No idea what's going on here, but I guess that we'd better show the
> contents of regression.diffs in the TAP logs if it exists.

Hmm.  Yeah.
Hi,

On 2022-01-18 07:47:48 +1300, Thomas Munro wrote:
> > No idea what's going on here, but I guess that we'd better show the
> > contents of regression.diffs in the TAP logs if it exists.
>
> Hmm.  Yeah.

It already is in the regress_log_*, but I agree, it'd be good to add it
to the normal output in the failure case.
https://api.cirrus-ci.com/v1/artifact/task/6460573058596864/tap/src/test/recovery/tmp_check/log/regress_log_027_stream_regress

I think system_or_bail() might be the wrong tool here.  If I understand
Test::More's BAIL_OUT() correctly [1], it'll prevent other tests from
running as well.  Which doesn't seem quite right for a failure like this?
Kinda seems like nearly all of our uses of system_or_bail() are bad?

Greetings,

Andres Freund

[1]: https://perldoc.perl.org/Test::More#Test-control
"Indicates to the harness that things are going so badly all testing
should terminate.  This includes the running of any additional test
scripts."
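A minimal sketch of that combination (surface regression.diffs on failure,
and fail rather than bail), assuming the Cluster.pm/Utils TAP API and that
the pg_regress command line and output directory are assembled elsewhere in
the test script:

    # Sketch: run pg_regress with system() rather than system_or_bail(); on
    # failure, emit regression.diffs through diag() so it shows up in the TAP
    # output, and record an ordinary test failure instead of aborting every
    # remaining test script.
    use Test::More;
    use PostgreSQL::Test::Utils;

    my $rc = system(@pg_regress_command);            # command assembled earlier
    if ($rc == 0)
    {
        pass('regression tests pass');
    }
    else
    {
        my $diffs = "$outputdir/regression.diffs";   # location is an assumption
        diag(slurp_file($diffs)) if -e $diffs;
        fail('regression tests pass');
    }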
On Tue, Jan 18, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
> diff -w -U3 c:/cirrus/src/test/recovery/../regress/expected/reloptions.out c:/cirrus/src/test/recovery/results/reloptions.out
> --- c:/cirrus/src/test/recovery/../regress/expected/reloptions.out     2022-01-17 07:08:52.779337800 +0000
> +++ c:/cirrus/src/test/recovery/results/reloptions.out  2022-01-17 07:24:46.785394900 +0000
> @@ -133,7 +133,7 @@
>  SELECT pg_relation_size('reloptions_test') = 0;
>   ?column?
>  ----------
> - t
> + f
>  (1 row)
>
> and the server log:
> https://api.cirrus-ci.com/v1/artifact/task/6460573058596864/log/src/test/recovery/tmp_check/log/027_stream_regress_primary.log
>
> I'm not entirely clear on what the regression test is trying to test here.
>
> It's plausible the standby was a bit behind at that point and prevented
> VACUUM from truncating via hot_standby_feedback?

I don't think hot_standby_feedback is enabled.

There is some history of flakiness here, recently adjusted by the
following commit.  I'm not entirely sure what replication has to do with
it... perhaps it's just messing with CPU scheduling.

commit fe246d1c111d43fd60a1b0afff25ed09b7ae11eb
Author: Michael Paquier <michael@paquier.xyz>
Date:   Fri Apr 2 09:44:42 2021 +0900

    Improve stability of test with vacuum_truncate in reloptions.sql

Hmm... looking at that commit and the referenced discussion, I'm not
sure why that commit added only FREEZE and not also
DISABLE_PAGE_SKIPPING, following the example of:

commit c2dc1a79767a0f947e1145f82eb65dfe4360d25f
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Wed Jan 20 11:49:29 2021 -0500

    Disable vacuum page skipping in selected test cases.
On Tue, Jan 18, 2022 at 11:08 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> commit fe246d1c111d43fd60a1b0afff25ed09b7ae11eb
> Author: Michael Paquier <michael@paquier.xyz>
> Date:   Fri Apr 2 09:44:42 2021 +0900
>
>     Improve stability of test with vacuum_truncate in reloptions.sql
>
> Hmm... looking at that commit and the referenced discussion, I'm not
> sure why that commit added only FREEZE and not also
> DISABLE_PAGE_SKIPPING, following the example of:
>
> commit c2dc1a79767a0f947e1145f82eb65dfe4360d25f
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Wed Jan 20 11:49:29 2021 -0500
>
>     Disable vacuum page skipping in selected test cases.

Alright, I've pushed a change like that.  Let's see if that clears it up.

(From the curious coincidences department: the earlier complaint[1] also
involved some kind of replication (Postgres Pro multimaster, about which
I know nothing).)

[1] https://www.postgresql.org/message-id/flat/87tuotr2hh.fsf%40ars-thinkpad
Thomas Munro <thomas.munro@gmail.com> writes:
> Alright, I've pushed a change like that.  Let's see if that clears it up.

Nope:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2022-01-18%2019%3A50%3A57

That reloptions test has been there awhile, and we weren't seeing
issues with it before.  What about the replication environment
would cause VACUUM to behave differently?  Maybe it thinks there
are non-removable tuples because the walsender is holding back
global xmin?

                        regards, tom lane
On 2022-01-18 15:15:44 -0500, Tom Lane wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2022-01-18%2019%3A50%3A57
>
> That reloptions test has been there awhile, and we weren't seeing
> issues with it before.  What about the replication environment
> would cause VACUUM to behave differently?  Maybe it thinks there
> are non-removable tuples because the walsender is holding back
> global xmin?

hot_standby_feedback isn't enabled, so it shouldn't...

I wonder if it's just a question of the different configuration used by
Cluster.pm, rather than the fact that a replica is actually connected.
Locally I see the following additions to the configuration for the primary
in 027_stream_regress.pl:

fsync = off
restart_after_crash = off
log_line_prefix = '%m [%p] %q%a '
log_statement = all
log_replication_commands = on
wal_retrieve_retry_interval = '500ms'
stats_temp_directory = 'pg_stat_tmp'
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
wal_log_hints = on
hot_standby = on
shared_buffers = 1MB
max_connections 25
max_wal_size = 128MB
port = 63698
unix_socket_directories = '/tmp/098ygO1bAF'
listen_addresses = ''
max_prepared_transactions = 10

That's an extremely small shared_buffers for running the regression tests,
it'd not be surprising if that provoked problems we don't otherwise see.
Perhaps VACUUM ends up skipping over a page because of page contention?

Also, it's odd that there's "max_connections 25" without an equal sign.
I'd kind of expected that to cause an error....

Greetings,

Andres Freund
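For anyone who wants to poke at that hypothesis, the Cluster.pm defaults can
be overridden from the test script itself; a sketch, with a purely
illustrative value and $node_primary being the node set up earlier in
027_stream_regress.pl:

    # Sketch: raise shared_buffers above the tiny Cluster.pm default for this
    # test only, to see whether the reloptions failure still reproduces.
    $node_primary->append_conf('postgresql.conf', 'shared_buffers = 128MB');
    $node_primary->restart;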
Andres Freund <andres@anarazel.de> writes:
> That's an extremely small shared_buffers for running the regression tests, it'd not
> be surprising if that provoked problems we don't otherwise see.  Perhaps VACUUM
> ends up skipping over a page because of page contention?

Hmm, good thought.  I tried running the test with even smaller
shared_buffers, but could not make the reloptions test fall over for
me.  But this theory implies a strong timing dependency, so it might
still only happen on particular machines.  (If anyone else tries it:
below about 400kB, other tests start failing with "no free unpinned
buffers" and the like.)

> Also, it's odd that there's "max_connections 25" without an equal sign.  I'd
> kind of expected that to cause an error....

I see that guc.c intentionally allows the equal sign to be optional.
Too lazy to check if that's documented.  It is weird that this one
test appears to be intentionally testing that, though, because it's
sure got nada to do with the expressed purpose of the test.

                        regards, tom lane
On Wed, Jan 19, 2022 at 11:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Also, it's odd that there's "max_connections 25" without an equal sign.  I'd
> > kind of expected that to cause an error....
>
> I see that guc.c intentionally allows the equal sign to be optional.
> Too lazy to check if that's documented.  It is weird that this one
> test appears to be intentionally testing that, though, because it's
> sure got nada to do with the expressed purpose of the test.

That's coming from here:

    $node_primary->adjust_conf('postgresql.conf', 'max_connections', '25', 1);

That's passed to:

    =item $node->adjust_conf(filename, setting, value, skip_equals)

That 1 was unintentional (and I don't see anything else using this
facility), but it doesn't seem to hurt anything since it's parsed just
fine.  I'll take it out in passing in a future commit once we figure
out what's up with reloptions...
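The eventual cleanup would presumably just drop that fourth argument so the
setting is written with an equals sign like everything else; a sketch of the
corrected call:

    # Same setting, default skip_equals behaviour, so the line is written
    # out as "max_connections = 25".
    $node_primary->adjust_conf('postgresql.conf', 'max_connections', '25');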
Hi,

On 2022-01-18 17:19:06 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > That's an extremely small shared_buffers for running the regression tests, it'd not
> > be surprising if that provoked problems we don't otherwise see.  Perhaps VACUUM
> > ends up skipping over a page because of page contention?
>
> Hmm, good thought.  I tried running the test with even smaller
> shared_buffers, but could not make the reloptions test fall over for
> me.  But this theory implies a strong timing dependency, so it might
> still only happen on particular machines.  (If anyone else tries it:
> below about 400kB, other tests start failing with "no free unpinned
> buffers" and the like.)

I ran the test in a loop 200+ times now, without reproducing the problem.
Rorqual runs on a shared machine though, so it's quite possible that IO
will be slower there, triggering the issue.

I was wondering whether we could use VACUUM VERBOSE for that specific
VACUUM - that'd show information about the number of pages with tuples
etc.  But I don't currently see a way of doing that without causing the
regression tests to fail.

Even if I set client_min_messages=error, the messages still get sent to
the client, because elevel == INFO is special-cased in
should_output_to_client().  And I don't see a way of redirecting the
output of common.c:NoticeProcessor() in psql either.

Greetings,

Andres Freund
On Wed, Jan 19, 2022 at 12:08 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-18 17:19:06 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > That's an extremely small shared_buffers for running the regression tests, it'd not
> > > be surprising if that provoked problems we don't otherwise see.  Perhaps VACUUM
> > > ends up skipping over a page because of page contention?
> >
> > Hmm, good thought.  I tried running the test with even smaller
> > shared_buffers, but could not make the reloptions test fall over for
> > me.  But this theory implies a strong timing dependency, so it might
> > still only happen on particular machines.  (If anyone else tries it:
> > below about 400kB, other tests start failing with "no free unpinned
> > buffers" and the like.)
>
> I ran the test in a loop 200+ times now, without reproducing the
> problem.  Rorqual runs on a shared machine though, so it's quite possible that
> IO will be slower there, triggering the issue.
>
> I was wondering whether we could use VACUUM VERBOSE for that specific VACUUM -
> that'd show information about the number of pages with tuples etc.  But I don't
> currently see a way of doing that without causing the regression tests to fail.
>
> Even if I set client_min_messages=error, the messages still get sent to the
> client, because elevel == INFO is special-cased in
> should_output_to_client().  And I don't see a way of redirecting the output of
> common.c:NoticeProcessor() in psql either.

I hacked a branch thusly:

@@ -327,6 +327,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
        verbose = (params->options & VACOPT_VERBOSE) != 0;
        instrument = (verbose || (IsAutoVacuumWorkerProcess() &&
                                  params->log_min_duration >= 0));
+       instrument = true;

        if (instrument)
        {
                pg_rusage_init(&ru0);

Having failed to reproduce this locally, I clicked on "re-run tests"
all afternoon on CI until eventually I captured a failure log[1] there,
with the smoking gun:

    pages: 0 removed, 1 remain, 1 skipped due to pins, 0 skipped frozen

There are three places that skip and bump that counter, but two of them
were disabled when I added DISABLE_PAGE_SKIPPING, leaving this one:

                LockBuffer(buf, BUFFER_LOCK_SHARE);
                if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
                {
                        UnlockReleaseBuffer(buf);
                        vacrel->scanned_pages++;
                        vacrel->pinskipped_pages++;
                        if (hastup)
                                vacrel->nonempty_pages = blkno + 1;
                        continue;
                }

Since this page doesn't require wraparound vacuuming, if we fail to
conditionally acquire the cleanup lock, this block skips the page.

[1] https://api.cirrus-ci.com/v1/artifact/task/5096848598761472/log/src/test/recovery/tmp_check/log/027_stream_regress_primary.log
Hi,

On 2022-01-20 17:23:30 +1300, Thomas Munro wrote:
> Having failed to reproduce this locally, I clicked on "re-run tests"
> all afternoon on CI until eventually I captured a failure log[1]
> there, with the smoking gun:

Phew, finally.

> Since this page doesn't require wraparound vacuuming, if we fail to
> conditionally acquire the cleanup lock, this block skips the page.

The most likely source for such a pin is bgwriter, I think?

It's not actually *immediately* obvious why the page doesn't require
wraparound vacuuming, as VACUUM FREEZE should use a very aggressive
cutoff.  The reason is presumably that there are concurrent backends
holding back the horizon.  Note that the oldest xmin in the log below
(15356 for both vacuums) is older than the xids from the intentionally
failed insert (15465).

I wonder if the easiest way to make this test reliable would be to make
the table a temporary one?  That now uses very aggressive horizons,
there's no bgwriter that could pin the page, etc.

Regards,

Andres

2022-01-20 03:58:31.966 UTC [32584][client backend] [pg_regress/reloptions][4/5416:0] LOG:  statement: INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
2022-01-20 03:58:31.967 UTC [32584][client backend] [pg_regress/reloptions][4/5416:15465] ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
2022-01-20 03:58:31.967 UTC [32584][client backend] [pg_regress/reloptions][4/5416:15465] DETAIL:  Failing row contains (null, null).
2022-01-20 03:58:31.967 UTC [32584][client backend] [pg_regress/reloptions][4/5416:15465] STATEMENT:  INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
2022-01-20 03:58:31.970 UTC [32584][client backend] [pg_regress/reloptions][4/5417:0] LOG:  statement: VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
2022-01-20 03:58:31.973 UTC [32584][client backend] [pg_regress/reloptions][4/5418:0] LOG:  automatic aggressive vacuum of table "regression.public.reloptions_test": index scans: 0
        pages: 0 removed, 1 remain, 1 skipped due to pins, 0 skipped frozen
        tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 15356
        index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
        avg read rate: 372.024 MB/s, avg write rate: 0.000 MB/s
        buffer usage: 3 hits, 6 misses, 0 dirtied
        WAL usage: 1 records, 0 full page images, 237 bytes
        system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-01-20 03:58:31.973 UTC [32584][client backend] [pg_regress/reloptions][4/5418:0] STATEMENT:  VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
2022-01-20 03:58:31.973 UTC [32584][client backend] [pg_regress/reloptions][4/5419:0] LOG:  automatic aggressive vacuum of table "regression.pg_toast.pg_toast_35786": index scans: 0
        pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
        tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 15356
        avg read rate: 790.028 MB/s, avg write rate: 0.000 MB/s
        buffer usage: 13 hits, 9 misses, 0 dirtied
        WAL usage: 0 records, 0 full page images, 0 bytes
        system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-01-20 03:58:31.973 UTC [32584][client backend] [pg_regress/reloptions][4/5419:0] STATEMENT:  VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
On Thu, Jan 20, 2022 at 6:24 PM Andres Freund <andres@anarazel.de> wrote:
> I wonder if the easiest way to make this test reliable would be to make
> the table a temporary one?  That now uses very aggressive horizons,
> there's no bgwriter that could pin the page, etc.

Good idea, thanks.  I pushed that minimal change.

Skipping over some other unrelated recoveryCheck failures showing in the
BF today[1], next up we have another kind of failure on these Linux sparc
animals:

2022-01-19 16:39:48.768 CET [9703:4] DETAIL:  Last completed transaction was at log time 2022-01-19 16:39:48.669624+01.
2022-01-19 16:39:54.629 CET [9703:5] LOG:  restartpoint starting: wal
2022-01-19 16:39:55.180 CET [9705:5] LOG:  incorrect resource manager data checksum in record at 0/AD445A8

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-01-19%2015%3A44%3A59
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2022-01-19%2015%3A14%3A06

That would have kept me busy for a long time, if Noah hadn't recently
diagnosed an ext4-on-sparc bug where you sometimes read bogus zeroes
from a file that is being concurrently written[2], which of course
breaks streaming replication and thus this test.  To turn that green,
I guess we'll need to switch to another filesystem, or wait for a
patched kernel.

[1] https://www.postgresql.org/message-id/CA%2BhUKGKV6fOHvfiPt8%3DdOKzvswjAyLoFoJF1iQXMNpi7%2BhD1JQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20220116210241.GC756210%40rfd.leadboat.com