Thread: pgsql: Test replay of regression tests, attempt II.

pgsql: Test replay of regression tests, attempt II.

From: Thomas Munro

Test replay of regression tests, attempt II.

See commit message for 123828a7fa563025d0ceee10cf1b2a253cd05319.  The
only change this time is the order of the arguments passed to
pg_regress.  The previous version broke in the build farm environment
due to the contents of EXTRA_REGRESS_OPTS (see also commit 8cade04c
which had to do something similar).

Discussion: https://postgr.es/m/CA%2BhUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk%2BHeskxg%40mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/f47ed79cc8a0cfa154dc7f01faaf59822552363f

Modified Files
--------------
doc/src/sgml/regress.sgml                 | 11 +++++
src/test/recovery/Makefile                |  6 ++-
src/test/recovery/t/027_stream_regress.pl | 79 +++++++++++++++++++++++++++++++
src/tools/msvc/vcregress.pl               |  2 +
4 files changed, 97 insertions(+), 1 deletion(-)


Re: pgsql: Test replay of regression tests, attempt II.

From: Michael Paquier

On Mon, Jan 17, 2022 at 03:35:48AM +0000, Thomas Munro wrote:
> Test replay of regression tests, attempt II.
>
> See commit message for 123828a7fa563025d0ceee10cf1b2a253cd05319.  The
> only change this time is the order of the arguments passed to
> pg_regress.  The previous version broke in the build farm environment
> due to the contents of EXTRA_REGRESS_OPTS (see also commit 8cade04c
> which had to do something similar).
>
> Discussion: https://postgr.es/m/CA%2BhUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk%2BHeskxg%40mail.gmail.com

Butterflyfish has just failed in this new test:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=butterflyfish&br=HEAD

And the pg_regress command has failed one of the tests:
     reloptions                   ... FAILED     1377 ms

No idea what's going on here, but I guess that we'd better show the
contents of regression.diffs in the TAP logs if it exists.
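
Perhaps something like the following would do, an untested sketch
(slurp_file() comes from PostgreSQL::Test::Utils; $outputdir here is a
stand-in for wherever the test points pg_regress's --outputdir):

use Test::More;
use PostgreSQL::Test::Utils;

my $diffs = "$outputdir/regression.diffs";
if (-e $diffs)
{
    # diag() writes to the TAP output without affecting the test plan.
    diag("=== contents of regression.diffs ===\n" . slurp_file($diffs));
}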
--
Michael


Re: pgsql: Test replay of regression tests, attempt II.

From: Tom Lane

Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Jan 17, 2022 at 03:35:48AM +0000, Thomas Munro wrote:
>> Test replay of regression tests, attempt II.

> Butterflyfish has just failed in this new test:
> https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=butterflyfish&br=HEAD
> And the pg_regress command has failed one of the tests:
>      reloptions                   ... FAILED     1377 ms
> No idea what's going on here, but I guess that we'd better show the
> contents of regression.diffs in the TAP logs if it exists.

I'm quite appalled that this patch has apparently added an entire new run
of the core regression tests to the standard check-world/buildfarm cycle.
Was any thought taken for the developer-time and buildfarm-load impact
that will have?  Neither of those resources is free.

            regards, tom lane



Re: pgsql: Test replay of regression tests, attempt II.

From: Tom Lane

I wrote:
> I'm quite appalled that this patch has apparently added an entire new run
> of the core regression tests to the standard check-world/buildfarm cycle.

Not only that, but it leaves junk that is neither cleaned by "make clean"
nor .gitignored:

$ git status
On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        primary.dump
        results/
        standby.dump

$ make clean
rm -rf tmp_check

These files should have been put into the tmp_check directory...
that's what it's for.

            regards, tom lane



Re: pgsql: Test replay of regression tests, attempt II.

From: Thomas Munro

On Tue, Jan 18, 2022 at 6:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > I'm quite appalled that this patch has apparently added an entire new run
> > of the core regression tests to the standard check-world/buildfarm cycle.

I wonder if there is a good way to share the resulting data directory
with the pg_upgrade test.

> These files should have been put into the tmp_check directory...
> that's what it's for.

Yeah, I mentioned this problem on the thread and was about to update
the clean target to remove them, but this sounds like a better idea.
Will fix.
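
For the dump files, presumably something like this (untested sketch;
tempdir() from PostgreSQL::Test::Utils creates a directory under
tmp_check that is cleaned up automatically):

use PostgreSQL::Test::Utils;

# Write the dumps under tmp_check instead of the CWD, so that
# "make clean" (rm -rf tmp_check) already covers them.
my $dumpdir = tempdir();

command_ok(
    [ 'pg_dumpall', '--no-sync', '-p', $node_primary->port,
      '-f', "$dumpdir/primary.dump" ],
    'dump primary server');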



Re: pgsql: Test replay of regression tests, attempt II.

From: Thomas Munro

On Mon, Jan 17, 2022 at 8:02 PM Michael Paquier <michael@paquier.xyz> wrote:
> Butterflyfish has just failed in this new test:
> https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=butterflyfish&br=HEAD
>
> And the pg_regress command has failed one of the tests:
>      reloptions                   ... FAILED     1377 ms

Also spurfowl and curculio.

> No idea what's going on here, but I guess that we'd better show the
> contents of regression.diffs in the TAP logs if it exists.

Hmm.  Yeah.



Re: pgsql: Test replay of regression tests, attempt II.

From: Andres Freund

Hi,

On 2022-01-18 07:47:48 +1300, Thomas Munro wrote:
> > No idea what's going on here, but I guess that we'd better show the
> > contents of regression.diffs in the TAP logs if it exists.
> 
> Hmm.  Yeah.

It already is in the regress_log_*, but I agree, it'd be good to add it to the
normal output in the failure case.

https://api.cirrus-ci.com/v1/artifact/task/6460573058596864/tap/src/test/recovery/tmp_check/log/regress_log_027_stream_regress

I think system_or_bail() might be the wrong tool here. If I understand
Test::More's BAIL_OUT() correctly [1], it'll prevent other tests from running as
well. Which doesn't seem quite right for a failure like this?

Kinda seems like nearly all of our uses of system_or_bail() are bad?
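
For comparison, something like this would report an ordinary test
failure instead and let the remaining scripts run (untested sketch;
system_log() is the existing helper that runs a command and returns
system()'s exit status, and @pg_regress_command is a stand-in for the
full pg_regress invocation):

use Test::More;
use PostgreSQL::Test::Utils;

my $rc = system_log(@pg_regress_command);
is($rc, 0, 'regression tests pass');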

Greetings,

Andres Freund

[1]: https://perldoc.perl.org/Test::More#Test-control
"Indicates to the harness that things are going so badly all testing
should terminate. This includes the running of any additional test scripts."



Re: pgsql: Test replay of regression tests, attempt II.

From: Thomas Munro

On Tue, Jan 18, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
> diff -w -U3 c:/cirrus/src/test/recovery/../regress/expected/reloptions.out c:/cirrus/src/test/recovery/results/reloptions.out
> --- c:/cirrus/src/test/recovery/../regress/expected/reloptions.out      2022-01-17 07:08:52.779337800 +0000
> +++ c:/cirrus/src/test/recovery/results/reloptions.out  2022-01-17 07:24:46.785394900 +0000
> @@ -133,7 +133,7 @@
>  SELECT pg_relation_size('reloptions_test') = 0;
>   ?column?
>  ----------
> - t
> + f
>  (1 row)
>
> and the server log:
>
> https://api.cirrus-ci.com/v1/artifact/task/6460573058596864/log/src/test/recovery/tmp_check/log/027_stream_regress_primary.log
>
> I'm not entirely clear on what the regression test is trying to test here.
>
>
> It's plausible the standby was a bit behind at that point and prevented
> VACUUM from truncating via hot_standby_feeback?

I don't think hot_standby_feedback is enabled.  There is some history
of flakiness here, recently adjusted by the following commit.  I'm not
entirely sure what replication has to do with it... perhaps it's just
messing with CPU scheduling.

commit fe246d1c111d43fd60a1b0afff25ed09b7ae11eb
Author: Michael Paquier <michael@paquier.xyz>
Date:   Fri Apr 2 09:44:42 2021 +0900

    Improve stability of test with vacuum_truncate in reloptions.sql

Hmm... looking at that commit and the referenced discussion, I'm not
sure why that commit added only FREEZE and not also
DISABLE_PAGE_SKIPPING, following the example of:

commit c2dc1a79767a0f947e1145f82eb65dfe4360d25f
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Wed Jan 20 11:49:29 2021 -0500

    Disable vacuum page skipping in selected test cases.



Re: pgsql: Test replay of regression tests, attempt II.

From: Thomas Munro

On Tue, Jan 18, 2022 at 11:08 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> commit fe246d1c111d43fd60a1b0afff25ed09b7ae11eb
> Author: Michael Paquier <michael@paquier.xyz>
> Date:   Fri Apr 2 09:44:42 2021 +0900
>
>     Improve stability of test with vacuum_truncate in reloptions.sql
>
> Hmm... looking at that commit and the referenced discussion, I'm not
> sure why that commit added only FREEZE and not also
> DISABLE_PAGE_SKIPPING, following the example of:
>
> commit c2dc1a79767a0f947e1145f82eb65dfe4360d25f
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Wed Jan 20 11:49:29 2021 -0500
>
>     Disable vacuum page skipping in selected test cases.

Alright, I've pushed a change like that.  Let's see if that clears it
up.  (From the curious coincidences department: the earlier
complaint[1] also involved some kind of replication (Postgres Pro
multimaster, about which I know nothing).)

[1] https://www.postgresql.org/message-id/flat/87tuotr2hh.fsf%40ars-thinkpad



Re: pgsql: Test replay of regression tests, attempt II.

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> Alright, I've pushed a change like that.  Let's see if that clears it
> up.

Nope:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2022-01-18%2019%3A50%3A57

That reloptions test has been there awhile, and we weren't seeing
issues with it before.  What about the replication environment
would cause VACUUM to behave differently?  Maybe it thinks there
are non-removable tuples because the walsender is holding back
global xmin?

            regards, tom lane



Re: pgsql: Test replay of regression tests, attempt II.

From: Andres Freund

On 2022-01-18 15:15:44 -0500, Tom Lane wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2022-01-18%2019%3A50%3A57
> 
> That reloptions test has been there awhile, and we weren't seeing
> issues with it before.  What about the replication environment
> would cause VACUUM to behave differently?  Maybe it thinks there
> are non-removable tuples because the walsender is holding back
> global xmin?

hot_standby_feedback isn't enabled, so it shouldn't...

I wonder if it's just a question of the different configuration used by
Cluster.pm, rather than the fact that a replica is actually connected.

Locally I see the following additions to the configuration for the primary in
027_stream_regress.pl

  fsync = off
  restart_after_crash = off
  log_line_prefix = '%m [%p] %q%a '
  log_statement = all
  log_replication_commands = on
  wal_retrieve_retry_interval = '500ms'
  stats_temp_directory = 'pg_stat_tmp'
  wal_level = replica
  max_wal_senders = 10
  max_replication_slots = 10
  wal_log_hints = on
  hot_standby = on
  shared_buffers = 1MB
  max_connections 25
  max_wal_size = 128MB
  port = 63698
  unix_socket_directories = '/tmp/098ygO1bAF'
  listen_addresses = ''
  max_prepared_transactions = 10

That's an extremely small shared_buffers for running the regression tests;
it wouldn't be surprising if that provoked problems we don't otherwise see.
Perhaps VACUUM ends up skipping over a page because of page contention?
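
For local experiments it's at least easy to crank that up in the test
script before the node is started, e.g. (untested):

# Hypothetical tweak: give the 027 primary a more realistic buffer pool.
$node_primary->append_conf('postgresql.conf', 'shared_buffers = 128MB');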

Also, it's odd that there's "max_connections 25" without an equal sign. I'd
kind of expected that to cause an error....

Greetings,

Andres Freund



Re: pgsql: Test replay of regression tests, attempt II.

From: Tom Lane

Andres Freund <andres@anarazel.de> writes:
> That's an extremely small shared_buffers for running the regression tests;
> it wouldn't be surprising if that provoked problems we don't otherwise see.
> Perhaps VACUUM ends up skipping over a page because of page contention?

Hmm, good thought.  I tried running the test with even smaller
shared_buffers, but could not make the reloptions test fall over for
me.  But this theory implies a strong timing dependency, so it might
still only happen on particular machines.  (If anyone else tries it:
below about 400kB, other tests start failing with "no free unpinned
buffers" and the like.)

> Also, it's odd that there's "max_connections 25" without an equal sign. I'd
> kind of expected that to cause an error....

I see that guc.c intentionally allows the equal sign to be optional.
Too lazy to check if that's documented.  It is weird that this one
test appears to be intentionally testing that, though, because it's
sure got nada to do with the expressed purpose of the test.

            regards, tom lane



Re: pgsql: Test replay of regression tests, attempt II.

From: Thomas Munro

On Wed, Jan 19, 2022 at 11:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Also, it's odd that there's "max_connections 25" without an equal sign. I'd
> > kind of expected that to cause an error....
>
> I see that guc.c intentionally allows the equal sign to be optional.
> Too lazy to check if that's documented.  It is weird that this one
> test appears to be intentionally testing that, though, because it's
> sure got nada to do with the expressed purpose of the test.

That's coming from here:

$node_primary->adjust_conf('postgresql.conf', 'max_connections', '25', 1);

That's passed to:

=item $node->adjust_conf(filename, setting, value, skip_equals)

That 1 was unintentional (and I don't see anything else using this
facility), but it doesn't seem to hurt anything since it's parsed just
fine.  I'll take it out in passing in a future commit once we figure
out what's up with reloptions...
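
That is, presumably just:

$node_primary->adjust_conf('postgresql.conf', 'max_connections', '25');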



Re: pgsql: Test replay of regression tests, attempt II.

From: Andres Freund

Hi,

On 2022-01-18 17:19:06 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > That's an extremely small shared_buffers for running the regression tests;
> > it wouldn't be surprising if that provoked problems we don't otherwise see.
> > Perhaps VACUUM ends up skipping over a page because of page contention?
> 
> Hmm, good thought.  I tried running the test with even smaller
> shared_buffers, but could not make the reloptions test fall over for
> me.  But this theory implies a strong timing dependency, so it might
> still only happen on particular machines.  (If anyone else tries it:
> below about 400kB, other tests start failing with "no free unpinned
> buffers" and the like.)

I ran the test in a loop 200+ times now, without reproducing the
problem.  Rorqual runs on a shared machine though, so it's quite possible
that I/O is slower there, triggering the issue.

I was wondering whether we could use VACUUM VERBOSE for that specific VACUUM -
that'd show information about the number of pages with tuples etc.  But I don't
currently see a way of doing that without causing the regression tests to fail.

Even if I set client_min_messages=error, the messages still get sent to the
client, because elevel == INFO is special cased in
should_output_to_client(). And I don't see a way of redirecting the output of
common.c:NoticeProcessor() in psql either.

Greetings,

Andres Freund



Re: pgsql: Test replay of regression tests, attempt II.

From: Thomas Munro

On Wed, Jan 19, 2022 at 12:08 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-18 17:19:06 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > That's an extremely small shared_buffers for running the regression tests;
> > > it wouldn't be surprising if that provoked problems we don't otherwise see.
> > > Perhaps VACUUM ends up skipping over a page because of page contention?
> >
> > Hmm, good thought.  I tried running the test with even smaller
> > shared_buffers, but could not make the reloptions test fall over for
> > me.  But this theory implies a strong timing dependency, so it might
> > still only happen on particular machines.  (If anyone else tries it:
> > below about 400kB, other tests start failing with "no free unpinned
> > buffers" and the like.)
>
> I ran the test in a loop 200+ times now, without reproducing the
> problem.  Rorqual runs on a shared machine though, so it's quite possible
> that I/O is slower there, triggering the issue.
>
> I was wondering whether we could use VACUUM VERBOSE for that specific VACUUM -
> that'd show information about the number of pages with tuples etc.  But I don't
> currently see a way of doing that without causing the regression tests to fail.
>
> Even if I set client_min_messages=error, the messages still get sent to the
> client, because elevel == INFO is special cased in
> should_output_to_client(). And I don't see a way of redirecting the output of
> common.c:NoticeProcessor() in psql either.

I hacked a branch thusly:

@@ -327,6 +327,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
        verbose = (params->options & VACOPT_VERBOSE) != 0;
        instrument = (verbose || (IsAutoVacuumWorkerProcess() &&
                                  params->log_min_duration >= 0));
+       instrument = true;
        if (instrument)
        {
                pg_rusage_init(&ru0);

Having failed to reproduce this locally, I clicked on "re-run tests"
all afternoon on CI until eventually I captured a failure log[1]
there, with the smoking gun:

pages: 0 removed, 1 remain, 1 skipped due to pins, 0 skipped frozen

There are three places that skip and bump that counter, but two of
them were disabled when I added DISABLE_PAGE_SKIPPING, leaving this
one:

            LockBuffer(buf, BUFFER_LOCK_SHARE);
            if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
            {
                UnlockReleaseBuffer(buf);
                vacrel->scanned_pages++;
                vacrel->pinskipped_pages++;
                if (hastup)
                    vacrel->nonempty_pages = blkno + 1;
                continue;
            }

Since this page doesn't require wraparound vacuuming, if we fail to
conditionally acquire the cleanup lock, this block skips the page.

[1]
https://api.cirrus-ci.com/v1/artifact/task/5096848598761472/log/src/test/recovery/tmp_check/log/027_stream_regress_primary.log



Re: pgsql: Test replay of regression tests, attempt II.

From: Andres Freund

Hi,

On 2022-01-20 17:23:30 +1300, Thomas Munro wrote:
> Having failed to reproduce this locally, I clicked on "re-run tests"
> all afternoon on CI until eventually I captured a failure log[1]
> there, with the smoking gun:

Phew, finally.


> Since this page doesn't require wraparound vacuuming, if we fail to
> conditionally acquire the cleanup lock, this block skips the page.

The most likely source for such a pin is bgwriter, I think?

It's not actually *immediately* obvious why the page doesn't require
wraparound vacuuming, as VACUUM FREEZE should use a very aggressive cutoff.  The reason is
presumably that there are concurrent backends holding back the horizon. Note
that the oldest xmin in the log below (15356 for both vacuums) is older than
the xids from the intentionally failed insert (15465).

I wonder if the easiest way to make this test reliable would be to make the
table a temporary one? That now uses very aggressive horizons, there's no
bgwriter that could pin the page, etc.

Regards,

Andres


2022-01-20 03:58:31.966 UTC [32584][client backend] [pg_regress/reloptions][4/5416:0] LOG:  statement: INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
2022-01-20 03:58:31.967 UTC [32584][client backend] [pg_regress/reloptions][4/5416:15465] ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
2022-01-20 03:58:31.967 UTC [32584][client backend] [pg_regress/reloptions][4/5416:15465] DETAIL:  Failing row contains (null, null).
2022-01-20 03:58:31.967 UTC [32584][client backend] [pg_regress/reloptions][4/5416:15465] STATEMENT:  INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);

2022-01-20 03:58:31.970 UTC [32584][client backend] [pg_regress/reloptions][4/5417:0] LOG:  statement: VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
2022-01-20 03:58:31.973 UTC [32584][client backend] [pg_regress/reloptions][4/5418:0] LOG:  automatic aggressive vacuum of table "regression.public.reloptions_test": index scans: 0
    pages: 0 removed, 1 remain, 1 skipped due to pins, 0 skipped frozen
    tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 15356
    index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
    avg read rate: 372.024 MB/s, avg write rate: 0.000 MB/s
    buffer usage: 3 hits, 6 misses, 0 dirtied
    WAL usage: 1 records, 0 full page images, 237 bytes
    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-01-20 03:58:31.973 UTC [32584][client backend] [pg_regress/reloptions][4/5418:0] STATEMENT:  VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
2022-01-20 03:58:31.973 UTC [32584][client backend] [pg_regress/reloptions][4/5419:0] LOG:  automatic aggressive vacuum of table "regression.pg_toast.pg_toast_35786": index scans: 0
    pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
    tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 15356
    avg read rate: 790.028 MB/s, avg write rate: 0.000 MB/s
    buffer usage: 13 hits, 9 misses, 0 dirtied
    WAL usage: 0 records, 0 full page images, 0 bytes
    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-01-20 03:58:31.973 UTC [32584][client backend] [pg_regress/reloptions][4/5419:0] STATEMENT:  VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;




Re: pgsql: Test replay of regression tests, attempt II.

From: Thomas Munro

On Thu, Jan 20, 2022 at 6:24 PM Andres Freund <andres@anarazel.de> wrote:
> I wonder if the easiest way to make this test reliable would be to make the
> table a temporary one? That now uses very aggressive horizons, there's no
> bgwriter that could pin the page, etc.

Good idea, thanks.  I pushed that minimal change.

Skipping over some other unrelated recoveryCheck failures showing in
the BF today[1], next up we have another kind of failure on these
Linux sparc animals:

2022-01-19 16:39:48.768 CET [9703:4] DETAIL:  Last completed
transaction was at log time 2022-01-19 16:39:48.669624+01.
2022-01-19 16:39:54.629 CET [9703:5] LOG:  restartpoint starting: wal
2022-01-19 16:39:55.180 CET [9705:5] LOG:  incorrect resource manager
data checksum in record at 0/AD445A8

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-01-19%2015%3A44%3A59
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2022-01-19%2015%3A14%3A06

That would have kept me busy for a long time, if Noah hadn't recently
diagnosed an ext4-on-sparc bug where you sometimes read bogus zeroes
from a file that is being concurrently written[2], which of course
breaks streaming replication and thus this test. To turn that green, I
guess we'll need to switch to another filesystem, or wait for a
patched kernel.

[1] https://www.postgresql.org/message-id/CA%2BhUKGKV6fOHvfiPt8%3DdOKzvswjAyLoFoJF1iQXMNpi7%2BhD1JQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20220116210241.GC756210%40rfd.leadboat.com