Thread: A test for replay of regression tests

A test for replay of regression tests

From
Thomas Munro
Date:
Hi,

We have automated tests for many specific replication and recovery
scenarios, but nothing that tests replay of a wide range of records.
People working on recovery code often use installcheck (presumably
along with other custom tests) to exercise it, sometimes with
wal_consistency_check enabled.  So, why don't we automate that?  Aside
from exercising the WAL decoding machinery (which brought me here),
that'd hopefully provide some decent improvements in coverage of the
various redo routines, many of which are not currently exercised at
all.

I'm not quite sure where it belongs, though.  The attached initial
sketch patch puts it under rc/test/recovery near other similar things,
but I'm not sure if it's really OK to invoke make -C ../regress from
here.  I copied pg_update/test.sh's trick of using a different
outputdir to avoid clobbering a concurrent run under src/test/regress,
and I also needed to invent a way to stop it from running the cursed
tablespace test (deferring startup of the standby also works but eats
way too much space, which I learned after blowing out a smallish
development VM's disk).   Alternatively, we could put it under
src/test/regress, which makes some kind of logical sense, but it would
make a quick "make check" take more than twice as long.  Perhaps it
could use a different target, one that check-world somehow reaches but
not check, and that also doesn't seem great.  Ideas on how to
structure this or improve the perl welcome.

Attachment

RE: A test for replay of regression tests

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Thomas Munro <thomas.munro@gmail.com>
> We have automated tests for many specific replication and recovery scenarios,
> but nothing that tests replay of a wide range of records.
> People working on recovery code often use installcheck (presumably along
> with other custom tests) to exercise it, sometimes with
> wal_consistency_check enabled.  So, why don't we automate that?  Aside
> from exercising the WAL decoding machinery (which brought me here), that'd
> hopefully provide some decent improvements in coverage of the various redo
> routines, many of which are not currently exercised at all.
> 
> I'm not quite sure where it belongs, though.  The attached initial sketch patch

I think that's a good attempt.  It'd be better to confirm that the database contents are identical on the primary and
standby. But... I remember when I ran make installcheck with a standby connected, then ran pg_dumpall on both the
primaryand standby and compare the two output files, about 40 lines of difference were observed.  Those differences
wereall about the sequence values.  I didn't think about whether I should/can remove the differences.
 


+# XXX The tablespace tests don't currently work when the standby shares a
+# filesystem with the primary due to colliding absolute paths.  We'll skip
+# that for now.

Maybe we had better have a hidden feature that creates tablespace contents in

/tablespace_location/PG_..._<some_name>/

<some_name> is the value of cluster_name or application_name.

Or, we may provide a visible feature that allows users to store tablespace contents in a specified directory regardless
ofthe LOCATION value in CREATE TABLESPACE.  For instance, add a GUC like
 

    table_space_directory = '/some_dir'

Then, the tablespace contents are stored in /some_dir/<tablespace_name>/.  This may be useful if a DBaaS provider wants
tooffer some tablespace-based feature, say tablespace transparent data encryption, but doesn't want users to specify
filesystemdirectories.
 


Regards
Takayuki Tsunakawa



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Fri, Apr 23, 2021 at 6:27 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
> From: Thomas Munro <thomas.munro@gmail.com>
> > I'm not quite sure where it belongs, though.  The attached initial sketch patch
>
> I think that's a good attempt.  It'd be better to confirm that the database contents are identical on the primary and
standby. But... I remember when I ran make installcheck with a standby connected, then ran pg_dumpall on both the
primaryand standby and compare the two output files, about 40 lines of difference were observed.  Those differences
wereall about the sequence values.  I didn't think about whether I should/can remove the differences. 

Interesting idea.  I hadn't noticed the thing with sequences before.

> +# XXX The tablespace tests don't currently work when the standby shares a
> +# filesystem with the primary due to colliding absolute paths.  We'll skip
> +# that for now.
>
> Maybe we had better have a hidden feature that creates tablespace contents in
>
> /tablespace_location/PG_..._<some_name>/
>
> <some_name> is the value of cluster_name or application_name.
>
> Or, we may provide a visible feature that allows users to store tablespace contents in a specified directory
regardlessof the LOCATION value in CREATE TABLESPACE.  For instance, add a GUC like 
>
>     table_space_directory = '/some_dir'
>
> Then, the tablespace contents are stored in /some_dir/<tablespace_name>/.  This may be useful if a DBaaS provider
wantsto offer some tablespace-based feature, say tablespace transparent data encryption, but doesn't want users to
specifyfilesystem directories. 

Yeah, a few similar ideas have been put forward before, for example in
this thread:

https://www.postgresql.org/message-id/flat/CALfoeisEF92F5nJ-aAcuWTvF_Aogxq_1bHLem_kVfM_tHc2mfg%40mail.gmail.com

... but also others.  I would definitely like to fix that problem too
(having worked on several things that interact with tablespaces, I
definitely want to be able to extend those tests in future patches,
and get coverage on the build farm and CI), but with --skip-tests that
could be done independently.

Apparently the invocation of make failed for some reason on CI (not
sure why, works for me), but in any case, that feels a bit clunky and
might not ever work on Windows, so perhaps we should figure out how to
invoke the pg_regress[.exe] program directly.



Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2021-04-23 17:37:48 +1200, Thomas Munro wrote:
> We have automated tests for many specific replication and recovery
> scenarios, but nothing that tests replay of a wide range of records.
> People working on recovery code often use installcheck (presumably
> along with other custom tests) to exercise it, sometimes with
> wal_consistency_check enabled.  So, why don't we automate that?  Aside
> from exercising the WAL decoding machinery (which brought me here),
> that'd hopefully provide some decent improvements in coverage of the
> various redo routines, many of which are not currently exercised at
> all.

Yay.


> I'm not quite sure where it belongs, though.  The attached initial
> sketch patch puts it under rc/test/recovery near other similar things,
> but I'm not sure if it's really OK to invoke make -C ../regress from
> here.

I'd say it's not ok, and we should just invoke pg_regress without make.


> Add a new TAP test under src/test/recovery that runs the regression
> tests with wal_consistency_checking=all.

Hm. I wonder if running with wal_consistency_checking=all doesn't also
reduce coverage of some aspects of recovery, by increasing record sizes
etc.


> I copied pg_update/test.sh's trick of using a different
> outputdir to avoid clobbering a concurrent run under src/test/regress,
> and I also needed to invent a way to stop it from running the cursed
> tablespace test (deferring startup of the standby also works but eats
> way too much space, which I learned after blowing out a smallish
> development VM's disk).

That's because you are using wal_consistency_checking=all, right?
Because IIRC we don't generate that much WAL otherwise?


> +# Create some content on primary and check its presence in standby 1
> +$node_primary->safe_psql('postgres',
> +    "CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
> +
> +# Wait for standby to catch up
> +$node_primary->wait_for_catchup($node_standby_1, 'replay',
> +    $node_primary->lsn('insert'));

> +my $result =
> +  $node_standby_1->safe_psql('postgres', "SELECT count(*) FROM tab_int");
> +print "standby 1: $result\n";
> +is($result, qq(1002), 'check streamed content on standby 1');

Why is this needed?



Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Andrew Dunstan
Date:
On 4/23/21 1:37 AM, Thomas Munro wrote:
> Hi,
>
> We have automated tests for many specific replication and recovery
> scenarios, but nothing that tests replay of a wide range of records.
> People working on recovery code often use installcheck (presumably
> along with other custom tests) to exercise it, sometimes with
> wal_consistency_check enabled.  So, why don't we automate that?  Aside
> from exercising the WAL decoding machinery (which brought me here),
> that'd hopefully provide some decent improvements in coverage of the
> various redo routines, many of which are not currently exercised at
> all.
>
> I'm not quite sure where it belongs, though.  The attached initial
> sketch patch puts it under rc/test/recovery near other similar things,
> but I'm not sure if it's really OK to invoke make -C ../regress from
> here.  I copied pg_update/test.sh's trick of using a different
> outputdir to avoid clobbering a concurrent run under src/test/regress,
> and I also needed to invent a way to stop it from running the cursed
> tablespace test (deferring startup of the standby also works but eats
> way too much space, which I learned after blowing out a smallish
> development VM's disk).   Alternatively, we could put it under
> src/test/regress, which makes some kind of logical sense, but it would
> make a quick "make check" take more than twice as long.  Perhaps it
> could use a different target, one that check-world somehow reaches but
> not check, and that also doesn't seem great.  Ideas on how to
> structure this or improve the perl welcome.



Nice, I like adding a skip-tests option to pg_regress.

The perl looks ok - I assume those

    print "standby 1: $result\n";  

lines are there for debugging. Normally you would just process $result
using the Test::More functions


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: A test for replay of regression tests

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2021-04-23 17:37:48 +1200, Thomas Munro wrote:
>> We have automated tests for many specific replication and recovery
>> scenarios, but nothing that tests replay of a wide range of records.

> Yay.

+1

>> Add a new TAP test under src/test/recovery that runs the regression
>> tests with wal_consistency_checking=all.

> Hm. I wonder if running with wal_consistency_checking=all doesn't also
> reduce coverage of some aspects of recovery, by increasing record sizes
> etc.

Yeah.  I found out earlier that wal_consistency_checking=all is an
absolute PIG.  Running the regression tests that way requires tens of
gigabytes of disk space, and a significant amount of time if your
disk is not very speedy.  If we put this into the buildfarm at all,
it would have to be opt-in, not run-by-default, because a lot of BF
animals simply don't have the horsepower.  I think I'd vote against
adding it to check-world, too; the cost/benefit ratio is not good
unless you are specifically working on replay functions.

And as you say, it alters the behavior enough to make it a little
questionable whether we're exercising the "normal" code paths.

The two things I'd say about this are (1) Whether to use
wal_consistency_checking, and with what value, needs to be
easily adjustable. (2) People will want to run other test suites
than the core regression tests, eg contrib modules.

            regards, tom lane



Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2021-04-23 11:53:35 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Hm. I wonder if running with wal_consistency_checking=all doesn't also
> > reduce coverage of some aspects of recovery, by increasing record sizes
> > etc.
> 
> Yeah.  I found out earlier that wal_consistency_checking=all is an
> absolute PIG.  Running the regression tests that way requires tens of
> gigabytes of disk space, and a significant amount of time if your
> disk is not very speedy.  If we put this into the buildfarm at all,
> it would have to be opt-in, not run-by-default, because a lot of BF
> animals simply don't have the horsepower.  I think I'd vote against
> adding it to check-world, too; the cost/benefit ratio is not good
> unless you are specifically working on replay functions.

I think it'd be a huge improvement to test recovery during check-world
by default - it's a huge swath of crucial code that practically has no
test coverage.  I agree that testing by default with
wal_consistency_checking=all isn't feasible due to the time & space
overhead, so I think we should not do that.


> The two things I'd say about this are (1) Whether to use
> wal_consistency_checking, and with what value, needs to be
> easily adjustable. (2) People will want to run other test suites
> than the core regression tests, eg contrib modules.

I'm not really excited about tackling 2) initially. I think it's a
significant issue that we don't have test coverage for most redo
routines and that we should change that with some urgency - even if we
back out these WAL prefetch related changes, there've been sufficiently
many changes affecting WAL that I think it's worth doing the minimal
thing soon.

I don't think there's actually that much need to test contrib modules
through recovery - most of them don't seem like they'd add meaningful
coverage? The exception is contrib/bloom, but perhaps that'd be better
tackled with a dedicated test?

Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2021-04-23 11:53:35 -0400, Tom Lane wrote:
>> Yeah.  I found out earlier that wal_consistency_checking=all is an
>> absolute PIG.  Running the regression tests that way requires tens of
>> gigabytes of disk space, and a significant amount of time if your
>> disk is not very speedy.  If we put this into the buildfarm at all,
>> it would have to be opt-in, not run-by-default, because a lot of BF
>> animals simply don't have the horsepower.  I think I'd vote against
>> adding it to check-world, too; the cost/benefit ratio is not good
>> unless you are specifically working on replay functions.

> I think it'd be a huge improvement to test recovery during check-world
> by default - it's a huge swath of crucial code that practically has no
> test coverage.  I agree that testing by default with
> wal_consistency_checking=all isn't feasible due to the time & space
> overhead, so I think we should not do that.

I was mainly objecting to enabling wal_consistency_checking by default.
I agree it's bad that we have so little routine test coverage on WAL
replay code.

>> The two things I'd say about this are (1) Whether to use
>> wal_consistency_checking, and with what value, needs to be
>> easily adjustable. (2) People will want to run other test suites
>> than the core regression tests, eg contrib modules.

> I don't think there's actually that much need to test contrib modules
> through recovery - most of them don't seem like they'd add meaningful
> coverage? The exception is contrib/bloom, but perhaps that'd be better
> tackled with a dedicated test?

contrib/bloom is indeed the only(?) case within contrib, but in my mind
that's a proxy for what people will be needing to test out-of-core AMs.
It might not be a test to run by default, but there needs to be a way
to do it.

Also, I suspect that there are bits of GIST/GIN/SPGIST that are not
well exercised if you don't run the contrib modules that add opclasses
for those.

            regards, tom lane



Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2021-04-23 13:13:15 -0400, Tom Lane wrote:
> contrib/bloom is indeed the only(?) case within contrib, but in my mind
> that's a proxy for what people will be needing to test out-of-core AMs.
> It might not be a test to run by default, but there needs to be a way
> to do it.

Hm. My experience in the past was that the best way to test an external
AM is to run the core regression tests with a different
default_table_access_method. That does require some work of ensuring the
AM is installed and the relevant extension created, which in turn
requires a different test schedule, or hacking template1.  So likely a
different test script anyway?


> Also, I suspect that there are bits of GIST/GIN/SPGIST that are not
> well exercised if you don't run the contrib modules that add opclasses
> for those.

Possible - still think it'd be best to get the minimal thing in asap,
and then try to extend further afterwards... Perfect being the enemy of
good and all that.

Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Thomas Munro
Date:
Ok, here's a new version incorporating feedback so far.

1.  Invoke pg_regress directly (no make).

2.  Use PG_TEST_EXTRA="wal_consistency_checking" as a way to opt in to
the more expensive test.

3.  Use parallel schedule rather than serial.  It's faster but also
the non-determinism might discover more things.  This required
changing the TAP test max_connections setting from 10 to 25.

4.  Remove some extraneous print statements and
check-if-data-is-replicated-using-SELECT tests that are technically
not needed (I had copied those from 001_stream_rep.pl).

Attachment

Re: A test for replay of regression tests

From
Anastasia Lubennikova
Date:

вт, 8 июн. 2021 г. в 02:25, Thomas Munro <thomas.munro@gmail.com>:
Ok, here's a new version incorporating feedback so far.

1.  Invoke pg_regress directly (no make).

2.  Use PG_TEST_EXTRA="wal_consistency_checking" as a way to opt in to
the more expensive test.

3.  Use parallel schedule rather than serial.  It's faster but also
the non-determinism might discover more things.  This required
changing the TAP test max_connections setting from 10 to 25.

4.  Remove some extraneous print statements and
check-if-data-is-replicated-using-SELECT tests that are technically
not needed (I had copied those from 001_stream_rep.pl).

Thank you for working on this test set!
I was especially glad to see the skip-tests option for pg_regress. I think it will become a very handy tool for hackers.

To try the patch I had to resolve a few merge conflicts, see a rebased version in attachments.

>       auth_extra       => [ '--create-role', 'repl_role' ]);
This line and the comment above it look like some copy-paste artifacts. Did I get it right? If so, I suggest removing them.
Other than that, the patch looks good to me.

--
Best regards,
Lubennikova Anastasia
Attachment

Re: A test for replay of regression tests

From
Anastasia Lubennikova
Date:


вт, 8 июн. 2021 г. в 02:44, Anastasia Lubennikova <lubennikovaav@gmail.com>:

вт, 8 июн. 2021 г. в 02:25, Thomas Munro <thomas.munro@gmail.com>:
Ok, here's a new version incorporating feedback so far.

1.  Invoke pg_regress directly (no make).

2.  Use PG_TEST_EXTRA="wal_consistency_checking" as a way to opt in to
the more expensive test.

3.  Use parallel schedule rather than serial.  It's faster but also
the non-determinism might discover more things.  This required
changing the TAP test max_connections setting from 10 to 25.

4.  Remove some extraneous print statements and
check-if-data-is-replicated-using-SELECT tests that are technically
not needed (I had copied those from 001_stream_rep.pl).

Thank you for working on this test set!
I was especially glad to see the skip-tests option for pg_regress. I think it will become a very handy tool for hackers.

To try the patch I had to resolve a few merge conflicts, see a rebased version in attachments.

>       auth_extra       => [ '--create-role', 'repl_role' ]);
This line and the comment above it look like some copy-paste artifacts. Did I get it right? If so, I suggest removing them.
Other than that, the patch looks good to me.

For some reason, it failed on cfbot, so I'll switch it back to 'Waiting on author'.
BTW, do I get it right, that cfbot CI will need some adjustments to print regression.out for this test?

See one more revision of the patch attached. It contains the following changes:
- rebase to recent main
- removed 'auth_extra' piece, that I mentioned above.
- added lacking make clean and gitignore changes.

--
Best regards,
Lubennikova Anastasia
Attachment

Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Thu, Jun 10, 2021 at 7:37 PM Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:
> вт, 8 июн. 2021 г. в 02:44, Anastasia Lubennikova <lubennikovaav@gmail.com>:
>> Thank you for working on this test set!
>> I was especially glad to see the skip-tests option for pg_regress. I think it will become a very handy tool for
hackers.
>>
>> To try the patch I had to resolve a few merge conflicts, see a rebased version in attachments.
>>
>> >       auth_extra       => [ '--create-role', 'repl_role' ]);
>> This line and the comment above it look like some copy-paste artifacts. Did I get it right? If so, I suggest
removingthem. 
>> Other than that, the patch looks good to me.
>
> For some reason, it failed on cfbot, so I'll switch it back to 'Waiting on author'.
> BTW, do I get it right, that cfbot CI will need some adjustments to print regression.out for this test?
>
> See one more revision of the patch attached. It contains the following changes:
> - rebase to recent main
> - removed 'auth_extra' piece, that I mentioned above.
> - added lacking make clean and gitignore changes.

Thanks!  Yeah, there does seem to be a mysterious CI failure there,
not reproducible locally for me.  You're right that it's not dumping
enough information to diagnose the problem... I will look into it
tomorrow.



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Thu, Jun 10, 2021 at 7:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Jun 10, 2021 at 7:37 PM Anastasia Lubennikova
> <lubennikovaav@gmail.com> wrote:
> > For some reason, it failed on cfbot, so I'll switch it back to 'Waiting on

Sorry for the delay.  I got stuck in a CI loop trying to make a new
improved version work on Windows, so far without success.  If anyone
can tell me what's wrong on that OS I'd be very grateful to hear it.
I don't know if it's something I haven't understood about reparse
points, or something to do with "restricted tokens" and accounts
privileges.  The symptoms on Cirrus are:

   DROP TABLESPACE regress_tblspacewith;
  +WARNING:  could not open directory
"pg_tblspc/16386/PG_15_202109101": No such file or directory
  +WARNING:  could not stat file "pg_tblspc/16386": No such file or directory

The short explanation of this new version is that the tablespace tests
now work on a pair of local nodes because you can do this sort of
thing:

postgres=# create tablespace xxx location 'pg_user_files/xxx';
ERROR:  directory "pg_user_files/xxx" does not exist
postgres=# create tablespace xxx new location 'pg_user_files/xxx';
CREATE TABLESPACE

Patches:

0001: Allow restricted relative tablespace paths.

Rationale:  I really want to be able to run the tablespace test with a
local replica, instead of just skipping it (including but not only
from this new TAP test).  After re-reading a bunch of threads about
how to achieve that that never went anywhere and considering various
ideas already presented, I wondered if we could agree on allowing
relative paths under one specific directory "pg_user_files" (a
directory that PostgreSQL itself will completely ignore).  Better
names welcome.

I wonder if for Windows we'd want to switch to real symlinks, since,
as far as I know from some light reading, reparse points are converted
to absolute paths on creation, so the pgdata directory would be less
portable than it would be on POSIX systems.

0002: CREATE TABLESPACE ... NEW LOCATION.

The new syntax "NEW" says that it's OK if the directory doesn't exist
yet, we'll just create it.

Rationale:  With relative paths, it's tricky for pg_regress to find
the data directory of the primary server + any streaming replicas that
may be downstream from it (and possibly remote) to create the
directory, but the server can do it easily.  Better syntax welcome.
(I originally wanted to use WITH (<something>) but that syntax is
tangled up with persistent relopts.)

0003: Use relative paths for tablespace regression test.

Remove the pg_regress logic for creating the directory, and switch to
relative paths using the above.

0004: Test replay of regression tests.

Same as before, this adds a replicated run of the regression tests in
src/test/recovery/t/027_stream_regress.pl, with an optional expensive
mode that you can enable with
PG_TEST_EXTRA="wal_consistency_checking".

I removed the useless --create-role as pointed out by Anastasia.

I added a step to compare the contents of the primary and replica
servers with pg_dump, as suggested by Tsunakawa-san.

I think the way I pass in the psql source directory to --bindir is not
good, but I've reached my daily limit of Perl; how should I be
specifying the tmp_install bin directory here?  This is so pg_regress
can find psql.

system_or_bail("../regress/pg_regress",
               "--bindir=../../bin/psql",
               "--port=" . $node_primary->port,
               "--schedule=../regress/parallel_schedule",
               "--dlpath=../regress",
               "--inputdir=../regress");

Attachment

Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Wed, Oct 6, 2021 at 7:10 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I wonder if for Windows we'd want to switch to real symlinks, since,
> as far as I know from some light reading, reparse points are converted
> to absolute paths on creation, so the pgdata directory would be less
> portable than it would be on POSIX systems.

I looked into that a bit, and it seems that the newer "real" symlinks
can't be created without admin privileges, unlike junction points, so
that wouldn't help.  I still need to figure out what this
junction-based patch set is doing wrong on Windows though trial-by-CI.
In the meantime, here is a rebase over recent changes to the Perl
testing APIs.

Attachment

Re: A test for replay of regression tests

From
Andrew Dunstan
Date:
On 11/23/21 04:07, Thomas Munro wrote:
> On Wed, Oct 6, 2021 at 7:10 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> I wonder if for Windows we'd want to switch to real symlinks, since,
>> as far as I know from some light reading, reparse points are converted
>> to absolute paths on creation, so the pgdata directory would be less
>> portable than it would be on POSIX systems.
> I looked into that a bit, and it seems that the newer "real" symlinks
> can't be created without admin privileges, unlike junction points, so
> that wouldn't help.  I still need to figure out what this
> junction-based patch set is doing wrong on Windows though trial-by-CI.
> In the meantime, here is a rebase over recent changes to the Perl
> testing APIs.


More exactly you need to "Create Symbolic Links" privilege. see
<https://github.com/git-for-windows/git/wiki/Symbolic-Links>


I haven't yet worked out a way of granting that from the command line,
but if it's working the buildfarm client (as of git tip) will use it on
windows for the workdirs space saving feature.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: A test for replay of regression tests

From
Andrew Dunstan
Date:
On 11/23/21 10:47, Andrew Dunstan wrote:
> On 11/23/21 04:07, Thomas Munro wrote:
>> On Wed, Oct 6, 2021 at 7:10 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>>> I wonder if for Windows we'd want to switch to real symlinks, since,
>>> as far as I know from some light reading, reparse points are converted
>>> to absolute paths on creation, so the pgdata directory would be less
>>> portable than it would be on POSIX systems.
>> I looked into that a bit, and it seems that the newer "real" symlinks
>> can't be created without admin privileges, unlike junction points, so
>> that wouldn't help.  I still need to figure out what this
>> junction-based patch set is doing wrong on Windows though trial-by-CI.
>> In the meantime, here is a rebase over recent changes to the Perl
>> testing APIs.
>
> More exactly you need to "Create Symbolic Links" privilege. see
> <https://github.com/git-for-windows/git/wiki/Symbolic-Links>
>
>
> I haven't yet worked out a way of granting that from the command line,
> but if it's working the buildfarm client (as of git tip) will use it on
> windows for the workdirs space saving feature.


Update:

There is a PowerShell module called Carbon which provides this and a
whole lot more. It can be installed in numerous ways, including via
Chocolatey. Here's what I am using:

    choco install -y Carbon
    Import-Module Carbon
    Grant-CPrivilege -Identity myuser -Privilege SeCreateSymbolicLinkPrivilege

See <https://get-carbon.org/Grant-Privilege.html> The command name I
used above is now the preferred spelling, although that's not reflected
on the manual page.


cheers


andrew

-- 

Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Tue, Nov 30, 2021 at 3:14 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>     choco install -y Carbon
>     Import-Module Carbon
>     Grant-CPrivilege -Identity myuser -Privilege SeCreateSymbolicLinkPrivilege

Interesting.  Well, I found the problem with my last patch (to wit:
junction points must be absolute, unlike real symlinks, which I'd
considered already but I missed that tmp_check's DataDir had a stray
internal \.\), and now I'm wondering whether these newer real symlinks
could help.  The constraints are pretty hard to work with... I thought
about a couple of options:

1.  We could try to use real symlinks, and fall back to junction
points if that fails.  That means that these new tests I'm proposing
would fail unless you grant that privilege or run in developer mode as
you were saying.  It bothers me a bit that developers and the BF would
be testing a different code path than production databases run...
unless you're thinking we should switch to symlinks with no fallback,
and require that privilege to be granted by end users to production
servers at least if they want to use tablespaces, and also drop
pre-Win10 support in the same release?  That's bigger than I was
thinking.

2.  We could convert relative paths to absolute paths at junction
point creation time, which I tried, and "check" now passes.  Problems:
(1) now you can't move pgdata around, (2) the is-the-path-too-long
check performed on a primary is not sufficient to check if the
transformed absolute path will be too long on a replica.

The most conservative simple idea I have so far is: go with option 2,
but make this whole thing an undocumented developer-only mode, and
turn it on in the regression tests.  Here's a patch set like that.
Thoughts?

Another option would be to stop using operating system symlinks, and
build the target paths ourselves; I didn't investigate that as it
seemed like a bigger change than this warrants.

Next problem:  The below is clearly not the right way to find the
pg_regress binary and bindir, and doesn't work on Windows or VPATH.
Any suggestions for how to do this?  I feel like something like
$node->installed_command() or similar logic is needed...

# Run the regression tests against the primary.
# XXX How should we find the pg_regress binary and bindir?
system_or_bail("../regress/pg_regress",
               "--bindir=../../bin/psql",
               "--port=" . $node_primary->port,
               "--schedule=../regress/parallel_schedule",
               "--dlpath=../regress",
               "--inputdir=../regress");

BTW 0002 is one of those renaming patches from git that GNU patch
doesn't seem to apply correctly, sorry cfbot...

Attachment

Re: A test for replay of regression tests

From
Andrew Dunstan
Date:
On 12/3/21 23:21, Thomas Munro wrote:
>
> Next problem:  The below is clearly not the right way to find the
> pg_regress binary and bindir, and doesn't work on Windows or VPATH.
> Any suggestions for how to do this?  I feel like something like
> $node->installed_command() or similar logic is needed...
>
> # Run the regression tests against the primary.
> # XXX How should we find the pg_regress binary and bindir?
> system_or_bail("../regress/pg_regress",
>                "--bindir=../../bin/psql",
>                "--port=" . $node_primary->port,
>                "--schedule=../regress/parallel_schedule",
>                "--dlpath=../regress",
>                "--inputdir=../regress");
>

TAP tests are passed a path to pg_regress as $ENV{PG_REGRESS}. You
should be using that. On non-MSVC, the path to a non-installed psql is
in this case  "$TESTDIR/../../bin/psql" - this should work for VPATH
builds as well as non-VPATH. On MSVC it's a bit harder - it's
"$top_builddir/$releasetype/psql" but we don't expose that. Perhaps we
should. c.f. commit f4ce6c4d3a


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Sun, Dec 5, 2021 at 4:16 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> TAP tests are passed a path to pg_regress as $ENV{PG_REGRESS}. You
> should be using that. On non-MSVC, the path to a non-installed psql is
> in this case  "$TESTDIR/../../bin/psql" - this should work for VPATH
> builds as well as non-VPATH. On MSVC it's a bit harder - it's
> "$top_builddir/$releasetype/psql" but we don't expose that. Perhaps we
> should. c.f. commit f4ce6c4d3a

Thanks, that helped.  Here's a new version that passes on Windows,
Unix and Unix with VPATH.  I also had to figure out where the DLLs
are, and make sure that various output files go to the build
directory, not source directory, if different, which I did by passing
down another similar environment variable.  Better ideas?  (It
confused me for some time that make follows the symlink and runs the
perl code from inside the source directory.)

This adds 2 whole minutes to the recovery check, when running with the
Windows serial-only scripts on Cirrus CI (using Andres's CI patches).
For Linux it adds ~20 seconds to the total of -j8 check-world.
Hopefully that's time well spent, because it adds test coverage for
all the redo routines, and hopefully soon we won't have to run 'em in
series on Windows.

Does anyone want to object to the concept of the "pg_user_files"
directory or the developer-only GUC "allow_relative_tablespaces"?
There's room for discussion about names; maybe initdb shouldn't create
the directory unless you ask it to, or something.

Attachment

Re: A test for replay of regression tests

From
Andrew Dunstan
Date:
On 12/8/21 18:10, Thomas Munro wrote:
> On Sun, Dec 5, 2021 at 4:16 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> TAP tests are passed a path to pg_regress as $ENV{PG_REGRESS}. You
>> should be using that. On non-MSVC, the path to a non-installed psql is
>> in this case  "$TESTDIR/../../bin/psql" - this should work for VPATH
>> builds as well as non-VPATH. On MSVC it's a bit harder - it's
>> "$top_builddir/$releasetype/psql" but we don't expose that. Perhaps we
>> should. c.f. commit f4ce6c4d3a
> Thanks, that helped.  Here's a new version that passes on Windows,
> Unix and Unix with VPATH.  I also had to figure out where the DLLs
> are, and make sure that various output files go to the build
> directory, not source directory, if different, which I did by passing
> down another similar environment variable.  Better ideas?  (It
> confused me for some time that make follows the symlink and runs the
> perl code from inside the source directory.)


The new version appears to set an empty --bindir for pg_regress. Is that
right?


> This adds 2 whole minutes to the recovery check, when running with the
> Windows serial-only scripts on Cirrus CI (using Andres's CI patches).
> For Linux it adds ~20 seconds to the total of -j8 check-world.
> Hopefully that's time well spent, because it adds test coverage for
> all the redo routines, and hopefully soon we won't have to run 'em in
> series on Windows.
>
> Does anyone want to object to the concept of the "pg_user_files"
> directory or the developer-only GUC "allow_relative_tablespaces"?
> There's room for discussion about names; maybe initdb shouldn't create
> the directory unless you ask it to, or something.


I'm slightly worried that some bright spark will discover it and think
it's a good idea for a production setup.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2021-12-09 08:12:14 -0500, Andrew Dunstan wrote:
> > Does anyone want to object to the concept of the "pg_user_files"
> > directory or the developer-only GUC "allow_relative_tablespaces"?
> > There's room for discussion about names; maybe initdb shouldn't create
> > the directory unless you ask it to, or something.

Personally I'd rather put relative tablespaces into a dedicated directory or
just into pg_tblspc, but without a symlink. Some tools need to understand
tablespace layout etc, and having them in a directory that, by the name, will
also contain other things seems likely to cause confusion.


> I'm slightly worried that some bright spark will discover it and think
> it's a good idea for a production setup.

It'd not really be worse than the current situation of accidentally corrupting
a local replica or such :/.

Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Fri, Dec 10, 2021 at 2:12 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> The new version appears to set an empty --bindir for pg_regress. Is that
> right?

It seems to be necessary to find eg psql, since --bindir='' means
"expect $PATH to contain the installed binaries", and that's working
on both build systems.  The alternative would be to export yet another
environment variable, $PG_INSTALL or such -- do you think that'd be
better, or did I miss something that exists already like that?



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Fri, Dec 10, 2021 at 8:38 AM Andres Freund <andres@anarazel.de> wrote:
> On 2021-12-09 08:12:14 -0500, Andrew Dunstan wrote:
> > > Does anyone want to object to the concept of the "pg_user_files"
> > > directory or the developer-only GUC "allow_relative_tablespaces"?
> > > There's room for discussion about names; maybe initdb shouldn't create
> > > the directory unless you ask it to, or something.
>
> Personally I'd rather put relative tablespaces into a dedicated directory or
> just into pg_tblspc, but without a symlink. Some tools need to understand
> tablespace layout etc, and having them in a directory that, by the name, will
> also contain other things seems likely to cause confusion.

Alright, let me try it that way...  more soon.



Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2021-12-09 12:10:23 +1300, Thomas Munro wrote:
> From a60ada37f3ff7d311d56fe453b2abeadf0b350e5 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Wed, 4 Aug 2021 22:17:54 +1200
> Subject: [PATCH v8 2/5] Use relative paths for tablespace regression test.
> 
> Remove the machinery from pg_regress that manages the testtablespace
> directory.  Instead, use a relative path.
> 
> Discussion: https://postgr.es/m/CA%2BhUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk%2BHeskxg%40mail.gmail.com

Seems like we ought to add a tiny tap test or such for this - otherwise we
won't have any coverage of "normal" tablespaces? I don't think they need to be
exercised as part of the normal tests, but having some very basic testing
in a tap test seems worthwhile.


> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
> index 9467a199c8..5cfa137cde 100644
> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
> @@ -460,7 +460,7 @@ sub init
>          print $conf "hot_standby = on\n";
>          # conservative settings to ensure we can run multiple postmasters:
>          print $conf "shared_buffers = 1MB\n";
> -        print $conf "max_connections = 10\n";
> +        print $conf "max_connections = 25\n";
>          # limit disk space consumption, too:
>          print $conf "max_wal_size = 128MB\n";
>      }

What's the relation of this to the rest?


> +# Perform a logical dump of primary and standby, and check that they match
> +command_ok(
> +    [ "pg_dump", '-f', $outputdir . '/primary.dump', '--no-sync',
> +      '-p', $node_primary->port, 'regression' ],
> +    "dump primary server");
> +command_ok(
> +    [ "pg_dump", '-f', $outputdir . '/standby.dump', '--no-sync',
> +      '-p', $node_standby_1->port, 'regression' ],
> +    "dump standby server");
> +command_ok(
> +    [ "diff", $outputdir . '/primary.dump', $outputdir . '/standby.dump' ],
> +    "compare primary and standby dumps");
> +

Absurd nitpick: What's the deal with using "" for one part, and '' for the
rest?

Separately: I think the case of seeing diffs will be too hard to debug like
this, as the difference isn't shown afaict?

Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Fri, Dec 10, 2021 at 10:36 AM Andres Freund <andres@anarazel.de> wrote:
> Seems like we ought to add a tiny tap test or such for this - otherwise we
> won't have any coverage of "normal" tablespaces? I don't think they need to be
> exercised as part of the normal tests, but having some very basic testing
> in a tap test seems worthwhile.

Good idea, that was bothering me too.  Done.

> > -             print $conf "max_connections = 10\n";
> > +             print $conf "max_connections = 25\n";

> What's the relation of this to the rest?

Someone decided that allow_streaming should imply max_connections =
10, but we need ~20 to run the parallel regression test schedule.
However, I can just as easily move that to a local adjustment in the
TAP test file.  Done, like so:

+$node_primary->adjust_conf('postgresql.conf', 'max_connections', '25', 1);

> Absurd nitpick: What's the deal with using "" for one part, and '' for the
> rest?

Fixed.

> Separately: I think the case of seeing diffs will be too hard to debug like
> this, as the difference isn't shown afaict?

Seems to be OK.  The output goes to
src/test/recovery/tmp_check/log/regress_log_027_stream_regress, so for
example if you comment out the bit that deals with SEQUENCE caching
you'll see:

# Running: pg_dump -f
/usr/home/tmunro/projects/postgresql/src/test/recovery/primary.dump
--no-sync -p 63693 regression
ok 2 - dump primary server
# Running: pg_dump -f
/usr/home/tmunro/projects/postgresql/src/test/recovery/standby.dump
--no-sync -p 63694 regression
ok 3 - dump standby server
# Running: diff
/usr/home/tmunro/projects/postgresql/src/test/recovery/primary.dump
/usr/home/tmunro/projects/postgresql/src/test/recovery/standby.dump
436953c436953
< SELECT pg_catalog.setval('public.clstr_tst_s_rf_a_seq', 32, true);
---
> SELECT pg_catalog.setval('public.clstr_tst_s_rf_a_seq', 33, true);
... more hunks ...

And from the previous email:

On Fri, Dec 10, 2021 at 10:35 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Dec 10, 2021 at 8:38 AM Andres Freund <andres@anarazel.de> wrote:
> > Personally I'd rather put relative tablespaces into a dedicated directory or
> > just into pg_tblspc, but without a symlink. Some tools need to understand
> > tablespace layout etc, and having them in a directory that, by the name, will
> > also contain other things seems likely to cause confusion.

Ok, in this version I have a developer-only GUC
allow_in_place_tablespaces instead.  If you turn it on, you can do:

CREATE TABLESPACE regress_blah LOCATION = '';

... and then pg_tblspc/OID is created directly as a directory.  Not
allowed by default or documented.

Attachment

Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Fri, Dec 10, 2021 at 12:58 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> +$node_primary->adjust_conf('postgresql.conf', 'max_connections', '25', 1);

Erm, in fact this requirement came about in an earlier version where I
was invoking make and getting --max-concurrent-tests=20 from
src/test/regress/GNUmakefile.  Which I should probably replicate
here...



Re: A test for replay of regression tests

From
Andrew Dunstan
Date:
On 12/9/21 15:15, Thomas Munro wrote:
> On Fri, Dec 10, 2021 at 2:12 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> The new version appears to set an empty --bindir for pg_regress. Is that
>> right?
> It seems to be necessary to find eg psql, since --bindir='' means
> "expect $PATH to contain the installed binaries", and that's working
> on both build systems.  The alternative would be to export yet another
> environment variable, $PG_INSTALL or such -- do you think that'd be
> better, or did I miss something that exists already like that?



No, that seems ok. Might be worth a comment.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2021-12-10 12:58:01 +1300, Thomas Munro wrote:
> > What's the relation of this to the rest?
> 
> Someone decided that allow_streaming should imply max_connections =
> 10, but we need ~20 to run the parallel regression test schedule.
> However, I can just as easily move that to a local adjustment in the
> TAP test file.  Done, like so:
> 
> +$node_primary->adjust_conf('postgresql.conf', 'max_connections', '25', 1);

Possible that this will cause problem on some *BSD platform with a limited
count of semaphores. But we can deal with that if / when it happens.



> > Separately: I think the case of seeing diffs will be too hard to debug like
> > this, as the difference isn't shown afaict?
> 
> Seems to be OK.  The output goes to
> src/test/recovery/tmp_check/log/regress_log_027_stream_regress, so for
> example if you comment out the bit that deals with SEQUENCE caching
> you'll see:

Ah, ok. Not sure what I thought there...


> On Fri, Dec 10, 2021 at 10:35 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Fri, Dec 10, 2021 at 8:38 AM Andres Freund <andres@anarazel.de> wrote:
> > > Personally I'd rather put relative tablespaces into a dedicated directory or
> > > just into pg_tblspc, but without a symlink. Some tools need to understand
> > > tablespace layout etc, and having them in a directory that, by the name, will
> > > also contain other things seems likely to cause confusion.
> 
> Ok, in this version I have a developer-only GUC
> allow_in_place_tablespaces instead.  If you turn it on, you can do:
> 
> CREATE TABLESPACE regress_blah LOCATION = '';

> ... and then pg_tblspc/OID is created directly as a directory.  Not
> allowed by default or documented.

WFM. I think we might eventually want to allow it by default, but we can deal
with that whenever somebody wants to spend the energy doing so.



> @@ -590,16 +595,35 @@ create_tablespace_directories(const char *location, const Oid tablespaceoid)
>      char       *linkloc;
>      char       *location_with_version_dir;
>      struct stat st;
> +    bool        in_place;
>  
>      linkloc = psprintf("pg_tblspc/%u", tablespaceoid);
> -    location_with_version_dir = psprintf("%s/%s", location,
> +
> +    /*
> +     * If we're asked to make an 'in place' tablespace, create the directory
> +     * directly where the symlink would normally go.  This is a developer-only
> +     * option for now, to facilitate regression testing.
> +     */
> +    in_place =
> +        (allow_in_place_tablespaces || InRecovery) && strlen(location) == 0;

Why is in_place set to true by InRecovery?

ISTM that allow_in_place_tablespaces should be checked in CreateTableSpace(),
and create_tablespace_directories() should just do whatever it's told?
Otherwise it seems there's ample potential for confusion, e.g. because of the
GUC differing between primary and replica, or between crash and a new start.


> +    if (in_place)
> +    {
> +        if (MakePGDirectory(linkloc) < 0 && errno != EEXIST)
> +            ereport(ERROR,
> +                    (errcode_for_file_access(),
> +                     errmsg("could not create directory \"%s\": %m",
> +                            linkloc)));
> +    }
> +
> +    location_with_version_dir = psprintf("%s/%s", in_place ? linkloc : location,
>                                           TABLESPACE_VERSION_DIRECTORY);
>  
>      /*
>       * Attempt to coerce target directory to safe permissions.  If this fails,
>       * it doesn't exist or has the wrong owner.
>       */
> -    if (chmod(location, pg_dir_create_mode) != 0)
> +    if (!in_place && chmod(location, pg_dir_create_mode) != 0)
>      {
>          if (errno == ENOENT)
>              ereport(ERROR,

Maybe add a coment saying that we don't need to chmod here because
MakePGDirectory() takes care of that?


> @@ -648,13 +672,13 @@ create_tablespace_directories(const char *location, const Oid tablespaceoid)
>      /*
>       * In recovery, remove old symlink, in case it points to the wrong place.
>       */
> -    if (InRecovery)
> +    if (!in_place && InRecovery)
>          remove_tablespace_symlink(linkloc);

I don't think it's right to check !in_place as you do here, given that it
currently depends on a GUC setting that's possibly differs between WAL
generation and replay time?


> --- a/src/test/regress/output/tablespace.source
> +++ b/src/test/regress/expected/tablespace.out
> @@ -1,7 +1,18 @@
> +-- relative tablespace locations are not allowed
> +CREATE TABLESPACE regress_tblspace LOCATION 'relative'; -- fail
> +ERROR:  tablespace location must be an absolute path
> +-- empty tablespace locations are not usually allowed
> +CREATE TABLESPACE regress_tblspace LOCATION ''; -- fail
> +ERROR:  tablespace location must be an absolute path
> +-- as a special developer-only option to allow us to use tablespaces
> +-- with streaming replication on the same server, an empty location
> +-- can be allowed as a way to say that the tablespace should be created
> +-- as a directory in pg_tblspc, rather than being a symlink
> +SET allow_in_place_tablespaces = true;
>  -- create a tablespace using WITH clause
> -CREATE TABLESPACE regress_tblspacewith LOCATION '@testtablespace@' WITH (some_nonexistent_parameter = true); --
fail
> +CREATE TABLESPACE regress_tblspacewith LOCATION '' WITH (some_nonexistent_parameter = true); -- fail
>  ERROR:  unrecognized parameter "some_nonexistent_parameter"
> -CREATE TABLESPACE regress_tblspacewith LOCATION '@testtablespace@' WITH (random_page_cost = 3.0); -- ok
> +CREATE TABLESPACE regress_tblspacewith LOCATION '' WITH (random_page_cost = 3.0); -- ok

Perhaps also add a test that we catch "in-place" tablespace creation with
allow_in_place_tablespaces = false? Although perhaps that's better done in the
previous commit...


> +++ b/src/test/modules/test_misc/t/002_tablespace.pl

Two minor things that I think would be worth testing here:
1) moving between two "absolute" tablespaces. That could conceivably break differently
   between in-place and relative tablespaces.
2) Moving between absolute and relative tablespace.


> +# required for 027_stream_regress.pl
> +REGRESS_OUTPUTDIR=$(abs_top_builddir)/src/test/recovery
> +export REGRESS_OUTPUTDIR

Why do we need this?


> +# Initialize primary node
> +my $node_primary = PostgreSQL::Test::Cluster->new('primary');
> +$node_primary->init(allows_streaming => 1);
> +$node_primary->adjust_conf('postgresql.conf', 'max_connections', '25', 1);

Probably should set at least max_prepared_transactions > 0, so the tests
requiring prepared xacts can work. They have nontrivial replay routines, so
that seems worthwhile?


> +# Perform a logical dump of primary and standby, and check that they match
> +command_ok(
> +    [ 'pg_dump', '-f', $outputdir . '/primary.dump', '--no-sync',
> +      '-p', $node_primary->port, 'regression' ],
> +    'dump primary server');
> +command_ok(
> +    [ 'pg_dump', '-f', $outputdir . '/standby.dump', '--no-sync',
> +      '-p', $node_standby_1->port, 'regression' ],
> +    'dump standby server');
> +command_ok(
> +    [ 'diff', $outputdir . '/primary.dump', $outputdir . '/standby.dump' ],
> +    'compare primary and standby dumps');
> +
> +$node_standby_1->stop;
> +$node_primary->stop;

This doesn't verify if global objects are replayed correctly. Perhaps it'd be
better to use pg_dumpall?

Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Michael Paquier
Date:
On Fri, Dec 10, 2021 at 12:58:01PM +1300, Thomas Munro wrote:
> -# required for 017_shm.pl
> +# required for 017_shm.pl and 027_stream_regress.pl
>  REGRESS_SHLIB=$(abs_top_builddir)/src/test/regress/regress$(DLSUFFIX)
>  export REGRESS_SHLIB

Hmm.  FWIW, I am working on doing similar for pg_upgrade to switch to
TAP there, and we share a lot in terms of running pg_regress on an
exising cluster.  Wouldn't it be better to move this definition to
src/Makefile.global.in rather than src/test/recovery/?

My pg_regress command is actually very similar to yours, so I am
wondering if this would be better if somehow centralized, perhaps in
Cluster.pm.
--
Michael

Attachment

Re: A test for replay of regression tests

From
Michael Paquier
Date:
On Wed, Dec 15, 2021 at 05:50:45PM +0900, Michael Paquier wrote:
> Hmm.  FWIW, I am working on doing similar for pg_upgrade to switch to
> TAP there, and we share a lot in terms of running pg_regress on an
> exising cluster.  Wouldn't it be better to move this definition to
> src/Makefile.global.in rather than src/test/recovery/?
>
> My pg_regress command is actually very similar to yours, so I am
> wondering if this would be better if somehow centralized, perhaps in
> Cluster.pm.

By the way, while I was sorting out my things, I have noticed that v4
does not handle EXTRA_REGRESS_OPT.  Is that wanted?  You could just
add that into your patch set and push the extra options to the
pg_regress command:
my $extra_opts_val = $ENV{EXTRA_REGRESS_OPT} || "";
my @extra_opts     = split(/\s+/, $extra_opts_val);
--
Michael

Attachment

Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Wed, Dec 22, 2021 at 11:41 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Rebased and updated based on feedback.  Responses to multiple emails below:

Pushed, but the build farm doesn't like it with a couple of different
ways of failing.  I'll collect some results and revert shortly.



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Sat, Jan 15, 2022 at 12:39 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Dec 22, 2021 at 11:41 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Rebased and updated based on feedback.  Responses to multiple emails below:
>
> Pushed, but the build farm doesn't like it with a couple of different
> ways of failing.  I'll collect some results and revert shortly.

Problems:

1. The way I invoke pg_regress doesn't seem to work correctly under
the build farm client (though it works fine under make), not sure why
yet, but reproduced here and will figure it out tomorrow.
2. The new test in src/test/modules/t/002_tablespace.pl apparently has
some path-related problem on Windows that I didn't know about, because
CI isn't even running the TAP tests under src/test/module/test_misc
(and various other locations), but the BF is :-/  And I was happy
because modulescheck was passing...

I reverted the two commits responsible for those failures to keep the
build farm green, and I'll try to fix them tomorrow.



Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2022-01-15 02:32:35 +1300, Thomas Munro wrote:
> 1. The way I invoke pg_regress doesn't seem to work correctly under
> the build farm client (though it works fine under make), not sure why
> yet, but reproduced here and will figure it out tomorrow.

I think it's just a problem of the buildfarm specifying port names in
extra_opts. E.g.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=eelpout&dt=2022-01-14%2011%3A49%3A36
has

# Checking port 58074
# Found port 58074
Name: primary
...
# Running: /home/tmunro/build-farm/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/regress/pg_regress
--dlpath="/home/tmunro/build-farm/buildroot/HEAD/pgsql.build/src/test/regress"--bindir= --port=58074
--schedule=../regress/parallel_schedule--max-concurrent-tests=20 --inputdir=../regress
--outputdir="/home/tmunro/build-farm/buildroot/HEAD/pgsql.build/src/test/recovery"--port=5678
 
(using postmaster on /tmp/1W6qVPVyCv, port 5678)

Note how there's both --port=58074 and --port=5678 in the pg_regress
invocation. The latter comming from EXTRA_REGRESS_OPTS, which the buildfarm
client sets.

The quickest fix would probably be to just move the 027_stream_regress.pl
added --port until after $extra_opts?


> 2. The new test in src/test/modules/t/002_tablespace.pl apparently has
> some path-related problem on Windows

This is the failure you're talking about?

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2022-01-14%2012%3A04%3A55


> that I didn't know about, because CI isn't even running the TAP tests under
> src/test/module/test_misc (and various other locations), but the BF is :-/
> And I was happy because modulescheck was passing...

This we need to fix... But if you're talking about fairywren's failure, it's
more than not running some tests, it's that we do not test windows mingw
outside of cross compilation.

Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Sat, Jan 15, 2022 at 12:49 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-15 02:32:35 +1300, Thomas Munro wrote:
> > 1. The way I invoke pg_regress doesn't seem to work correctly under
> > the build farm client (though it works fine under make), not sure why
> > yet, but reproduced here and will figure it out tomorrow.
>
> I think it's just a problem of the buildfarm specifying port names in
> extra_opts. E.g.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=eelpout&dt=2022-01-14%2011%3A49%3A36
> has
>
> # Checking port 58074
> # Found port 58074
> Name: primary
> ...
> # Running: /home/tmunro/build-farm/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/regress/pg_regress
--dlpath="/home/tmunro/build-farm/buildroot/HEAD/pgsql.build/src/test/regress"--bindir= --port=58074
--schedule=../regress/parallel_schedule--max-concurrent-tests=20 --inputdir=../regress
--outputdir="/home/tmunro/build-farm/buildroot/HEAD/pgsql.build/src/test/recovery"--port=5678 
> (using postmaster on /tmp/1W6qVPVyCv, port 5678)
>
> Note how there's both --port=58074 and --port=5678 in the pg_regress
> invocation. The latter comming from EXTRA_REGRESS_OPTS, which the buildfarm
> client sets.
>
> The quickest fix would probably be to just move the 027_stream_regress.pl
> added --port until after $extra_opts?

Thanks, I figured it was an environment variable biting me, and indeed
it was that one.  I reordered the arguments, tested locally under the
buildfarm client script, and pushed.  I'll keep an eye on the build
farm.

One thing I noticed is that the pg_dump output files should really be
rm'd by the clean target; I may push something for that later.

> > 2. The new test in src/test/modules/t/002_tablespace.pl apparently has
> > some path-related problem on Windows
>
> This is the failure you're talking about?
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2022-01-14%2012%3A04%3A55
>
> > that I didn't know about, because CI isn't even running the TAP tests under
> > src/test/module/test_misc (and various other locations), but the BF is :-/
> > And I was happy because modulescheck was passing...
>
> This we need to fix... But if you're talking about fairywren's failure, it's
> more than not running some tests, it's that we do not test windows mingw
> outside of cross compilation.

I'm temporarily stumped by complete ignorance of MSYS.  I tried the
test on plain old Windows/MSVC by cherry-picking the reverted commit
d1511fe1 and running .\src\tools\msvc\vcregress.bat taptest
.\src\test\modules\test_misc in my Windows 10 VM, and that passed with
flying colours (so Windows CI would have passed too, if we weren't
ignoring TAP tests in unusual locations, I assume).  I'll look into
installing MSYS to work this out if necessary, but it may take me a
few days.

Here's how it failed on fairywren, in case someone knowledgeable of
MSYS path translation etc can spot the problem:

psql:<stdin>:1: ERROR:  directory
"/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/modules/test_misc/tmp_check/t_002_tablespace_main_data/ts1"
does not exist
not ok 1 - create tablespace with absolute path

I think that means chmod() failed with ENOENT.  That's weird, because
the .pl does:

+my $TS1_LOCATION = $node->basedir() . "/ts1";
+mkdir($TS1_LOCATION);



Re: A test for replay of regression tests

From
Noah Misch
Date:
On Mon, Jan 17, 2022 at 05:25:19PM +1300, Thomas Munro wrote:
> Here's how it failed on fairywren, in case someone knowledgeable of
> MSYS path translation etc can spot the problem:
> 
> psql:<stdin>:1: ERROR:  directory
> "/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/modules/test_misc/tmp_check/t_002_tablespace_main_data/ts1"
> does not exist
> not ok 1 - create tablespace with absolute path
> 
> I think that means chmod() failed with ENOENT.  That's weird, because
> the .pl does:
> 
> +my $TS1_LOCATION = $node->basedir() . "/ts1";
> +mkdir($TS1_LOCATION);

You likely need a PostgreSQL::Test::Utils::perl2host() call.  MSYS Perl
understands Cygwin-style names like /home/... as well as Windows-style names,
but this PostgreSQL configuration understands only Windows-style names.



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Mon, Jan 17, 2022 at 6:53 PM Noah Misch <noah@leadboat.com> wrote:
> On Mon, Jan 17, 2022 at 05:25:19PM +1300, Thomas Munro wrote:
> > Here's how it failed on fairywren, in case someone knowledgeable of
> > MSYS path translation etc can spot the problem:

> You likely need a PostgreSQL::Test::Utils::perl2host() call.  MSYS Perl
> understands Cygwin-style names like /home/... as well as Windows-style names,
> but this PostgreSQL configuration understands only Windows-style names.

Thanks.  I added that and pushed.  Let's see if fairywren likes it
when it comes back online.

I also learned that in the CI environment, node->basedir() is a path
containing an internal "." component (I mean "something/./something").
I added a regex to collapse those, because they're unacceptable in
Windows junction point targets.  I'm aware that there is something
happening in another CF entry that might address that sort of
thing[1], so then perhaps I could remove the kludge.

I tested that with a throw-away change to .cirrus.yml, like so.  The
CI thread[2] is discussing a proper solution to these Windows CI blind
spots.

   test_modules_script:
     - perl src/tools/msvc/vcregress.pl modulescheck
+    - perl src/tools/msvc/vcregress.pl taptest ./src/test/modules/test_misc

[1] https://commitfest.postgresql.org/36/3331/
[2] https://www.postgresql.org/message-id/20220114235457.GQ14051%40telsasoft.com



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Fri, Jan 21, 2022 at 3:42 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Thanks.  I added that and pushed.  Let's see if fairywren likes it
> when it comes back online.

A watched pot never boils, but I wonder why Andrew's 4 Windows
configurations jacana, bowerbird, fairywren and drongo have stopped
returning results.



Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2022-01-17 17:25:19 +1300, Thomas Munro wrote:
> I reordered the arguments, tested locally under the buildfarm client script,
> and pushed.  I'll keep an eye on the build farm.

After the reloptions fix the tests seem much more likely to succeed than
before. Progress!

Unfortunately we don't quite seem there yet:

I saw a couple failures like:

https://api.cirrus-ci.com/v1/artifact/task/5394938773897216/regress_diffs/build/testrun/recovery/t/027_stream_regress/regression.diffs
(from https://cirrus-ci.com/task/5394938773897216?logs=check_world#L183 )

 -- Should succeed
 DROP TABLESPACE regress_tblspace_renamed;
+ERROR:  tablespace "regress_tblspace_renamed" is not empty


I assume the reason we see this semi-regularly when the regression tests run
as part of 027_stream_regress, but not in the main regression test run, is
similar to the reloptions problem, namely that we run with a much smaller
shared buffers.

I assume what happens is that this just makes the known problem of bgwriter or
some other process keeping open a filehandle to an already deleted relation,
preventing the deletion to "fully" take effect, worse.

Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Sat, Jan 22, 2022 at 8:48 AM Andres Freund <andres@anarazel.de> wrote:
> Unfortunately we don't quite seem there yet:
>
> I saw a couple failures like:
>
https://api.cirrus-ci.com/v1/artifact/task/5394938773897216/regress_diffs/build/testrun/recovery/t/027_stream_regress/regression.diffs
> (from https://cirrus-ci.com/task/5394938773897216?logs=check_world#L183 )
>
>  -- Should succeed
>  DROP TABLESPACE regress_tblspace_renamed;
> +ERROR:  tablespace "regress_tblspace_renamed" is not empty
>
>
> I assume the reason we see this semi-regularly when the regression tests run
> as part of 027_stream_regress, but not in the main regression test run, is
> similar to the reloptions problem, namely that we run with a much smaller
> shared buffers.
>
> I assume what happens is that this just makes the known problem of bgwriter or
> some other process keeping open a filehandle to an already deleted relation,
> preventing the deletion to "fully" take effect, worse.

Right, I assume this would be fixed by [1].  I need to re-convince
myself of that patch's correctness and make some changes after
Robert's feedback; I'll look into committing it next week.  From a
certain point of view it's now quite good that we hit this case
occasionally in CI.

[1] https://commitfest.postgresql.org/36/2962/



Re: A test for replay of regression tests

From
Andrew Dunstan
Date:
On 1/21/22 13:58, Thomas Munro wrote:
> On Fri, Jan 21, 2022 at 3:42 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Thanks.  I added that and pushed.  Let's see if fairywren likes it
>> when it comes back online.
> A watched pot never boils, but I wonder why Andrew's 4 Windows
> configurations jacana, bowerbird, fairywren and drongo have stopped
> returning results.



I think I have unstuck both machines. I will keep an eye on them and
make sure they don't get stuck again.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Sat, Jan 22, 2022 at 8:48 AM Andres Freund <andres@anarazel.de> wrote:
> Unfortunately we don't quite seem there yet:

And another way to fail:

pg_dump: error: query failed: ERROR:  canceling statement due to
conflict with recovery

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-01-22%2003%3A06%3A42

Probably needs hot_standby_feedback on.  Will adjust this soon.

One more failure seen in today's crop was a "stats" failure on
seawasp, which must be the well known pre-existing problem.  (Probably
just needs someone to rewrite the stats subsystem to use shared memory
instead of UDP).



Re: A test for replay of regression tests

From
Andrew Dunstan
Date:
On 1/21/22 16:22, Andrew Dunstan wrote:
> On 1/21/22 13:58, Thomas Munro wrote:
>> On Fri, Jan 21, 2022 at 3:42 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>>> Thanks.  I added that and pushed.  Let's see if fairywren likes it
>>> when it comes back online.
>> A watched pot never boils, but I wonder why Andrew's 4 Windows
>> configurations jacana, bowerbird, fairywren and drongo have stopped
>> returning results.
>
>
> I think I have unstuck both machines. I will keep an eye on them and
> make sure they don't get stuck again.
>
>

fairywren is not happy with the recovery tests still.


I have noticed on a different setup that this test adds 11 minutes to
the runtime of the recovery tests, effectively doubling it. The doubling
is roughly true on faster setups, too. At least I would like a simple
way to disable the test.


I'm not very happy about that.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2022-01-27 15:27:17 -0500, Andrew Dunstan wrote:
> fairywren is not happy with the recovery tests still.

Any more details?


> I have noticed on a different setup that this test adds 11 minutes to the
> runtime of the recovery tests, effectively doubling it. The doubling is
> roughly true on faster setups, too

Does a normal regress run take roughly that long? Or is the problem that the
027_stream_regress.pl ends up defaulting to shared_buffers=1MB, causing lots
of unnecessary IO?


> . At least I would like a simple
> way to disable the test.

One thing we could do to speed up the overall runtime would be to move
027_stream_regress.pl to something numbered earlier. Combined with
PROVE_FLAGS=-j2 that could at least run them in parallel with the rest of the
recovery tests.


Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Fri, Jan 28, 2022 at 9:27 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> I have noticed on a different setup that this test adds 11 minutes to
> the runtime of the recovery tests, effectively doubling it. The doubling
> is roughly true on faster setups, too. At least I would like a simple
> way to disable the test.

Ouch, that's ... a lot.  Some randomly selected times: ~20 seconds
(dev machines), ~40 seconds (Cirrus CI's Windows image), ~2-3 minutes
(very cheap cloud host accounts), ~3 minutes (my Rapsberry Pi pinned
onto two CPU cores), ~11 minutes (your Windows number).  It would be
good to understand why that's such an outlier.

Re skipping, I've also been wondering about an exclusion list to skip
parts of the regression tests that don't really add recovery coverage
but take non-trivial time, like the join tests.



Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2022-01-27 12:47:08 -0800, Andres Freund wrote:
> > I have noticed on a different setup that this test adds 11 minutes to the
> > runtime of the recovery tests, effectively doubling it. The doubling is
> > roughly true on faster setups, too
>
> Does a normal regress run take roughly that long? Or is the problem that the
> 027_stream_regress.pl ends up defaulting to shared_buffers=1MB, causing lots
> of unnecessary IO?

In my msys install a normal regress run takes 57s, 027_stream_regress.pl takes
194s.

It's *not* shared_buffers. Or any of the other postgresql.conf settings. As
far as I can tell.

< tries a bunch of things >

ARGH. It's the utterly broken handling of refused connections on windows. The
pg_regress invocation doesn't specify the host address, just the port.

Now you might reasonably ask, why does that slow things down so much, rather
than working or not working? The problem is that a tcp connect() on windows
doesn't immediately fail when a connection establishment is rejected, but
instead internally retries several times. Which takes 2s.  The reason there
are rejected connections without specifying the host is that Cluster.pm
configures to listen to 127.0.0.1. But the default for libpq/psql is to try
"localhost". Which name resolution returns first as ipv6 (i.e. ::1). Which
takes 2s to fail, upon which libpq goes and tries 127.0.0.1, which works.

That means every single psql started by 027_stream_regress.pl's pg_regress
takes 2s. Which of course adds up...

I'll go and sob in a corner.

Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Andres Freund
Date:
On 2022-01-27 14:03:51 -0800, Andres Freund wrote:
> In my msys install a normal regress run takes 57s, 027_stream_regress.pl takes
> 194s.
>
> That means every single psql started by 027_stream_regress.pl's pg_regress
> takes 2s. Which of course adds up...

Oh, forgot: After adding --host to the pg_regress invocation
027_stream_regress.pl takes 75s (from 194s before).



Re: A test for replay of regression tests

From
Andrew Dunstan
Date:
On 1/27/22 15:47, Andres Freund wrote:
> Hi,
>
> On 2022-01-27 15:27:17 -0500, Andrew Dunstan wrote:
>> fairywren is not happy with the recovery tests still.
> Any more details?



I'll go back and get some.


>
>
>> I have noticed on a different setup that this test adds 11 minutes to the
>> runtime of the recovery tests, effectively doubling it. The doubling is
>> roughly true on faster setups, too
> Does a normal regress run take roughly that long? Or is the problem that the
> 027_stream_regress.pl ends up defaulting to shared_buffers=1MB, causing lots
> of unnecessary IO?


On crake (slowish fedora 34), a normal check run took 95s, and this test
took 114s. On my windows test instance where I noticed this (w10,
msys2/ucrt), check took 516s and this test took 685s.


>
>
>> . At least I would like a simple
>> way to disable the test.
> One thing we could do to speed up the overall runtime would be to move
> 027_stream_regress.pl to something numbered earlier. Combined with
> PROVE_FLAGS=-j2 that could at least run them in parallel with the rest of the
> recovery tests.
>
>

Seems like a bandaid.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Fri, Jan 28, 2022 at 11:03 AM Andres Freund <andres@anarazel.de> wrote:
> That means every single psql started by 027_stream_regress.pl's pg_regress
> takes 2s. Which of course adds up...

That is very surprising, thanks.  Will fix.

I've been experimenting with reusing psql sessions and backends for
queries in TAP tests, since some Windows animals seem to take a
significant fraction of a second *per query* due to forking and
startup costs.  ~100ms or whatever is nothing compared to that ~2000ms
silliness, but it still adds up over thousands of queries.  I'll post
an experimental patch soon, but this discussion has given me the idea
that pg_regress might ideally be able to reuse processes too, at least
sometimes...



Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2022-01-27 17:16:17 -0500, Andrew Dunstan wrote:
> On crake (slowish fedora 34), a normal check run took 95s, and this test
> took 114s.

That's roughly what I see on msys after the fix.


> On my windows test instance where I noticed this (w10,
> msys2/ucrt), check took 516s and this test took 685s.

Hm. That's both excruciatingly slow. Way way slower than what I see here, also
w10, msys2/ucrt.  Any chance the test instance has windows defender running,
without a directory exclusion? I saw that trash performance to a near
standstill.

Does it get better with the attached patch?


I was confused why this didn't fail fatally on CI, which uses
PG_TEST_USE_UNIX_SOCKETS. I think he reason is that pg_regress' use of PGHOST
is busted, btw. It says it'll use PGHOST if --host isn't specified, but it
doesn't work.


         * When testing an existing install, we honor existing environment
         * variables, except if they're overridden by command line options.
         */
        if (hostname != NULL)
        {
            setenv("PGHOST", hostname, 1);
            unsetenv("PGHOSTADDR");
        }

but hostname is initialized in the existing-install case:

#if !defined(HAVE_UNIX_SOCKETS)
    use_unix_sockets = false;
#elif defined(WIN32)

    /*
     * We don't use Unix-domain sockets on Windows by default, even if the
     * build supports them.  (See comment at remove_temp() for a reason.)
     * Override at your own risk.
     */
    use_unix_sockets = getenv("PG_TEST_USE_UNIX_SOCKETS") ? true : false;
#else
    use_unix_sockets = true;
#endif

    if (!use_unix_sockets)
        hostname = "localhost";



Greetings,

Andres Freund

Attachment

Re: A test for replay of regression tests

From
Andrew Dunstan
Date:
On 1/27/22 15:47, Andres Freund wrote:
> Hi,
>
> On 2022-01-27 15:27:17 -0500, Andrew Dunstan wrote:
>> fairywren is not happy with the recovery tests still.
> Any more details?


(Not actually fairywren, but equivalent) It's hung at
src/test/recovery/t/009_twophase.pl line 84:


    $psql_rc = $cur_primary->psql('postgres', "COMMIT PREPARED
    'xact_009_1'");


This is an Amazon EC2 WS2019 instance, of type t3.large i.e. 8Gb of
memory (not the same machine I reported test times from). Perhaps I need
to test on another instance. Note though that when I tested with a
ucrt64 build, including use of the ucrt64 perl/prove, the recovery test
passed on an equivalent instance, so that's probably another reason to
switch fairywren to using the ucrt64 environment.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2022-01-27 14:36:32 -0800, Andres Freund wrote:
> > On my windows test instance where I noticed this (w10,
> > msys2/ucrt), check took 516s and this test took 685s.
> 
> Hm. That's both excruciatingly slow. Way way slower than what I see here, also
> w10, msys2/ucrt.  Any chance the test instance has windows defender running,
> without a directory exclusion? I saw that trash performance to a near
> standstill.

Could you post the regression test output with the timings? Unless it's AV, I
don't see why a windows VM with a moderate amount of memory should take that
long.

Do the test times get less bad if you use PG_TEST_USE_UNIX_SOCKETS=1
PG_REGRESS_SOCK_DIR: "c:/some-dir/"?


I see there's reports that the connection-timeout problem can be a lot worse
on windows, because several applications, e.g. docker, add additional names
for localhost.  Are there any non-commented entries in
C:\Windows\System32\drivers\etc\hosts


> Does it get better with the attached patch?

I pushed something like it now - seemed to be no reason to wait, given it
makes think less slow on my VM.

Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2022-01-27 17:51:52 -0500, Andrew Dunstan wrote:
> (Not actually fairywren, but equivalent) It's hung at
> src/test/recovery/t/009_twophase.pl line 84:
>
>
>     $psql_rc = $cur_primary->psql('postgres', "COMMIT PREPARED
>     'xact_009_1'");

That very likely is the socket-shutdown bug that lead to:

commit 64b2c6507e5714b5c688b9c5cc551fbedb7b3b58
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2022-01-25 12:17:40 -0500

    Revert "graceful shutdown" changes for Windows, in back branches only.

    This reverts commits 6051857fc and ed52c3707, but only in the back
    branches.  Further testing has shown that while those changes do fix
    some things, they also break others; in particular, it looks like
    walreceivers fail to detect walsender-initiated connection close
    reliably if the walsender shuts down this way.  We'll keep trying to
    improve matters in HEAD, but it now seems unwise to push these changes
    into stable releases.

    Discussion: https://postgr.es/m/CA+hUKG+OeoETZQ=Qw5Ub5h3tmwQhBmDA=nuNO3KG=zWfUypFAw@mail.gmail.com

If you apply that commit, does the problem go away?


That's why I'd suggested to revert them in
https://postgr.es/m/20220125023609.5ohu3nslxgoygihl%40alap3.anarazel.de


> This is an Amazon EC2 WS2019 instance, of type t3.large i.e. 8Gb of
> memory (not the same machine I reported test times from). Perhaps I need
> to test on another instance. Note though that when I tested with a
> ucrt64 build, including use of the ucrt64 perl/prove, the recovery test
> passed on an equivalent instance, so that's probably another reason to
> switch fairywren to using the ucrt64 environment.

Without the revert I do get through the tests some of the time - imo likely
that the hang isn't related to the specific msys/mingw environment.

Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Fri, Jan 28, 2022 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:
>     Revert "graceful shutdown" changes for Windows, in back branches only.

FTR I'm actively working on a fix for that one for master now (see
that other thread where the POC survived Alexander's torture testing).



Re: A test for replay of regression tests

From
Andrew Dunstan
Date:
On 1/27/22 18:24, Thomas Munro wrote:
> On Fri, Jan 28, 2022 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:
>>     Revert "graceful shutdown" changes for Windows, in back branches only.
> FTR I'm actively working on a fix for that one for master now (see
> that other thread where the POC survived Alexander's torture testing).



OK, good. A further data point on that: I am not seeing a recovery test
hang or commit_ts test failure on real W10 machines, including jacana. I
am only getting them on WS2019 VMs e.g. drongo/fairywren.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Sat, Jan 22, 2022 at 6:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Jan 22, 2022 at 8:48 AM Andres Freund <andres@anarazel.de> wrote:
> > Unfortunately we don't quite seem there yet:
>
> And another way to fail:
>
> pg_dump: error: query failed: ERROR:  canceling statement due to
> conflict with recovery
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-01-22%2003%3A06%3A42
>
> Probably needs hot_standby_feedback on.  Will adjust this soon.

Seen again today on prairiedog.  Erm, scratch that idea, HS feedback
interferes with test results.  I guess max_standby_streaming_delay
should be increased to 'forever', like in the attached, since pg_dump
runs for a very long time on prairiedog:

2022-02-01 04:47:59.294 EST [3670:15] 027_stream_regress.pl LOG:
statement: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ ONLY
...
2022-02-01 04:49:09.881 EST [3683:2585] 027_stream_regress.pl ERROR:
canceling statement due to conflict with recovery

Attachment

Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2022-02-02 13:59:56 +1300, Thomas Munro wrote:
> Seen again today on prairiedog.  Erm, scratch that idea, HS feedback
> interferes with test results.

It'd not be sufficient anyway, I think. E.g. autovacuum truncating a table
would not be prevented by hs_f I think?


> I guess max_standby_streaming_delay
> should be increased to 'forever', like in the attached

Seems reasonable.


> , since pg_dump runs for a very long time on prairiedog:

> 2022-02-01 04:47:59.294 EST [3670:15] 027_stream_regress.pl LOG:
> statement: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ ONLY
> ...
> 2022-02-01 04:49:09.881 EST [3683:2585] 027_stream_regress.pl ERROR:
> canceling statement due to conflict with recovery

That, uh, seems slow. Is it perhaps waiting for a lock? Seems
Cluster.pm::init() should add at least log_lock_waits...

Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Wed, Feb 2, 2022 at 2:14 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-02-02 13:59:56 +1300, Thomas Munro wrote:
> > 2022-02-01 04:47:59.294 EST [3670:15] 027_stream_regress.pl LOG:
> > statement: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ ONLY
> > ...
> > 2022-02-01 04:49:09.881 EST [3683:2585] 027_stream_regress.pl ERROR:
> > canceling statement due to conflict with recovery
>
> That, uh, seems slow. Is it perhaps waiting for a lock? Seems
> Cluster.pm::init() should add at least log_lock_waits...

I quoted the wrong lines, let me try that again this time for the same
session, the one with pid 3683:

2022-02-01 04:48:38.352 EST [3683:15] 027_stream_regress.pl LOG:
statement: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ ONLY
...
2022-02-01 04:49:09.881 EST [3683:2585] 027_stream_regress.pl ERROR:
canceling statement due to conflict with recovery

It looks like it's processing statements fairly consistently slowly
through the whole period.   Each non-trivial statement takes a bit
under ~10ms, so it would make sense if by the time we've processed
~2.5k lines we've clocked up 30 seconds and a VACUUM replay whacks us.



Re: A test for replay of regression tests

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> It looks like it's processing statements fairly consistently slowly
> through the whole period.   Each non-trivial statement takes a bit
> under ~10ms, so it would make sense if by the time we've processed
> ~2.5k lines we've clocked up 30 seconds and a VACUUM replay whacks us.

This test is set up to time out after 30 seconds?  We've long had
an unofficial baseline that no timeouts under 180 seconds should
be used in the buildfarm.

            regards, tom lane



Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On February 1, 2022 6:11:24 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Thomas Munro <thomas.munro@gmail.com> writes:
>> It looks like it's processing statements fairly consistently slowly
>> through the whole period.   Each non-trivial statement takes a bit
>> under ~10ms, so it would make sense if by the time we've processed
>> ~2.5k lines we've clocked up 30 seconds and a VACUUM replay whacks us.
>
>This test is set up to time out after 30 seconds?  We've long had
>an unofficial baseline that no timeouts under 180 seconds should
>be used in the buildfarm.

30s is the default value of the streaming replay conflict timeout. After that the startup process cancelled the session
runningpg_dump. So it's not an intentional timeout in the test. 

It's not surprising that pg_dump takes 30s on that old a machine. But more than 2min still surprised me. Is that really
dobe expected? 

- Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: A test for replay of regression tests

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> It's not surprising that pg_dump takes 30s on that old a machine. But more than 2min still surprised me. Is that
reallydo be expected? 

In the previous buildfarm run, that dump took just under 31s:

2022-01-31 14:21:10.358 EST [19325:1] [unknown] LOG:  connection received: host=[local]
2022-01-31 14:21:10.367 EST [19325:2] [unknown] LOG:  connection authorized: user=buildfarm database=regression
application_name=027_stream_regress.pl
...
2022-01-31 14:21:41.139 EST [19325:2663] 027_stream_regress.pl LOG:  disconnection: session time: 0:00:30.782
user=buildfarmdatabase=regression host=[local] 

In the failing run, we have:

2022-02-01 04:48:37.757 EST [3683:1] [unknown] LOG:  connection received: host=[local]
2022-02-01 04:48:37.767 EST [3683:2] [unknown] LOG:  connection authorized: user=buildfarm database=regression
application_name=027_stream_regress.pl
...
2022-02-01 04:49:09.719 EST [3683:2584] 027_stream_regress.pl LOG:  statement: COPY public.tenk1 (unique1, unique2,
two,four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4) TO
stdout;
2022-02-01 04:49:09.881 EST [3683:2585] 027_stream_regress.pl ERROR:  canceling statement due to conflict with recovery
2022-02-01 04:49:09.881 EST [3683:2586] 027_stream_regress.pl DETAIL:  User query might have needed to see row versions
thatmust be removed. 
2022-02-01 04:49:09.881 EST [3683:2587] 027_stream_regress.pl STATEMENT:  COPY public.tenk1 (unique1, unique2, two,
four,ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4) TO
stdout;
2022-02-01 04:49:09.889 EST [3685:1] [unknown] LOG:  connection received: host=[local]
2022-02-01 04:49:09.905 EST [3683:2588] 027_stream_regress.pl LOG:  could not send data to client: Broken pipe
2022-02-01 04:49:09.905 EST [3683:2589] 027_stream_regress.pl ERROR:  canceling statement due to conflict with recovery
2022-02-01 04:49:09.905 EST [3683:2590] 027_stream_regress.pl DETAIL:  User query might have needed to see row versions
thatmust be removed. 
2022-02-01 04:49:09.906 EST [3683:2591] 027_stream_regress.pl FATAL:  connection to client lost
2022-02-01 04:49:09.935 EST [3683:2592] 027_stream_regress.pl LOG:  disconnection: session time: 0:00:32.179
user=buildfarmdatabase=regression host=[local] 

That's only a little over 30s.  Where are you getting 2m from?

            regards, tom lane



Re: A test for replay of regression tests

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Seen again today on prairiedog.  Erm, scratch that idea, HS feedback
> interferes with test results.  I guess max_standby_streaming_delay
> should be increased to 'forever', like in the attached, since pg_dump
> runs for a very long time on prairiedog:

FWIW, I'd vote for keeping a finite timeout, but making it say
ten minutes.  If the thing gets stuck for some reason, you don't
really want the test waiting forever.  (Some buildfarm animals
have overall-test-time limits, but I think it's not the default,
and the behavior when that gets hit is pretty unfriendly anyway
-- you don't get any report of the run at all.)

            regards, tom lane



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Wed, Feb 2, 2022 at 3:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Seen again today on prairiedog.  Erm, scratch that idea, HS feedback
> > interferes with test results.  I guess max_standby_streaming_delay
> > should be increased to 'forever', like in the attached, since pg_dump
> > runs for a very long time on prairiedog:
>
> FWIW, I'd vote for keeping a finite timeout, but making it say
> ten minutes.  If the thing gets stuck for some reason, you don't
> really want the test waiting forever.  (Some buildfarm animals
> have overall-test-time limits, but I think it's not the default,
> and the behavior when that gets hit is pretty unfriendly anyway
> -- you don't get any report of the run at all.)

Ok, I've set it to 10 minutes.  Thanks.



Re: A test for replay of regression tests

From
Thomas Munro
Date:
Another failure under 027_stream_regress.pl:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2022-03-16%2005%3A58%3A05

     vacuum                       ... FAILED     3463 ms

I'll try to come up with the perl needed to see the regression.diffs
next time...



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Sun, Mar 20, 2022 at 5:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Another failure under 027_stream_regress.pl:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2022-03-16%2005%3A58%3A05
>
>      vacuum                       ... FAILED     3463 ms
>
> I'll try to come up with the perl needed to see the regression.diffs
> next time...

Here's my proposed change to achieve that.

Here's an example of where it shows up if it fails (from my
deliberately sabotaged CI run
https://cirrus-ci.com/build/6730380228165632 where I was verifying
that it also works on Windows):

Unix:
https://api.cirrus-ci.com/v1/artifact/task/5421419923243008/log/src/test/recovery/tmp_check/log/regress_log_027_stream_regress
Windows:
https://api.cirrus-ci.com/v1/artifact/task/4717732481466368/log/src/test/recovery/tmp_check/log/regress_log_027_stream_regress

Attachment

Re: A test for replay of regression tests

From
Andrew Dunstan
Date:
On 3/20/22 05:36, Thomas Munro wrote:
> On Sun, Mar 20, 2022 at 5:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Another failure under 027_stream_regress.pl:
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2022-03-16%2005%3A58%3A05
>>
>>      vacuum                       ... FAILED     3463 ms
>>
>> I'll try to come up with the perl needed to see the regression.diffs
>> next time...
> Here's my proposed change to achieve that.


I think that's OK.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Mon, Mar 21, 2022 at 2:34 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> On 3/20/22 05:36, Thomas Munro wrote:
> > On Sun, Mar 20, 2022 at 5:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >> I'll try to come up with the perl needed to see the regression.diffs
> >> next time...
> > Here's my proposed change to achieve that.
>
> I think that's OK.

Thanks for looking!  Pushed.



Re: A test for replay of regression tests

From
Masahiko Sawada
Date:
i,

On Mon, Mar 21, 2022 at 5:45 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Mon, Mar 21, 2022 at 2:34 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> > On 3/20/22 05:36, Thomas Munro wrote:
> > > On Sun, Mar 20, 2022 at 5:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > >> I'll try to come up with the perl needed to see the regression.diffs
> > >> next time...
> > > Here's my proposed change to achieve that.
> >
> > I think that's OK.
>
> Thanks for looking!  Pushed.

FYI idiacanthus failed 027_stream_regress.pl:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2022-03-22%2001%3A58%3A04

The log shows:

=== dumping /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/src/test/recovery/tmp_check/regression.diffs
===
diff -U3 /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql/src/test/regress/expected/vacuum.out
/home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/src/test/recovery/tmp_check/results/vacuum.out
--- /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql/src/test/regress/expected/vacuum.out
2021-07-01 19:00:01.936659446 +0200
+++ /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/src/test/recovery/tmp_check/results/vacuum.out
2022-03-22 03:28:09.813377179 +0100
@@ -181,7 +181,7 @@
 SELECT pg_relation_size('vac_truncate_test') = 0;
  ?column?
 ----------
- t
+ f
 (1 row)

 VACUUM (TRUNCATE FALSE, FULL TRUE) vac_truncate_test;
=== EOF ===
not ok 2 - regression tests pass

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Tue, Mar 22, 2022 at 4:31 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>  SELECT pg_relation_size('vac_truncate_test') = 0;
>   ?column?
>  ----------
> - t
> + f

Thanks.  Ahh, déjà vu...  this probably needs the same treatment as
b700f96c and 3414099c provided for the reloptions test.  Well, at
least the first one.  Here's a patch like that.

Attachment

Re: A test for replay of regression tests

From
Peter Geoghegan
Date:
On Mon, Mar 21, 2022 at 8:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Thanks.  Ahh, déjà vu...  this probably needs the same treatment as
> b700f96c and 3414099c provided for the reloptions test.  Well, at
> least the first one.  Here's a patch like that.

If you want to know whether or not the buildfarm will have problems
due to VACUUM failing to get a cleanup lock randomly, then I suggest
that you use an approach like the one from my patch here:

https://postgr.es/m/CAH2-WzkiB-qcsBmWrpzP0nxvrQExoUts1d7TYShg_DrkOHeg4Q@mail.gmail.com

I recently tried it again myself. With the gizmo in place the tests
fail in exactly the same way you've had problems with on the
buildfarm. On the first try, even.

--
Peter Geoghegan



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Fri, Mar 25, 2022 at 4:03 PM Peter Geoghegan <pg@bowt.ie> wrote:
> If you want to know whether or not the buildfarm will have problems
> due to VACUUM failing to get a cleanup lock randomly, then I suggest
> that you use an approach like the one from my patch here:
>
> https://postgr.es/m/CAH2-WzkiB-qcsBmWrpzP0nxvrQExoUts1d7TYShg_DrkOHeg4Q@mail.gmail.com
>
> I recently tried it again myself. With the gizmo in place the tests
> fail in exactly the same way you've had problems with on the
> buildfarm. On the first try, even.

Interesting.  IIUC your chaos gizmo shows that particular vacuum test
still failing on master, but that wouldn't happen in real life because
since 383f2221 it's a temp table.  Your gizmo should probably detect
temp rels, as your comment says.  I was sort of thinking that perhaps
if DISABLE_PAGE_SKIPPING is eventually made to do what its name sounds
like it does, we could remove TEMP from that test and it'd still pass
with the gizmo...



Re: A test for replay of regression tests

From
Peter Geoghegan
Date:
On Thu, Mar 24, 2022 at 8:56 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Interesting.  IIUC your chaos gizmo shows that particular vacuum test
> still failing on master, but that wouldn't happen in real life because
> since 383f2221 it's a temp table.  Your gizmo should probably detect
> temp rels, as your comment says.  I was sort of thinking that perhaps
> if DISABLE_PAGE_SKIPPING is eventually made to do what its name sounds
> like it does, we could remove TEMP from that test and it'd still pass
> with the gizmo...

Why not just use VACUUM FREEZE? That should work, because it won't
settle for a cleanup lock on any page with an XID < OldestXmin. And
even if there were only LP_DEAD items on a page, that wouldn't matter
either, because we don't need a cleanup lock to get rid of those
anymore. And we consistently do all the same steps for rel truncation
in the no-cleanup-lock path (lazy_scan_noprune) now.

I think that DISABLE_PAGE_SKIPPING isn't appropriate for this kind of
thing. It mostly just makes VACUUM not trust the visibility map, which
isn't going to help. While DISABLE_PAGE_SKIPPING also forces
aggressive mode, that isn't going to help either, unless you somehow
also make sure that FreezeLimit is OldestXmin (e.g. by setting
vacuum_freeze_min_age to 0).

VACUUM FREEZE (without DISABLE_PAGE_SKIPPING) seems like it would do
everything you want, without using a temp table. At least on the
master branch.

-- 
Peter Geoghegan



Re: A test for replay of regression tests

From
Andres Freund
Date:
Hi,

On 2022-03-24 21:06:21 -0700, Peter Geoghegan wrote:
> On Thu, Mar 24, 2022 at 8:56 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Interesting.  IIUC your chaos gizmo shows that particular vacuum test
> > still failing on master, but that wouldn't happen in real life because
> > since 383f2221 it's a temp table.  Your gizmo should probably detect
> > temp rels, as your comment says.  I was sort of thinking that perhaps
> > if DISABLE_PAGE_SKIPPING is eventually made to do what its name sounds
> > like it does, we could remove TEMP from that test and it'd still pass
> > with the gizmo...
> 
> Why not just use VACUUM FREEZE? That should work, because it won't
> settle for a cleanup lock on any page with an XID < OldestXmin. And
> even if there were only LP_DEAD items on a page, that wouldn't matter
> either, because we don't need a cleanup lock to get rid of those
> anymore. And we consistently do all the same steps for rel truncation
> in the no-cleanup-lock path (lazy_scan_noprune) now.
> 
> I think that DISABLE_PAGE_SKIPPING isn't appropriate for this kind of
> thing. It mostly just makes VACUUM not trust the visibility map, which
> isn't going to help. While DISABLE_PAGE_SKIPPING also forces
> aggressive mode, that isn't going to help either, unless you somehow
> also make sure that FreezeLimit is OldestXmin (e.g. by setting
> vacuum_freeze_min_age to 0).
> 
> VACUUM FREEZE (without DISABLE_PAGE_SKIPPING) seems like it would do
> everything you want, without using a temp table. At least on the
> master branch.

We tried that in a prior case:
https://postgr.es/m/20220120052404.sonrhq3f3qgplpzj%40alap3.anarazel.de

I don't know if the same danger applies here though.

Greetings,

Andres Freund



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Fri, Mar 25, 2022 at 5:16 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-03-24 21:06:21 -0700, Peter Geoghegan wrote:
> > VACUUM FREEZE (without DISABLE_PAGE_SKIPPING) seems like it would do
> > everything you want, without using a temp table. At least on the
> > master branch.
>
> We tried that in a prior case:
> https://postgr.es/m/20220120052404.sonrhq3f3qgplpzj%40alap3.anarazel.de

Yeah, or really, it was Michael that tried that in commit fe246d1c,
and then we tried more things with 3414099c and b700f96c.  It's a bit
of a belt-and-braces setup admittedly...



Re: A test for replay of regression tests

From
Peter Geoghegan
Date:
On Thu, Mar 24, 2022 at 9:16 PM Andres Freund <andres@anarazel.de> wrote:
> > VACUUM FREEZE (without DISABLE_PAGE_SKIPPING) seems like it would do
> > everything you want, without using a temp table. At least on the
> > master branch.
>
> We tried that in a prior case:
> https://postgr.es/m/20220120052404.sonrhq3f3qgplpzj%40alap3.anarazel.de

Oh, yeah. If some other backend is holding back OldestXmin, and you
can't find a way of dealing with that, then you'll need a temp table.
(Mind you, that trick only works on recent versions too.)

-- 
Peter Geoghegan



Re: A test for replay of regression tests

From
Thomas Munro
Date:
cfbot found another source of nondeterminism in the regression tests,
due to the smaller shared_buffers used in this TAP test:

https://cirrus-ci.com/task/4611828654276608
https://api.cirrus-ci.com/v1/artifact/task/4611828654276608/log/src/test/recovery/tmp_check/regression.diffs

Turned out that we had already diagnosed that once before, when tiny
build farm animal chipmunk reported the same, but we didn't commit a
fix:

https://www.postgresql.org/message-id/flat/CA%2BhUKGLTK6ZuEkpeJ05-MEmvmgZveCh%2B_w013m7%2ByKWFSmRcDA%40mail.gmail.com



Re: A test for replay of regression tests

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> cfbot found another source of nondeterminism in the regression tests,
> due to the smaller shared_buffers used in this TAP test:

This failure seems related but not identical:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=myna&dt=2022-04-02%2004%3A00%3A26

portals.out is expecting that the "foo25ns" cursor will read
starting at the beginning of tenk1, but it's starting somewhere
else, which presumably is a syncscan effect.

I think the fundamental instability here is that this TAP test is
setting shared_buffers small enough to allow the syncscan logic
to kick in where it does not in normal testing.  Maybe we should
just disable syncscan in this test script?

            regards, tom lane



Re: A test for replay of regression tests

From
Tom Lane
Date:
I wrote:
> I think the fundamental instability here is that this TAP test is
> setting shared_buffers small enough to allow the syncscan logic
> to kick in where it does not in normal testing.  Maybe we should
> just disable syncscan in this test script?

Did that, we'll see how much it helps.

            regards, tom lane



Re: A test for replay of regression tests

From
Justin Pryzby
Date:
On Thu, Dec 09, 2021 at 12:10:23PM +1300, Thomas Munro wrote:
> This adds 2 whole minutes to the recovery check, when running with the
> Windows serial-only scripts on Cirrus CI (using Andres's CI patches).
> For Linux it adds ~20 seconds to the total of -j8 check-world.
> Hopefully that's time well spent, because it adds test coverage for
> all the redo routines, and hopefully soon we won't have to run 'em in
> series on Windows.

Should 027-stream-regress be renamed to something that starts earlier ?
Off-list earlier this year, Andres referred to 000.

-- 
Justin



Re: A test for replay of regression tests

From
Thomas Munro
Date:
On Thu, Aug 4, 2022 at 3:30 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Thu, Dec 09, 2021 at 12:10:23PM +1300, Thomas Munro wrote:
> > This adds 2 whole minutes to the recovery check, when running with the
> > Windows serial-only scripts on Cirrus CI (using Andres's CI patches).
> > For Linux it adds ~20 seconds to the total of -j8 check-world.
> > Hopefully that's time well spent, because it adds test coverage for
> > all the redo routines, and hopefully soon we won't have to run 'em in
> > series on Windows.
>
> Should 027-stream-regress be renamed to something that starts earlier ?
> Off-list earlier this year, Andres referred to 000.

Do you have any data on improved times from doing that?

I have wondered about moving it into 001_stream_rep.pl.



Re: A test for replay of regression tests

From
Justin Pryzby
Date:
On Thu, Aug 04, 2022 at 09:24:24AM +1200, Thomas Munro wrote:
> On Thu, Aug 4, 2022 at 3:30 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Thu, Dec 09, 2021 at 12:10:23PM +1300, Thomas Munro wrote:
> > > This adds 2 whole minutes to the recovery check, when running with the
> > > Windows serial-only scripts on Cirrus CI (using Andres's CI patches).
> > > For Linux it adds ~20 seconds to the total of -j8 check-world.
> > > Hopefully that's time well spent, because it adds test coverage for
> > > all the redo routines, and hopefully soon we won't have to run 'em in
> > > series on Windows.
> >
> > Should 027-stream-regress be renamed to something that starts earlier ?
> > Off-list earlier this year, Andres referred to 000.

See also: 
https://www.postgresql.org/message-id/20220213220709.vjz5rziuhfdpqxrg@alap3.anarazel.de

> Do you have any data on improved times from doing that?
> 
> I have wondered about moving it into 001_stream_rep.pl.

The immediate motive for raising the question is due to working on your cygwin
patch (where I've set PROVE_FLAGS=-j3).  The last invocation I have opened ends
like:

[20:46:47.577] [13:46:47] t/026_overwrite_contrecord.pl ........ ok    10264 ms ( 0.02 usr  0.02 sys + 11.25 cusr 35.95
csys= 47.24 CPU)
 
[20:47:08.087] [13:47:08] t/028_pitr_timelines.pl .............. ok    13153 ms ( 0.00 usr  0.00 sys +  4.03 cusr 14.79
csys= 18.82 CPU)
 
[20:47:08.999] [13:47:09] t/029_stats_restart.pl ............... ok    12631 ms ( 0.00 usr  0.02 sys +  7.40 cusr 23.30
csys= 30.71 CPU)
 
[20:47:34.353] [13:47:34] t/031_recovery_conflict.pl ........... ok    11337 ms ( 0.00 usr  0.00 sys +  3.84 cusr 11.82
csys= 15.66 CPU)
 
[20:47:35.070] [13:47:35] t/030_stats_cleanup_replica.pl ....... ok    14054 ms ( 0.02 usr  0.00 sys +  7.64 cusr 25.02
csys= 32.68 CPU)
 
[20:48:04.887] [13:48:04] t/032_relfilenode_reuse.pl ........... ok    12755 ms ( 0.00 usr  0.00 sys +  3.36 cusr 11.57
csys= 14.93 CPU)
 
[20:48:42.055] [13:48:42] t/033_replay_tsp_drops.pl ............ ok    43529 ms ( 0.00 usr  0.00 sys + 12.29 cusr 41.43
csys= 53.71 CPU)
 
[20:50:02.770] [13:50:02] t/027_stream_regress.pl .............. ok   198408 ms ( 0.02 usr  0.06 sys + 44.92 cusr
142.42csys = 187.42 CPU)
 
[20:50:02.771] [13:50:02]
[20:50:02.771] All tests successful.
[20:50:02.771] Files=33, Tests=411, 402 wallclock secs ( 0.16 usr  0.27 sys + 138.03 cusr 441.56 csys = 580.01 CPU)

If 027 had been started sooner, this test might have finished up to 78sec
earlier.  If lots of tests are added in the future, maybe it won't matter, but
it seems like it does now.

As I understand, checks are usually parallelized by "make -j" and not by
"prove".  In that case, starting a slow test later doesn't matter.  But it'd be
better for anyone who runs tap tests manually, and (I think) for meson.

As a one-off test on localhost:
time make check -C src/test/recovery
=> 11m42,790s
time make check -C src/test/recovery PROVE_FLAGS=-j2
=> 7m56,315s

After renaming it to 001:
time make check -C src/test/recovery
=> 11m33,887s (~same)
time make check -C src/test/recovery PROVE_FLAGS=-j2
=> 6m59,969s

I don't know how it affect the buildfarm (but I think that's not optimized
primarily for speed anyway).

-- 
Justin