Thread: A test for replay of regression tests
Hi,

We have automated tests for many specific replication and recovery scenarios, but nothing that tests replay of a wide range of records. People working on recovery code often use installcheck (presumably along with other custom tests) to exercise it, sometimes with wal_consistency_checking enabled. So, why don't we automate that? Aside from exercising the WAL decoding machinery (which brought me here), that'd hopefully provide some decent improvements in coverage of the various redo routines, many of which are not currently exercised at all.

I'm not quite sure where it belongs, though. The attached initial sketch patch puts it under src/test/recovery near other similar things, but I'm not sure if it's really OK to invoke make -C ../regress from here. I copied pg_upgrade/test.sh's trick of using a different outputdir to avoid clobbering a concurrent run under src/test/regress, and I also needed to invent a way to stop it from running the cursed tablespace test (deferring startup of the standby also works but eats way too much space, which I learned after blowing out a smallish development VM's disk). Alternatively, we could put it under src/test/regress, which makes some kind of logical sense, but it would make a quick "make check" take more than twice as long. Perhaps it could use a different target, one that check-world somehow reaches but not check, but that also doesn't seem great. Ideas on how to structure this or improve the perl welcome.
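For anyone who hasn't used the manual workflow described above: it amounts to running installcheck against a primary with a streaming standby attached, after enabling the existing developer option in the primary's postgresql.conf, for example:

```
# postgresql.conf on the primary (illustrative)
# Full-page images are added to WAL so that, after each record is
# replayed, the resulting pages can be compared against the originals.
wal_consistency_checking = all
```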
Attachment
From: Thomas Munro <thomas.munro@gmail.com>
> We have automated tests for many specific replication and recovery scenarios,
> but nothing that tests replay of a wide range of records.
> People working on recovery code often use installcheck (presumably along
> with other custom tests) to exercise it, sometimes with
> wal_consistency_checking enabled. So, why don't we automate that? Aside
> from exercising the WAL decoding machinery (which brought me here), that'd
> hopefully provide some decent improvements in coverage of the various redo
> routines, many of which are not currently exercised at all.
>
> I'm not quite sure where it belongs, though. The attached initial sketch patch

I think that's a good attempt. It'd be better to confirm that the database contents are identical on the primary and standby. But... I remember when I ran make installcheck with a standby connected, then ran pg_dumpall on both the primary and standby and compared the two output files, about 40 lines of difference were observed. Those differences were all about the sequence values. I didn't think about whether I should/can remove the differences.

+# XXX The tablespace tests don't currently work when the standby shares a
+# filesystem with the primary due to colliding absolute paths. We'll skip
+# that for now.

Maybe we had better have a hidden feature that creates tablespace contents in

/tablespace_location/PG_..._<some_name>/

<some_name> is the value of cluster_name or application_name.

Or, we may provide a visible feature that allows users to store tablespace contents in a specified directory regardless of the LOCATION value in CREATE TABLESPACE. For instance, add a GUC like

table_space_directory = '/some_dir'

Then, the tablespace contents are stored in /some_dir/<tablespace_name>/. This may be useful if a DBaaS provider wants to offer some tablespace-based feature, say tablespace transparent data encryption, but doesn't want users to specify filesystem directories.

Regards
Takayuki Tsunakawa
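If the GUC idea above were adopted, the configuration might look something like this (entirely hypothetical - no such setting exists today):

```
# hypothetical GUC sketched in the proposal above; not an existing setting
table_space_directory = '/some_dir'
# CREATE TABLESPACE ts1 LOCATION '/whatever' would then place its
# contents under /some_dir/ts1/ regardless of the LOCATION value
```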
On Fri, Apr 23, 2021 at 6:27 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:
> From: Thomas Munro <thomas.munro@gmail.com>
> > I'm not quite sure where it belongs, though. The attached initial sketch patch
>
> I think that's a good attempt. It'd be better to confirm that the database contents are identical on the primary and standby. But... I remember when I ran make installcheck with a standby connected, then ran pg_dumpall on both the primary and standby and compared the two output files, about 40 lines of difference were observed. Those differences were all about the sequence values. I didn't think about whether I should/can remove the differences.

Interesting idea. I hadn't noticed the thing with sequences before.

> +# XXX The tablespace tests don't currently work when the standby shares a
> +# filesystem with the primary due to colliding absolute paths. We'll skip
> +# that for now.
>
> Maybe we had better have a hidden feature that creates tablespace contents in
>
> /tablespace_location/PG_..._<some_name>/
>
> <some_name> is the value of cluster_name or application_name.
>
> Or, we may provide a visible feature that allows users to store tablespace contents in a specified directory regardless of the LOCATION value in CREATE TABLESPACE. For instance, add a GUC like
>
> table_space_directory = '/some_dir'
>
> Then, the tablespace contents are stored in /some_dir/<tablespace_name>/. This may be useful if a DBaaS provider wants to offer some tablespace-based feature, say tablespace transparent data encryption, but doesn't want users to specify filesystem directories.

Yeah, a few similar ideas have been put forward before, for example in this thread:

https://www.postgresql.org/message-id/flat/CALfoeisEF92F5nJ-aAcuWTvF_Aogxq_1bHLem_kVfM_tHc2mfg%40mail.gmail.com

... but also others.
I would definitely like to fix that problem too (having worked on several things that interact with tablespaces, I definitely want to be able to extend those tests in future patches, and get coverage on the build farm and CI), but with --skip-tests that could be done independently. Apparently the invocation of make failed for some reason on CI (not sure why, works for me), but in any case, that feels a bit clunky and might not ever work on Windows, so perhaps we should figure out how to invoke the pg_regress[.exe] program directly.
Hi,

On 2021-04-23 17:37:48 +1200, Thomas Munro wrote:
> We have automated tests for many specific replication and recovery
> scenarios, but nothing that tests replay of a wide range of records.
> People working on recovery code often use installcheck (presumably
> along with other custom tests) to exercise it, sometimes with
> wal_consistency_checking enabled. So, why don't we automate that? Aside
> from exercising the WAL decoding machinery (which brought me here),
> that'd hopefully provide some decent improvements in coverage of the
> various redo routines, many of which are not currently exercised at
> all.

Yay.

> I'm not quite sure where it belongs, though. The attached initial
> sketch patch puts it under src/test/recovery near other similar things,
> but I'm not sure if it's really OK to invoke make -C ../regress from
> here.

I'd say it's not ok, and we should just invoke pg_regress without make.

> Add a new TAP test under src/test/recovery that runs the regression
> tests with wal_consistency_checking=all.

Hm. I wonder if running with wal_consistency_checking=all doesn't also reduce coverage of some aspects of recovery, by increasing record sizes etc.

> I copied pg_upgrade/test.sh's trick of using a different
> outputdir to avoid clobbering a concurrent run under src/test/regress,
> and I also needed to invent a way to stop it from running the cursed
> tablespace test (deferring startup of the standby also works but eats
> way too much space, which I learned after blowing out a smallish
> development VM's disk).

That's because you are using wal_consistency_checking=all, right? Because IIRC we don't generate that much WAL otherwise?
> +# Create some content on primary and check its presence in standby 1
> +$node_primary->safe_psql('postgres',
> +	"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
> +
> +# Wait for standby to catch up
> +$node_primary->wait_for_catchup($node_standby_1, 'replay',
> +	$node_primary->lsn('insert'));
> +my $result =
> +  $node_standby_1->safe_psql('postgres', "SELECT count(*) FROM tab_int");
> +print "standby 1: $result\n";
> +is($result, qq(1002), 'check streamed content on standby 1');

Why is this needed?

Greetings,

Andres Freund
On 4/23/21 1:37 AM, Thomas Munro wrote:
> Hi,
>
> We have automated tests for many specific replication and recovery
> scenarios, but nothing that tests replay of a wide range of records.
> People working on recovery code often use installcheck (presumably
> along with other custom tests) to exercise it, sometimes with
> wal_consistency_checking enabled. So, why don't we automate that? Aside
> from exercising the WAL decoding machinery (which brought me here),
> that'd hopefully provide some decent improvements in coverage of the
> various redo routines, many of which are not currently exercised at
> all.
>
> I'm not quite sure where it belongs, though. The attached initial
> sketch patch puts it under src/test/recovery near other similar things,
> but I'm not sure if it's really OK to invoke make -C ../regress from
> here. I copied pg_upgrade/test.sh's trick of using a different
> outputdir to avoid clobbering a concurrent run under src/test/regress,
> and I also needed to invent a way to stop it from running the cursed
> tablespace test (deferring startup of the standby also works but eats
> way too much space, which I learned after blowing out a smallish
> development VM's disk). Alternatively, we could put it under
> src/test/regress, which makes some kind of logical sense, but it would
> make a quick "make check" take more than twice as long. Perhaps it
> could use a different target, one that check-world somehow reaches but
> not check, and that also doesn't seem great. Ideas on how to
> structure this or improve the perl welcome.

Nice, I like adding a skip-tests option to pg_regress.

The perl looks ok - I assume those

    print "standby 1: $result\n";

lines are there for debugging. Normally you would just process $result using the Test::More functions.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andres Freund <andres@anarazel.de> writes:
> On 2021-04-23 17:37:48 +1200, Thomas Munro wrote:
>> We have automated tests for many specific replication and recovery
>> scenarios, but nothing that tests replay of a wide range of records.

> Yay.

+1

>> Add a new TAP test under src/test/recovery that runs the regression
>> tests with wal_consistency_checking=all.

> Hm. I wonder if running with wal_consistency_checking=all doesn't also
> reduce coverage of some aspects of recovery, by increasing record sizes
> etc.

Yeah. I found out earlier that wal_consistency_checking=all is an absolute PIG. Running the regression tests that way requires tens of gigabytes of disk space, and a significant amount of time if your disk is not very speedy. If we put this into the buildfarm at all, it would have to be opt-in, not run-by-default, because a lot of BF animals simply don't have the horsepower. I think I'd vote against adding it to check-world, too; the cost/benefit ratio is not good unless you are specifically working on replay functions. And as you say, it alters the behavior enough to make it a little questionable whether we're exercising the "normal" code paths.

The two things I'd say about this are (1) Whether to use wal_consistency_checking, and with what value, needs to be easily adjustable. (2) People will want to run other test suites than the core regression tests, eg contrib modules.

			regards, tom lane
Hi,

On 2021-04-23 11:53:35 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> Hm. I wonder if running with wal_consistency_checking=all doesn't also
>> reduce coverage of some aspects of recovery, by increasing record sizes
>> etc.
>
> Yeah. I found out earlier that wal_consistency_checking=all is an
> absolute PIG. Running the regression tests that way requires tens of
> gigabytes of disk space, and a significant amount of time if your
> disk is not very speedy. If we put this into the buildfarm at all,
> it would have to be opt-in, not run-by-default, because a lot of BF
> animals simply don't have the horsepower. I think I'd vote against
> adding it to check-world, too; the cost/benefit ratio is not good
> unless you are specifically working on replay functions.

I think it'd be a huge improvement to test recovery during check-world by default - it's a huge swath of crucial code that practically has no test coverage. I agree that testing by default with wal_consistency_checking=all isn't feasible due to the time & space overhead, so I think we should not do that.

> The two things I'd say about this are (1) Whether to use
> wal_consistency_checking, and with what value, needs to be
> easily adjustable. (2) People will want to run other test suites
> than the core regression tests, eg contrib modules.

I'm not really excited about tackling 2) initially. I think it's a significant issue that we don't have test coverage for most redo routines and that we should change that with some urgency - even if we back out these WAL prefetch related changes, there've been sufficiently many changes affecting WAL that I think it's worth doing the minimal thing soon.

I don't think there's actually that much need to test contrib modules through recovery - most of them don't seem like they'd add meaningful coverage? The exception is contrib/bloom, but perhaps that'd be better tackled with a dedicated test?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2021-04-23 11:53:35 -0400, Tom Lane wrote:
>> Yeah. I found out earlier that wal_consistency_checking=all is an
>> absolute PIG. Running the regression tests that way requires tens of
>> gigabytes of disk space, and a significant amount of time if your
>> disk is not very speedy. If we put this into the buildfarm at all,
>> it would have to be opt-in, not run-by-default, because a lot of BF
>> animals simply don't have the horsepower. I think I'd vote against
>> adding it to check-world, too; the cost/benefit ratio is not good
>> unless you are specifically working on replay functions.

> I think it'd be a huge improvement to test recovery during check-world
> by default - it's a huge swath of crucial code that practically has no
> test coverage. I agree that testing by default with
> wal_consistency_checking=all isn't feasible due to the time & space
> overhead, so I think we should not do that.

I was mainly objecting to enabling wal_consistency_checking by default. I agree it's bad that we have so little routine test coverage on WAL replay code.

>> The two things I'd say about this are (1) Whether to use
>> wal_consistency_checking, and with what value, needs to be
>> easily adjustable. (2) People will want to run other test suites
>> than the core regression tests, eg contrib modules.

> I don't think there's actually that much need to test contrib modules
> through recovery - most of them don't seem like they'd add meaningful
> coverage? The exception is contrib/bloom, but perhaps that'd be better
> tackled with a dedicated test?

contrib/bloom is indeed the only(?) case within contrib, but in my mind that's a proxy for what people will be needing to test out-of-core AMs. It might not be a test to run by default, but there needs to be a way to do it.

Also, I suspect that there are bits of GIST/GIN/SPGIST that are not well exercised if you don't run the contrib modules that add opclasses for those.

			regards, tom lane
Hi,

On 2021-04-23 13:13:15 -0400, Tom Lane wrote:
> contrib/bloom is indeed the only(?) case within contrib, but in my mind
> that's a proxy for what people will be needing to test out-of-core AMs.
> It might not be a test to run by default, but there needs to be a way
> to do it.

Hm. My experience in the past was that the best way to test an external AM is to run the core regression tests with a different default_table_access_method. That does require some work of ensuring the AM is installed and the relevant extension created, which in turn requires a different test schedule, or hacking template1. So likely a different test script anyway?

> Also, I suspect that there are bits of GIST/GIN/SPGIST that are not
> well exercised if you don't run the contrib modules that add opclasses
> for those.

Possible - still think it'd be best to get the minimal thing in asap, and then try to extend further afterwards... Perfect being the enemy of good and all that.

Greetings,

Andres Freund
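The workflow described above boils down to a one-line configuration change for the cluster that runs the core regression tests; illustratively (the AM name here is a placeholder for whatever out-of-core AM is being tested):

```
# postgresql.conf of the test cluster ('my_external_am' is a placeholder,
# not a real access method; the extension providing it must be installed
# in template1 or via a custom schedule, as noted above)
default_table_access_method = 'my_external_am'
```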
Ok, here's a new version incorporating feedback so far.

1. Invoke pg_regress directly (no make).

2. Use PG_TEST_EXTRA="wal_consistency_checking" as a way to opt in to the more expensive test.

3. Use parallel schedule rather than serial. It's faster but also the non-determinism might discover more things. This required changing the TAP test max_connections setting from 10 to 25.

4. Remove some extraneous print statements and check-if-data-is-replicated-using-SELECT tests that are technically not needed (I had copied those from 001_stream_rep.pl).
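The opt-in mechanism in item 2 amounts to a substring match on the PG_TEST_EXTRA environment variable; a minimal shell rendering of that logic (the value assigned here is just an example - the real test reads whatever the user exported):

```shell
# Sketch of the PG_TEST_EXTRA opt-in check guarding the expensive mode.
PG_TEST_EXTRA="wal_consistency_checking"

# Pad with spaces so we match whole words in a space-separated list.
case " $PG_TEST_EXTRA " in
  *" wal_consistency_checking "*) mode="all" ;;   # expensive mode requested
  *)                              mode="off" ;;   # default: skip it
esac
echo "wal_consistency_checking=$mode"
```

Running the sketch as-is prints `wal_consistency_checking=all`; unsetting the variable flips it to `off`.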
Attachment
Tue, 8 Jun 2021 at 02:25, Thomas Munro <thomas.munro@gmail.com>:
Ok, here's a new version incorporating feedback so far.
1. Invoke pg_regress directly (no make).
2. Use PG_TEST_EXTRA="wal_consistency_checking" as a way to opt in to
the more expensive test.
3. Use parallel schedule rather than serial. It's faster but also
the non-determinism might discover more things. This required
changing the TAP test max_connections setting from 10 to 25.
4. Remove some extraneous print statements and
check-if-data-is-replicated-using-SELECT tests that are technically
not needed (I had copied those from 001_stream_rep.pl).
Thank you for working on this test set!
I was especially glad to see the skip-tests option for pg_regress. I think it will become a very handy tool for hackers.
To try the patch I had to resolve a few merge conflicts, see a rebased version in attachments.
> auth_extra => [ '--create-role', 'repl_role' ]);
This line and the comment above it look like some copy-paste artifacts. Did I get it right? If so, I suggest removing them.
Other than that, the patch looks good to me.
--
Best regards,
Lubennikova Anastasia
Attachment
Tue, 8 Jun 2021 at 02:44, Anastasia Lubennikova <lubennikovaav@gmail.com>:
> Tue, 8 Jun 2021 at 02:25, Thomas Munro <thomas.munro@gmail.com>:
>> Ok, here's a new version incorporating feedback so far.
>>
>> 1. Invoke pg_regress directly (no make).
>> 2. Use PG_TEST_EXTRA="wal_consistency_checking" as a way to opt in to
>> the more expensive test.
>> 3. Use parallel schedule rather than serial. It's faster but also
>> the non-determinism might discover more things. This required
>> changing the TAP test max_connections setting from 10 to 25.
>> 4. Remove some extraneous print statements and
>> check-if-data-is-replicated-using-SELECT tests that are technically
>> not needed (I had copied those from 001_stream_rep.pl).
>
> Thank you for working on this test set!
> I was especially glad to see the skip-tests option for pg_regress. I think it will become a very handy tool for hackers.
>
> To try the patch I had to resolve a few merge conflicts, see a rebased version in attachments.
>
> > auth_extra => [ '--create-role', 'repl_role' ]);
> This line and the comment above it look like some copy-paste artifacts. Did I get it right? If so, I suggest removing them.
> Other than that, the patch looks good to me.
For some reason, it failed on cfbot, so I'll switch it back to 'Waiting on author'.
BTW, do I get it right, that cfbot CI will need some adjustments to print regression.out for this test?
See one more revision of the patch attached. It contains the following changes:
- rebase to recent main
- removed 'auth_extra' piece, that I mentioned above.
- added lacking make clean and gitignore changes.
--
Best regards,
Lubennikova Anastasia
Attachment
On Thu, Jun 10, 2021 at 7:37 PM Anastasia Lubennikova <lubennikovaav@gmail.com> wrote:
> Tue, 8 Jun 2021 at 02:44, Anastasia Lubennikova <lubennikovaav@gmail.com>:
>> Thank you for working on this test set!
>> I was especially glad to see the skip-tests option for pg_regress. I think it will become a very handy tool for hackers.
>>
>> To try the patch I had to resolve a few merge conflicts, see a rebased version in attachments.
>>
>> > auth_extra => [ '--create-role', 'repl_role' ]);
>> This line and the comment above it look like some copy-paste artifacts. Did I get it right? If so, I suggest removing them.
>> Other than that, the patch looks good to me.
>
> For some reason, it failed on cfbot, so I'll switch it back to 'Waiting on author'.
> BTW, do I get it right, that cfbot CI will need some adjustments to print regression.out for this test?
>
> See one more revision of the patch attached. It contains the following changes:
> - rebase to recent main
> - removed 'auth_extra' piece, that I mentioned above.
> - added lacking make clean and gitignore changes.

Thanks! Yeah, there does seem to be a mysterious CI failure there, not reproducible locally for me. You're right that it's not dumping enough information to diagnose the problem... I will look into it tomorrow.
On Thu, Jun 10, 2021 at 7:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Jun 10, 2021 at 7:37 PM Anastasia Lubennikova
> <lubennikovaav@gmail.com> wrote:
> > For some reason, it failed on cfbot, so I'll switch it back to 'Waiting on author'.

Sorry for the delay. I got stuck in a CI loop trying to make a new improved version work on Windows, so far without success. If anyone can tell me what's wrong on that OS I'd be very grateful to hear it. I don't know if it's something I haven't understood about reparse points, or something to do with "restricted tokens" and account privileges. The symptoms on Cirrus are:

 DROP TABLESPACE regress_tblspacewith;
+WARNING:  could not open directory "pg_tblspc/16386/PG_15_202109101": No such file or directory
+WARNING:  could not stat file "pg_tblspc/16386": No such file or directory

The short explanation of this new version is that the tablespace tests now work on a pair of local nodes, because you can do this sort of thing:

postgres=# create tablespace xxx location 'pg_user_files/xxx';
ERROR:  directory "pg_user_files/xxx" does not exist
postgres=# create tablespace xxx new location 'pg_user_files/xxx';
CREATE TABLESPACE

Patches:

0001: Allow restricted relative tablespace paths.

Rationale: I really want to be able to run the tablespace test with a local replica, instead of just skipping it (including but not only from this new TAP test). After re-reading a bunch of threads about how to achieve that that never went anywhere, and considering various ideas already presented, I wondered if we could agree on allowing relative paths under one specific directory "pg_user_files" (a directory that PostgreSQL itself will completely ignore). Better names welcome.

I wonder if for Windows we'd want to switch to real symlinks, since, as far as I know from some light reading, reparse points are converted to absolute paths on creation, so the pgdata directory would be less portable than it would be on POSIX systems.

0002: CREATE TABLESPACE ... NEW LOCATION.

The new syntax "NEW" says that it's OK if the directory doesn't exist yet, we'll just create it.

Rationale: With relative paths, it's tricky for pg_regress to find the data directory of the primary server + any streaming replicas that may be downstream from it (and possibly remote) to create the directory, but the server can do it easily. Better syntax welcome. (I originally wanted to use WITH (<something>) but that syntax is tangled up with persistent relopts.)

0003: Use relative paths for tablespace regression test.

Remove the pg_regress logic for creating the directory, and switch to relative paths using the above.

0004: Test replay of regression tests.

Same as before, this adds a replicated run of the regression tests in src/test/recovery/t/027_stream_regress.pl, with an optional expensive mode that you can enable with PG_TEST_EXTRA="wal_consistency_checking". I removed the useless --create-role as pointed out by Anastasia. I added a step to compare the contents of the primary and replica servers with pg_dump, as suggested by Tsunakawa-san.

I think the way I pass in the psql source directory to --bindir is not good, but I've reached my daily limit of Perl; how should I be specifying the tmp_install bin directory here? This is so pg_regress can find psql.

    system_or_bail("../regress/pg_regress",
        "--bindir=../../bin/psql",
        "--port=" . $node_primary->port,
        "--schedule=../regress/parallel_schedule",
        "--dlpath=../regress",
        "--inputdir=../regress");
Attachment
On Wed, Oct 6, 2021 at 7:10 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I wonder if for Windows we'd want to switch to real symlinks, since,
> as far as I know from some light reading, reparse points are converted
> to absolute paths on creation, so the pgdata directory would be less
> portable than it would be on POSIX systems.

I looked into that a bit, and it seems that the newer "real" symlinks can't be created without admin privileges, unlike junction points, so that wouldn't help. I still need to figure out what this junction-based patch set is doing wrong on Windows, though, trial-by-CI. In the meantime, here is a rebase over recent changes to the Perl testing APIs.
Attachment
On 11/23/21 04:07, Thomas Munro wrote:
> On Wed, Oct 6, 2021 at 7:10 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> I wonder if for Windows we'd want to switch to real symlinks, since,
>> as far as I know from some light reading, reparse points are converted
>> to absolute paths on creation, so the pgdata directory would be less
>> portable than it would be on POSIX systems.
> I looked into that a bit, and it seems that the newer "real" symlinks
> can't be created without admin privileges, unlike junction points, so
> that wouldn't help. I still need to figure out what this
> junction-based patch set is doing wrong on Windows though trial-by-CI.
> In the meantime, here is a rebase over recent changes to the Perl
> testing APIs.

More exactly, you need the "Create Symbolic Links" privilege. See <https://github.com/git-for-windows/git/wiki/Symbolic-Links>

I haven't yet worked out a way of granting that from the command line, but if it's working the buildfarm client (as of git tip) will use it on windows for the workdirs space saving feature.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 11/23/21 10:47, Andrew Dunstan wrote:
> On 11/23/21 04:07, Thomas Munro wrote:
>> I looked into that a bit, and it seems that the newer "real" symlinks
>> can't be created without admin privileges, unlike junction points, so
>> that wouldn't help. I still need to figure out what this
>> junction-based patch set is doing wrong on Windows though trial-by-CI.
>> In the meantime, here is a rebase over recent changes to the Perl
>> testing APIs.
>
> More exactly, you need the "Create Symbolic Links" privilege. See
> <https://github.com/git-for-windows/git/wiki/Symbolic-Links>
>
> I haven't yet worked out a way of granting that from the command line,
> but if it's working the buildfarm client (as of git tip) will use it on
> windows for the workdirs space saving feature.

Update: There is a PowerShell module called Carbon which provides this and a whole lot more. It can be installed in numerous ways, including via Chocolatey. Here's what I am using:

    choco install -y Carbon
    Import-Module Carbon
    Grant-CPrivilege -Identity myuser -Privilege SeCreateSymbolicLinkPrivilege

See <https://get-carbon.org/Grant-Privilege.html>. The command name I used above is now the preferred spelling, although that's not reflected on the manual page.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Tue, Nov 30, 2021 at 3:14 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> choco install -y Carbon
> Import-Module Carbon
> Grant-CPrivilege -Identity myuser -Privilege SeCreateSymbolicLinkPrivilege

Interesting. Well, I found the problem with my last patch (to wit: junction points must be absolute, unlike real symlinks, which I'd considered already, but I missed that tmp_check's DataDir had a stray internal \.\), and now I'm wondering whether these newer real symlinks could help. The constraints are pretty hard to work with... I thought about a couple of options:

1. We could try to use real symlinks, and fall back to junction points if that fails. That means that these new tests I'm proposing would fail unless you grant that privilege or run in developer mode as you were saying. It bothers me a bit that developers and the BF would be testing a different code path than production databases run... unless you're thinking we should switch to symlinks with no fallback, and require that privilege to be granted by end users to production servers, at least if they want to use tablespaces, and also drop pre-Win10 support in the same release? That's bigger than I was thinking.

2. We could convert relative paths to absolute paths at junction point creation time, which I tried, and "check" now passes. Problems: (1) now you can't move pgdata around, (2) the is-the-path-too-long check performed on a primary is not sufficient to check if the transformed absolute path will be too long on a replica.

The most conservative simple idea I have so far is: go with option 2, but make this whole thing an undocumented developer-only mode, and turn it on in the regression tests. Here's a patch set like that. Thoughts?

Another option would be to stop using operating system symlinks and build the target paths ourselves; I didn't investigate that, as it seemed like a bigger change than this warrants.
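Option 2 boils down to resolving the relative location against the data directory before the junction point is created; in shell terms, the transformation looks roughly like this (the paths are purely illustrative, and the real change would live in the server's C code):

```shell
# Illustrative only: what option 2 does -- anchor a relative tablespace
# location under PGDATA before creating the junction point.
PGDATA="/srv/pgdata"            # invented data-directory path
loc="pg_user_files/xxx"         # relative location from the example above

case "$loc" in
  /*) abs="$loc" ;;             # already absolute: keep as-is
  *)  abs="$PGDATA/$loc" ;;     # relative: resolve against the data directory
esac
echo "$abs"
```

This also makes the two drawbacks visible: the resolved path bakes in PGDATA (so pgdata can't be moved), and its length depends on each node's own PGDATA (so a primary-side length check can't vouch for a replica).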
Next problem: The below is clearly not the right way to find the pg_regress binary and bindir, and doesn't work on Windows or VPATH. Any suggestions for how to do this? I feel like something like $node->installed_command() or similar logic is needed...

    # Run the regression tests against the primary.
    # XXX How should we find the pg_regress binary and bindir?
    system_or_bail("../regress/pg_regress",
        "--bindir=../../bin/psql",
        "--port=" . $node_primary->port,
        "--schedule=../regress/parallel_schedule",
        "--dlpath=../regress",
        "--inputdir=../regress");

BTW 0002 is one of those renaming patches from git that GNU patch doesn't seem to apply correctly, sorry cfbot...
Attachment
On 12/3/21 23:21, Thomas Munro wrote:
> Next problem: The below is clearly not the right way to find the
> pg_regress binary and bindir, and doesn't work on Windows or VPATH.
> Any suggestions for how to do this? I feel like something like
> $node->installed_command() or similar logic is needed...
>
> # Run the regression tests against the primary.
> # XXX How should we find the pg_regress binary and bindir?
> system_or_bail("../regress/pg_regress",
>     "--bindir=../../bin/psql",
>     "--port=" . $node_primary->port,
>     "--schedule=../regress/parallel_schedule",
>     "--dlpath=../regress",
>     "--inputdir=../regress");

TAP tests are passed a path to pg_regress as $ENV{PG_REGRESS}. You should be using that. On non-MSVC, the path to a non-installed psql is in this case "$TESTDIR/../../bin/psql" - this should work for VPATH builds as well as non-VPATH. On MSVC it's a bit harder - it's "$top_builddir/$releasetype/psql" but we don't expose that. Perhaps we should, c.f. commit f4ce6c4d3a.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
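Putting those pointers together, the lookup the TAP test needs amounts to reading two harness-provided environment variables and deriving bindir from TESTDIR; a shell rendering of that assembly (the fallback values here are invented, purely so the sketch runs standalone):

```shell
# Sketch: assemble the pg_regress invocation from the variables the TAP
# harness exports (PG_REGRESS, TESTDIR). Fallbacks are made-up examples.
PG_REGRESS="${PG_REGRESS:-/path/to/pg_regress}"
TESTDIR="${TESTDIR:-.}"

bindir="$TESTDIR/../../bin"   # non-MSVC location of the non-installed psql
cmd="$PG_REGRESS --bindir=$bindir --schedule=../regress/parallel_schedule"
echo "$cmd"
```

As Andrew notes, the MSVC case would need a different bindir ("$top_builddir/$releasetype"), which the harness does not currently expose.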
On Sun, Dec 5, 2021 at 4:16 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> TAP tests are passed a path to pg_regress as $ENV{PG_REGRESS}. You
> should be using that. On non-MSVC, the path to a non-installed psql is
> in this case "$TESTDIR/../../bin/psql" - this should work for VPATH
> builds as well as non-VPATH. On MSVC it's a bit harder - it's
> "$top_builddir/$releasetype/psql" but we don't expose that. Perhaps we
> should. c.f. commit f4ce6c4d3a

Thanks, that helped. Here's a new version that passes on Windows, Unix and Unix with VPATH. I also had to figure out where the DLLs are, and make sure that various output files go to the build directory, not the source directory, if different, which I did by passing down another similar environment variable. Better ideas? (It confused me for some time that make follows the symlink and runs the perl code from inside the source directory.)

This adds 2 whole minutes to the recovery check, when running with the Windows serial-only scripts on Cirrus CI (using Andres's CI patches). For Linux it adds ~20 seconds to the total of -j8 check-world. Hopefully that's time well spent, because it adds test coverage for all the redo routines, and hopefully soon we won't have to run 'em in series on Windows.

Does anyone want to object to the concept of the "pg_user_files" directory or the developer-only GUC "allow_relative_tablespaces"? There's room for discussion about names; maybe initdb shouldn't create the directory unless you ask it to, or something.
Attachment
On 12/8/21 18:10, Thomas Munro wrote: > On Sun, Dec 5, 2021 at 4:16 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> TAP tests are passed a path to pg_regress as $ENV{PG_REGRESS}. You >> should be using that. On non-MSVC, the path to a non-installed psql is >> in this case "$TESTDIR/../../bin/psql" - this should work for VPATH >> builds as well as non-VPATH. On MSVC it's a bit harder - it's >> "$top_builddir/$releasetype/psql" but we don't expose that. Perhaps we >> should. c.f. commit f4ce6c4d3a > Thanks, that helped. Here's a new version that passes on Windows, > Unix and Unix with VPATH. I also had to figure out where the DLLs > are, and make sure that various output files go to the build > directory, not source directory, if different, which I did by passing > down another similar environment variable. Better ideas? (It > confused me for some time that make follows the symlink and runs the > perl code from inside the source directory.) The new version appears to set an empty --bindir for pg_regress. Is that right? > This adds 2 whole minutes to the recovery check, when running with the > Windows serial-only scripts on Cirrus CI (using Andres's CI patches). > For Linux it adds ~20 seconds to the total of -j8 check-world. > Hopefully that's time well spent, because it adds test coverage for > all the redo routines, and hopefully soon we won't have to run 'em in > series on Windows. > > Does anyone want to object to the concept of the "pg_user_files" > directory or the developer-only GUC "allow_relative_tablespaces"? > There's room for discussion about names; maybe initdb shouldn't create > the directory unless you ask it to, or something. I'm slightly worried that some bright spark will discover it and think it's a good idea for a production setup. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Hi, On 2021-12-09 08:12:14 -0500, Andrew Dunstan wrote: > > Does anyone want to object to the concept of the "pg_user_files" > > directory or the developer-only GUC "allow_relative_tablespaces"? > > There's room for discussion about names; maybe initdb shouldn't create > > the directory unless you ask it to, or something. Personally I'd rather put relative tablespaces into a dedicated directory or just into pg_tblspc, but without a symlink. Some tools need to understand tablespace layout etc, and having them in a directory that, by the name, will also contain other things seems likely to cause confusion. > I'm slightly worried that some bright spark will discover it and think > it's a good idea for a production setup. It'd not really be worse than the current situation of accidentally corrupting a local replica or such :/. Greetings, Andres Freund
On Fri, Dec 10, 2021 at 2:12 AM Andrew Dunstan <andrew@dunslane.net> wrote: > The new version appears to set an empty --bindir for pg_regress. Is that > right? It seems to be necessary to find eg psql, since --bindir='' means "expect $PATH to contain the installed binaries", and that's working on both build systems. The alternative would be to export yet another environment variable, $PG_INSTALL or such -- do you think that'd be better, or did I miss something that exists already like that?
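As an aside, the PATH-based lookup that an empty --bindir relies on can be illustrated with a small sketch. This is Python purely for illustration (pg_regress itself is C and simply execs bare program names, letting the OS search $PATH); shutil.which() performs the equivalent search:

```python
import shutil

def resolve(program):
    # With an empty --bindir, pg_regress runs bare names like "psql",
    # so the OS resolves them via $PATH.  shutil.which() does the same
    # search, returning the first matching executable path, or None.
    return shutil.which(program)

# "sh" exists on any POSIX system; a made-up name does not.
print(resolve('sh'))
print(resolve('definitely-not-a-real-binary-12345'))  # → None
```

This is why the approach only works when the installed (or temp-installed) binaries are already on $PATH, as they are under the TAP framework.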
On Fri, Dec 10, 2021 at 8:38 AM Andres Freund <andres@anarazel.de> wrote: > On 2021-12-09 08:12:14 -0500, Andrew Dunstan wrote: > > > Does anyone want to object to the concept of the "pg_user_files" > > > directory or the developer-only GUC "allow_relative_tablespaces"? > > > There's room for discussion about names; maybe initdb shouldn't create > > > the directory unless you ask it to, or something. > > Personally I'd rather put relative tablespaces into a dedicated directory or > just into pg_tblspc, but without a symlink. Some tools need to understand > tablespace layout etc, and having them in a directory that, by the name, will > also contain other things seems likely to cause confusion. Alright, let me try it that way... more soon.
Hi,

On 2021-12-09 12:10:23 +1300, Thomas Munro wrote:
> From a60ada37f3ff7d311d56fe453b2abeadf0b350e5 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Wed, 4 Aug 2021 22:17:54 +1200
> Subject: [PATCH v8 2/5] Use relative paths for tablespace regression test.
>
> Remove the machinery from pg_regress that manages the testtablespace
> directory.  Instead, use a relative path.
>
> Discussion: https://postgr.es/m/CA%2BhUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk%2BHeskxg%40mail.gmail.com

Seems like we ought to add a tiny tap test or such for this - otherwise we
won't have any coverage of "normal" tablespaces? I don't think they need to
be exercised as part of the normal tests, but having some very basic testing
in a tap test seems worthwhile.

> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
> index 9467a199c8..5cfa137cde 100644
> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
> @@ -460,7 +460,7 @@ sub init
>  		print $conf "hot_standby = on\n";
>  		# conservative settings to ensure we can run multiple postmasters:
>  		print $conf "shared_buffers = 1MB\n";
> -		print $conf "max_connections = 10\n";
> +		print $conf "max_connections = 25\n";
>  		# limit disk space consumption, too:
>  		print $conf "max_wal_size = 128MB\n";
>  	}

What's the relation of this to the rest?

> +# Perform a logical dump of primary and standby, and check that they match
> +command_ok(
> +	[ "pg_dump", '-f', $outputdir . '/primary.dump', '--no-sync',
> +	  '-p', $node_primary->port, 'regression' ],
> +	"dump primary server");
> +command_ok(
> +	[ "pg_dump", '-f', $outputdir . '/standby.dump', '--no-sync',
> +	  '-p', $node_standby_1->port, 'regression' ],
> +	"dump standby server");
> +command_ok(
> +	[ "diff", $outputdir . '/primary.dump', $outputdir . '/standby.dump' ],
> +	"compare primary and standby dumps");
> +

Absurd nitpick: What's the deal with using "" for one part, and '' for the
rest?

Separately: I think the case of seeing diffs will be too hard to debug like
this, as the difference isn't shown afaict?

Greetings,

Andres Freund
On Fri, Dec 10, 2021 at 10:36 AM Andres Freund <andres@anarazel.de> wrote:
> Seems like we ought to add a tiny tap test or such for this - otherwise we
> won't have any coverage of "normal" tablespaces? I don't think they need to
> be exercised as part of the normal tests, but having some very basic testing
> in a tap test seems worthwhile.

Good idea, that was bothering me too.  Done.

> > -	print $conf "max_connections = 10\n";
> > +	print $conf "max_connections = 25\n";
>
> What's the relation of this to the rest?

Someone decided that allow_streaming should imply max_connections = 10,
but we need ~20 to run the parallel regression test schedule.  However, I
can just as easily move that to a local adjustment in the TAP test file.
Done, like so:

+$node_primary->adjust_conf('postgresql.conf', 'max_connections', '25', 1);

> Absurd nitpick: What's the deal with using "" for one part, and '' for the
> rest?

Fixed.

> Separately: I think the case of seeing diffs will be too hard to debug like
> this, as the difference isn't shown afaict?

Seems to be OK.  The output goes to
src/test/recovery/tmp_check/log/regress_log_027_stream_regress, so for
example if you comment out the bit that deals with SEQUENCE caching you'll
see:

# Running: pg_dump -f /usr/home/tmunro/projects/postgresql/src/test/recovery/primary.dump --no-sync -p 63693 regression
ok 2 - dump primary server
# Running: pg_dump -f /usr/home/tmunro/projects/postgresql/src/test/recovery/standby.dump --no-sync -p 63694 regression
ok 3 - dump standby server
# Running: diff /usr/home/tmunro/projects/postgresql/src/test/recovery/primary.dump /usr/home/tmunro/projects/postgresql/src/test/recovery/standby.dump
436953c436953
< SELECT pg_catalog.setval('public.clstr_tst_s_rf_a_seq', 32, true);
---
> SELECT pg_catalog.setval('public.clstr_tst_s_rf_a_seq', 33, true);

... more hunks ...
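The dump-and-diff check being discussed can be sketched outside the TAP framework. The following minimal illustration (Python rather than the thread's Perl, and the dump contents are invented stand-ins for real pg_dump output) shows how comparing two logical dumps surfaces exactly the kind of setval hunk seen in the log above:

```python
import difflib

# Stand-in dump contents; in the real test these come from running
# pg_dump against the primary and the standby respectively.
primary_dump = [
    "CREATE TABLE t (a int);\n",
    "SELECT pg_catalog.setval('public.s', 32, true);\n",
]
standby_dump = [
    "CREATE TABLE t (a int);\n",
    "SELECT pg_catalog.setval('public.s', 33, true);\n",
]

# unified_diff yields nothing when the inputs match, so an empty result
# means the standby replayed to the same logical state as the primary.
diff = list(difflib.unified_diff(primary_dump, standby_dump,
                                 fromfile='primary.dump',
                                 tofile='standby.dump'))
for line in diff:
    print(line, end='')
```

A matching pair of dumps produces no output at all, which is the property the `diff` exit status in the TAP test relies on.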
And from the previous email: On Fri, Dec 10, 2021 at 10:35 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Dec 10, 2021 at 8:38 AM Andres Freund <andres@anarazel.de> wrote: > > Personally I'd rather put relative tablespaces into a dedicated directory or > > just into pg_tblspc, but without a symlink. Some tools need to understand > > tablespace layout etc, and having them in a directory that, by the name, will > > also contain other things seems likely to cause confusion. Ok, in this version I have a developer-only GUC allow_in_place_tablespaces instead. If you turn it on, you can do: CREATE TABLESPACE regress_blah LOCATION = ''; ... and then pg_tblspc/OID is created directly as a directory. Not allowed by default or documented.
Attachment
On Fri, Dec 10, 2021 at 12:58 PM Thomas Munro <thomas.munro@gmail.com> wrote: > +$node_primary->adjust_conf('postgresql.conf', 'max_connections', '25', 1); Erm, in fact this requirement came about in an earlier version where I was invoking make and getting --max-concurrent-tests=20 from src/test/regress/GNUmakefile. Which I should probably replicate here...
On 12/9/21 15:15, Thomas Munro wrote: > On Fri, Dec 10, 2021 at 2:12 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> The new version appears to set an empty --bindir for pg_regress. Is that >> right? > It seems to be necessary to find eg psql, since --bindir='' means > "expect $PATH to contain the installed binaries", and that's working > on both build systems. The alternative would be to export yet another > environment variable, $PG_INSTALL or such -- do you think that'd be > better, or did I miss something that exists already like that? No, that seems ok. Might be worth a comment. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Hi, On 2021-12-10 12:58:01 +1300, Thomas Munro wrote: > > What's the relation of this to the rest? > > Someone decided that allow_streaming should imply max_connections = > 10, but we need ~20 to run the parallel regression test schedule. > However, I can just as easily move that to a local adjustment in the > TAP test file. Done, like so: > > +$node_primary->adjust_conf('postgresql.conf', 'max_connections', '25', 1); Possible that this will cause problem on some *BSD platform with a limited count of semaphores. But we can deal with that if / when it happens. > > Separately: I think the case of seeing diffs will be too hard to debug like > > this, as the difference isn't shown afaict? > > Seems to be OK. The output goes to > src/test/recovery/tmp_check/log/regress_log_027_stream_regress, so for > example if you comment out the bit that deals with SEQUENCE caching > you'll see: Ah, ok. Not sure what I thought there... > On Fri, Dec 10, 2021 at 10:35 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Dec 10, 2021 at 8:38 AM Andres Freund <andres@anarazel.de> wrote: > > > Personally I'd rather put relative tablespaces into a dedicated directory or > > > just into pg_tblspc, but without a symlink. Some tools need to understand > > > tablespace layout etc, and having them in a directory that, by the name, will > > > also contain other things seems likely to cause confusion. > > Ok, in this version I have a developer-only GUC > allow_in_place_tablespaces instead. If you turn it on, you can do: > > CREATE TABLESPACE regress_blah LOCATION = ''; > ... and then pg_tblspc/OID is created directly as a directory. Not > allowed by default or documented. WFM. I think we might eventually want to allow it by default, but we can deal with that whenever somebody wants to spend the energy doing so. 
> @@ -590,16 +595,35 @@ create_tablespace_directories(const char *location, const Oid tablespaceoid)
>  	char	   *linkloc;
>  	char	   *location_with_version_dir;
>  	struct stat st;
> +	bool		in_place;
>
>  	linkloc = psprintf("pg_tblspc/%u", tablespaceoid);
> -	location_with_version_dir = psprintf("%s/%s", location,
> +
> +	/*
> +	 * If we're asked to make an 'in place' tablespace, create the directory
> +	 * directly where the symlink would normally go.  This is a developer-only
> +	 * option for now, to facilitate regression testing.
> +	 */
> +	in_place =
> +		(allow_in_place_tablespaces || InRecovery) && strlen(location) == 0;

Why is in_place set to true by InRecovery? ISTM that
allow_in_place_tablespaces should be checked in CreateTableSpace(), and
create_tablespace_directories() should just do whatever it's told?
Otherwise it seems there's ample potential for confusion, e.g. because of
the GUC differing between primary and replica, or between crash and a new
start.

> +	if (in_place)
> +	{
> +		if (MakePGDirectory(linkloc) < 0 && errno != EEXIST)
> +			ereport(ERROR,
> +					(errcode_for_file_access(),
> +					 errmsg("could not create directory \"%s\": %m",
> +							linkloc)));
> +	}
> +
> +	location_with_version_dir = psprintf("%s/%s", in_place ? linkloc : location,
>  										 TABLESPACE_VERSION_DIRECTORY);
>
>  	/*
>  	 * Attempt to coerce target directory to safe permissions.  If this fails,
>  	 * it doesn't exist or has the wrong owner.
>  	 */
> -	if (chmod(location, pg_dir_create_mode) != 0)
> +	if (!in_place && chmod(location, pg_dir_create_mode) != 0)
>  	{
>  		if (errno == ENOENT)
>  			ereport(ERROR,

Maybe add a comment saying that we don't need to chmod here because
MakePGDirectory() takes care of that?

> @@ -648,13 +672,13 @@ create_tablespace_directories(const char *location, const Oid tablespaceoid)
>  	/*
>  	 * In recovery, remove old symlink, in case it points to the wrong place.
>  	 */
> -	if (InRecovery)
> +	if (!in_place && InRecovery)
>  		remove_tablespace_symlink(linkloc);

I don't think it's right to check !in_place as you do here, given that it
currently depends on a GUC setting that possibly differs between WAL
generation and replay time?

> --- a/src/test/regress/output/tablespace.source
> +++ b/src/test/regress/expected/tablespace.out
> @@ -1,7 +1,18 @@
> +-- relative tablespace locations are not allowed
> +CREATE TABLESPACE regress_tblspace LOCATION 'relative'; -- fail
> +ERROR:  tablespace location must be an absolute path
> +-- empty tablespace locations are not usually allowed
> +CREATE TABLESPACE regress_tblspace LOCATION ''; -- fail
> +ERROR:  tablespace location must be an absolute path
> +-- as a special developer-only option to allow us to use tablespaces
> +-- with streaming replication on the same server, an empty location
> +-- can be allowed as a way to say that the tablespace should be created
> +-- as a directory in pg_tblspc, rather than being a symlink
> +SET allow_in_place_tablespaces = true;
>  -- create a tablespace using WITH clause
> -CREATE TABLESPACE regress_tblspacewith LOCATION '@testtablespace@' WITH (some_nonexistent_parameter = true); -- fail
> +CREATE TABLESPACE regress_tblspacewith LOCATION '' WITH (some_nonexistent_parameter = true); -- fail
>  ERROR:  unrecognized parameter "some_nonexistent_parameter"
> -CREATE TABLESPACE regress_tblspacewith LOCATION '@testtablespace@' WITH (random_page_cost = 3.0); -- ok
> +CREATE TABLESPACE regress_tblspacewith LOCATION '' WITH (random_page_cost = 3.0); -- ok

Perhaps also add a test that we catch "in-place" tablespace creation with
allow_in_place_tablespaces = false? Although perhaps that's better done in
the previous commit...

> +++ b/src/test/modules/test_misc/t/002_tablespace.pl

Two minor things that I think would be worth testing here:

1) Moving between two "absolute" tablespaces. That could conceivably break
differently between in-place and relative tablespaces.

2) Moving between absolute and relative tablespaces.

> +# required for 027_stream_regress.pl
> +REGRESS_OUTPUTDIR=$(abs_top_builddir)/src/test/recovery
> +export REGRESS_OUTPUTDIR

Why do we need this?

> +# Initialize primary node
> +my $node_primary = PostgreSQL::Test::Cluster->new('primary');
> +$node_primary->init(allows_streaming => 1);
> +$node_primary->adjust_conf('postgresql.conf', 'max_connections', '25', 1);

Probably should set at least max_prepared_transactions > 0, so the tests
requiring prepared xacts can work. They have nontrivial replay routines, so
that seems worthwhile?

> +# Perform a logical dump of primary and standby, and check that they match
> +command_ok(
> +	[ 'pg_dump', '-f', $outputdir . '/primary.dump', '--no-sync',
> +	  '-p', $node_primary->port, 'regression' ],
> +	'dump primary server');
> +command_ok(
> +	[ 'pg_dump', '-f', $outputdir . '/standby.dump', '--no-sync',
> +	  '-p', $node_standby_1->port, 'regression' ],
> +	'dump standby server');
> +command_ok(
> +	[ 'diff', $outputdir . '/primary.dump', $outputdir . '/standby.dump' ],
> +	'compare primary and standby dumps');
> +
> +$node_standby_1->stop;
> +$node_primary->stop;

This doesn't verify if global objects are replayed correctly. Perhaps it'd
be better to use pg_dumpall?

Greetings,

Andres Freund
On Fri, Dec 10, 2021 at 12:58:01PM +1300, Thomas Munro wrote:
> -# required for 017_shm.pl
> +# required for 017_shm.pl and 027_stream_regress.pl
>  REGRESS_SHLIB=$(abs_top_builddir)/src/test/regress/regress$(DLSUFFIX)
>  export REGRESS_SHLIB

Hmm.  FWIW, I am working on doing something similar for pg_upgrade to
switch to TAP there, and we share a lot in terms of running pg_regress on
an existing cluster.  Wouldn't it be better to move this definition to
src/Makefile.global.in rather than src/test/recovery/?

My pg_regress command is actually very similar to yours, so I am wondering
if this would be better if somehow centralized, perhaps in Cluster.pm.
--
Michael
Attachment
On Wed, Dec 15, 2021 at 05:50:45PM +0900, Michael Paquier wrote:
> Hmm.  FWIW, I am working on doing similar for pg_upgrade to switch to
> TAP there, and we share a lot in terms of running pg_regress on an
> existing cluster.  Wouldn't it be better to move this definition to
> src/Makefile.global.in rather than src/test/recovery/?
>
> My pg_regress command is actually very similar to yours, so I am
> wondering if this would be better if somehow centralized, perhaps in
> Cluster.pm.

By the way, while I was sorting out my things, I have noticed that v4 does
not handle EXTRA_REGRESS_OPTS.  Is that wanted?  You could just add that
into your patch set and push the extra options to the pg_regress command:

    my $extra_opts_val = $ENV{EXTRA_REGRESS_OPTS} || "";
    my @extra_opts     = split(/\s+/, $extra_opts_val);
--
Michael
Attachment
On Wed, Dec 22, 2021 at 11:41 AM Thomas Munro <thomas.munro@gmail.com> wrote: > Rebased and updated based on feedback. Responses to multiple emails below: Pushed, but the build farm doesn't like it with a couple of different ways of failing. I'll collect some results and revert shortly.
On Sat, Jan 15, 2022 at 12:39 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Dec 22, 2021 at 11:41 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Rebased and updated based on feedback.  Responses to multiple emails below:
>
> Pushed, but the build farm doesn't like it with a couple of different
> ways of failing.  I'll collect some results and revert shortly.

Problems:

1.  The way I invoke pg_regress doesn't seem to work correctly under the
build farm client (though it works fine under make), not sure why yet, but
reproduced here and will figure it out tomorrow.

2.  The new test in src/test/modules/t/002_tablespace.pl apparently has
some path-related problem on Windows that I didn't know about, because CI
isn't even running the TAP tests under src/test/modules/test_misc (and
various other locations), but the BF is :-/  And I was happy because
modulescheck was passing...

I reverted the two commits responsible for those failures to keep the build
farm green, and I'll try to fix them tomorrow.
Hi,

On 2022-01-15 02:32:35 +1300, Thomas Munro wrote:
> 1. The way I invoke pg_regress doesn't seem to work correctly under
> the build farm client (though it works fine under make), not sure why
> yet, but reproduced here and will figure it out tomorrow.

I think it's just a problem of the buildfarm specifying port names in
extra_opts. E.g.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=eelpout&dt=2022-01-14%2011%3A49%3A36
has

# Checking port 58074
# Found port 58074
Name: primary
...
# Running: /home/tmunro/build-farm/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/regress/pg_regress --dlpath="/home/tmunro/build-farm/buildroot/HEAD/pgsql.build/src/test/regress" --bindir= --port=58074 --schedule=../regress/parallel_schedule --max-concurrent-tests=20 --inputdir=../regress --outputdir="/home/tmunro/build-farm/buildroot/HEAD/pgsql.build/src/test/recovery" --port=5678
(using postmaster on /tmp/1W6qVPVyCv, port 5678)

Note how there's both --port=58074 and --port=5678 in the pg_regress
invocation. The latter coming from EXTRA_REGRESS_OPTS, which the buildfarm
client sets.

The quickest fix would probably be to just move the 027_stream_regress.pl
added --port until after $extra_opts?

> 2. The new test in src/test/modules/t/002_tablespace.pl apparently has
> some path-related problem on Windows

This is the failure you're talking about?

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2022-01-14%2012%3A04%3A55

> that I didn't know about, because CI isn't even running the TAP tests under
> src/test/modules/test_misc (and various other locations), but the BF is :-/
> And I was happy because modulescheck was passing...

This we need to fix... But if you're talking about fairywren's failure,
it's more than not running some tests, it's that we do not test windows
mingw outside of cross compilation.

Greetings,

Andres Freund
On Sat, Jan 15, 2022 at 12:49 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-15 02:32:35 +1300, Thomas Munro wrote:
> > 1. The way I invoke pg_regress doesn't seem to work correctly under
> > the build farm client (though it works fine under make), not sure why
> > yet, but reproduced here and will figure it out tomorrow.
>
> I think it's just a problem of the buildfarm specifying port names in
> extra_opts. E.g.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=eelpout&dt=2022-01-14%2011%3A49%3A36
> has
>
> # Checking port 58074
> # Found port 58074
> Name: primary
> ...
> # Running: /home/tmunro/build-farm/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/regress/pg_regress --dlpath="/home/tmunro/build-farm/buildroot/HEAD/pgsql.build/src/test/regress" --bindir= --port=58074 --schedule=../regress/parallel_schedule --max-concurrent-tests=20 --inputdir=../regress --outputdir="/home/tmunro/build-farm/buildroot/HEAD/pgsql.build/src/test/recovery" --port=5678
> (using postmaster on /tmp/1W6qVPVyCv, port 5678)
>
> Note how there's both --port=58074 and --port=5678 in the pg_regress
> invocation. The latter coming from EXTRA_REGRESS_OPTS, which the buildfarm
> client sets.
>
> The quickest fix would probably be to just move the 027_stream_regress.pl
> added --port until after $extra_opts?

Thanks, I figured it was an environment variable biting me, and indeed it
was that one.  I reordered the arguments, tested locally under the
buildfarm client script, and pushed.  I'll keep an eye on the build farm.
One thing I noticed is that the pg_dump output files should really be rm'd
by the clean target; I may push something for that later.

> > 2. The new test in src/test/modules/t/002_tablespace.pl apparently has
> > some path-related problem on Windows
>
> This is the failure you're talking about?
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2022-01-14%2012%3A04%3A55
>
> > that I didn't know about, because CI isn't even running the TAP tests under
> > src/test/modules/test_misc (and various other locations), but the BF is :-/
> > And I was happy because modulescheck was passing...
>
> This we need to fix... But if you're talking about fairywren's failure,
> it's more than not running some tests, it's that we do not test windows
> mingw outside of cross compilation.

I'm temporarily stumped by complete ignorance of MSYS.  I tried the test on
plain old Windows/MSVC by cherry-picking the reverted commit d1511fe1 and
running

    .\src\tools\msvc\vcregress.bat taptest .\src\test\modules\test_misc

in my Windows 10 VM, and that passed with flying colours (so Windows CI
would have passed too, if we weren't ignoring TAP tests in unusual
locations, I assume).  I'll look into installing MSYS to work this out if
necessary, but it may take me a few days.

Here's how it failed on fairywren, in case someone knowledgeable of MSYS
path translation etc can spot the problem:

psql:<stdin>:1: ERROR:  directory "/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/modules/test_misc/tmp_check/t_002_tablespace_main_data/ts1" does not exist
not ok 1 - create tablespace with absolute path

I think that means chmod() failed with ENOENT.  That's weird, because the
.pl does:

+my $TS1_LOCATION = $node->basedir() . "/ts1";
+mkdir($TS1_LOCATION);
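The argument-reordering fix relies on a common command-line convention: when the same option appears twice, the last occurrence wins. pg_regress's own option handling is C getopt_long, but the behaviour can be sketched in Python (illustrative only, not pg_regress code), using the two --port values from the buildfarm log:

```python
import argparse

# Simulate an option given twice, as in the buildfarm log:
# --port=58074 (added by the TAP test) followed by --port=5678
# (appended from EXTRA_REGRESS_OPTS).  With a plain "store" action
# the later value overwrites the earlier one, so whichever side
# appends its --port last controls the port actually used.
parser = argparse.ArgumentParser()
parser.add_argument('--port')
args = parser.parse_args(['--port', '58074', '--port', '5678'])
print(args.port)  # → 5678
```

Hence the fix of moving the test's own --port to after $extra_opts: the test's value then comes last and takes effect regardless of what the buildfarm client injects.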
On Mon, Jan 17, 2022 at 05:25:19PM +1300, Thomas Munro wrote: > Here's how it failed on fairywren, in case someone knowledgeable of > MSYS path translation etc can spot the problem: > > psql:<stdin>:1: ERROR: directory > "/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/modules/test_misc/tmp_check/t_002_tablespace_main_data/ts1" > does not exist > not ok 1 - create tablespace with absolute path > > I think that means chmod() failed with ENOENT. That's weird, because > the .pl does: > > +my $TS1_LOCATION = $node->basedir() . "/ts1"; > +mkdir($TS1_LOCATION); You likely need a PostgreSQL::Test::Utils::perl2host() call. MSYS Perl understands Cygwin-style names like /home/... as well as Windows-style names, but this PostgreSQL configuration understands only Windows-style names.
On Mon, Jan 17, 2022 at 6:53 PM Noah Misch <noah@leadboat.com> wrote:
> On Mon, Jan 17, 2022 at 05:25:19PM +1300, Thomas Munro wrote:
> > Here's how it failed on fairywren, in case someone knowledgeable of
> > MSYS path translation etc can spot the problem:
>
> You likely need a PostgreSQL::Test::Utils::perl2host() call.  MSYS Perl
> understands Cygwin-style names like /home/... as well as Windows-style
> names, but this PostgreSQL configuration understands only Windows-style
> names.

Thanks.  I added that and pushed.  Let's see if fairywren likes it when it
comes back online.

I also learned that in the CI environment, node->basedir() is a path
containing an internal "." component (I mean "something/./something").  I
added a regex to collapse those, because they're unacceptable in Windows
junction point targets.  I'm aware that there is something happening in
another CF entry that might address that sort of thing[1], so then perhaps
I could remove the kludge.

I tested that with a throw-away change to .cirrus.yml, like so.  The CI
thread[2] is discussing a proper solution to these Windows CI blind spots.

     test_modules_script:
-      - perl src/tools/msvc/vcregress.pl modulescheck
+      - perl src/tools/msvc/vcregress.pl taptest ./src/test/modules/test_misc

[1] https://commitfest.postgresql.org/36/3331/
[2] https://www.postgresql.org/message-id/20220114235457.GQ14051%40telsasoft.com
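The "." collapsing mentioned above can be illustrated with a small sketch. This is Python for brevity (the actual kludge is a regex in the Perl test, whose exact pattern isn't shown in the thread); it just demonstrates the transformation being described:

```python
import re

def collapse_dot_components(path):
    # Collapse internal "/./" runs, which are rejected in Windows
    # junction point targets.  The "+" handles repeated components
    # like "a/././b" in a single substitution.
    return re.sub(r'/(\./)+', '/', path)

print(collapse_dot_components('build/./src/test/modules/./tmp_check'))
# → build/src/test/modules/tmp_check
```

Note this only strips no-op "." components; ".." components would change meaning and are deliberately left alone.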
On Fri, Jan 21, 2022 at 3:42 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Thanks. I added that and pushed. Let's see if fairywren likes it > when it comes back online. A watched pot never boils, but I wonder why Andrew's 4 Windows configurations jacana, bowerbird, fairywren and drongo have stopped returning results.
Hi,

On 2022-01-17 17:25:19 +1300, Thomas Munro wrote:
> I reordered the arguments, tested locally under the buildfarm client
> script, and pushed.  I'll keep an eye on the build farm.

After the reloptions fix the tests seem much more likely to succeed than
before. Progress!

Unfortunately we don't quite seem there yet. I saw a couple of failures
like:
https://api.cirrus-ci.com/v1/artifact/task/5394938773897216/regress_diffs/build/testrun/recovery/t/027_stream_regress/regression.diffs
(from https://cirrus-ci.com/task/5394938773897216?logs=check_world#L183 )

 -- Should succeed
 DROP TABLESPACE regress_tblspace_renamed;
+ERROR:  tablespace "regress_tblspace_renamed" is not empty

I assume the reason we see this semi-regularly when the regression tests
run as part of 027_stream_regress, but not in the main regression test run,
is similar to the reloptions problem, namely that we run with much smaller
shared buffers.

I assume what happens is that this just makes the known problem of bgwriter
or some other process keeping open a filehandle to an already-deleted
relation, preventing the deletion from "fully" taking effect, worse.

Greetings,

Andres Freund
On Sat, Jan 22, 2022 at 8:48 AM Andres Freund <andres@anarazel.de> wrote: > Unfortunately we don't quite seem there yet: > > I saw a couple failures like: > https://api.cirrus-ci.com/v1/artifact/task/5394938773897216/regress_diffs/build/testrun/recovery/t/027_stream_regress/regression.diffs > (from https://cirrus-ci.com/task/5394938773897216?logs=check_world#L183 ) > > -- Should succeed > DROP TABLESPACE regress_tblspace_renamed; > +ERROR: tablespace "regress_tblspace_renamed" is not empty > > > I assume the reason we see this semi-regularly when the regression tests run > as part of 027_stream_regress, but not in the main regression test run, is > similar to the reloptions problem, namely that we run with a much smaller > shared buffers. > > I assume what happens is that this just makes the known problem of bgwriter or > some other process keeping open a filehandle to an already deleted relation, > preventing the deletion to "fully" take effect, worse. Right, I assume this would be fixed by [1]. I need to re-convince myself of that patch's correctness and make some changes after Robert's feedback; I'll look into committing it next week. From a certain point of view it's now quite good that we hit this case occasionally in CI. [1] https://commitfest.postgresql.org/36/2962/
On 1/21/22 13:58, Thomas Munro wrote: > On Fri, Jan 21, 2022 at 3:42 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> Thanks. I added that and pushed. Let's see if fairywren likes it >> when it comes back online. > A watched pot never boils, but I wonder why Andrew's 4 Windows > configurations jacana, bowerbird, fairywren and drongo have stopped > returning results. I think I have unstuck both machines. I will keep an eye on them and make sure they don't get stuck again. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Sat, Jan 22, 2022 at 8:48 AM Andres Freund <andres@anarazel.de> wrote: > Unfortunately we don't quite seem there yet: And another way to fail: pg_dump: error: query failed: ERROR: canceling statement due to conflict with recovery https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-01-22%2003%3A06%3A42 Probably needs hot_standby_feedback on. Will adjust this soon. One more failure seen in today's crop was a "stats" failure on seawasp, which must be the well known pre-existing problem. (Probably just needs someone to rewrite the stats subsystem to use shared memory instead of UDP).
On 1/21/22 16:22, Andrew Dunstan wrote: > On 1/21/22 13:58, Thomas Munro wrote: >> On Fri, Jan 21, 2022 at 3:42 PM Thomas Munro <thomas.munro@gmail.com> wrote: >>> Thanks. I added that and pushed. Let's see if fairywren likes it >>> when it comes back online. >> A watched pot never boils, but I wonder why Andrew's 4 Windows >> configurations jacana, bowerbird, fairywren and drongo have stopped >> returning results. > > > I think I have unstuck both machines. I will keep an eye on them and > make sure they don't get stuck again. > > fairywren is not happy with the recovery tests still. I have noticed on a different setup that this test adds 11 minutes to the runtime of the recovery tests, effectively doubling it. The doubling is roughly true on faster setups, too. At least I would like a simple way to disable the test. I'm not very happy about that. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Hi, On 2022-01-27 15:27:17 -0500, Andrew Dunstan wrote: > fairywren is not happy with the recovery tests still. Any more details? > I have noticed on a different setup that this test adds 11 minutes to the > runtime of the recovery tests, effectively doubling it. The doubling is > roughly true on faster setups, too Does a normal regress run take roughly that long? Or is the problem that the 027_stream_regress.pl ends up defaulting to shared_buffers=1MB, causing lots of unnecessary IO? > . At least I would like a simple > way to disable the test. One thing we could do to speed up the overall runtime would be to move 027_stream_regress.pl to something numbered earlier. Combined with PROVE_FLAGS=-j2 that could at least run them in parallel with the rest of the recovery tests. Greetings, Andres Freund
On Fri, Jan 28, 2022 at 9:27 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> I have noticed on a different setup that this test adds 11 minutes to
> the runtime of the recovery tests, effectively doubling it. The doubling
> is roughly true on faster setups, too. At least I would like a simple
> way to disable the test.

Ouch, that's ... a lot. Some randomly selected times: ~20 seconds (dev
machines), ~40 seconds (Cirrus CI's Windows image), ~2-3 minutes (very cheap
cloud host accounts), ~3 minutes (my Raspberry Pi pinned onto two CPU cores),
~11 minutes (your Windows number). It would be good to understand why that's
such an outlier.

Re skipping, I've also been wondering about an exclusion list to skip parts of
the regression tests that don't really add recovery coverage but take
non-trivial time, like the join tests.
Hi, On 2022-01-27 12:47:08 -0800, Andres Freund wrote: > > I have noticed on a different setup that this test adds 11 minutes to the > > runtime of the recovery tests, effectively doubling it. The doubling is > > roughly true on faster setups, too > > Does a normal regress run take roughly that long? Or is the problem that the > 027_stream_regress.pl ends up defaulting to shared_buffers=1MB, causing lots > of unnecessary IO? In my msys install a normal regress run takes 57s, 027_stream_regress.pl takes 194s. It's *not* shared_buffers. Or any of the other postgresql.conf settings. As far as I can tell. < tries a bunch of things > ARGH. It's the utterly broken handling of refused connections on windows. The pg_regress invocation doesn't specify the host address, just the port. Now you might reasonably ask, why does that slow things down so much, rather than working or not working? The problem is that a tcp connect() on windows doesn't immediately fail when a connection establishment is rejected, but instead internally retries several times. Which takes 2s. The reason there are rejected connections without specifying the host is that Cluster.pm configures to listen to 127.0.0.1. But the default for libpq/psql is to try "localhost". Which name resolution returns first as ipv6 (i.e. ::1). Which takes 2s to fail, upon which libpq goes and tries 127.0.0.1, which works. That means every single psql started by 027_stream_regress.pl's pg_regress takes 2s. Which of course adds up... I'll go and sob in a corner. Greetings, Andres Freund
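The ordering behaviour described above is visible with a few lines of C: getaddrinfo() for "localhost" returns an ordered candidate list, and libpq-style code walks that list trying each address in turn, so an IPv6-first result means the ::1 attempt (and its 2s Windows retry loop) happens before 127.0.0.1 is ever tried. This is a sketch mirroring that loop, not libpq's actual code; the port number is arbitrary:

```c
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int
main(void)
{
	struct addrinfo hints, *res, *ai;
	char		buf[INET6_ADDRSTRLEN];

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;	/* ask for both IPv4 and IPv6 */
	hints.ai_socktype = SOCK_STREAM;

	if (getaddrinfo("localhost", "5432", &hints, &res) != 0)
	{
		fprintf(stderr, "resolution failed\n");
		return 1;
	}

	/* a libpq-style client tries each candidate in this order */
	for (ai = res; ai != NULL; ai = ai->ai_next)
	{
		const void *addr;

		if (ai->ai_family == AF_INET6)
			addr = &((struct sockaddr_in6 *) ai->ai_addr)->sin6_addr;
		else
			addr = &((struct sockaddr_in *) ai->ai_addr)->sin_addr;
		inet_ntop(ai->ai_family, addr, buf, sizeof(buf));
		printf("candidate: %s (%s)\n", buf,
			   ai->ai_family == AF_INET6 ? "IPv6" : "IPv4");
	}
	freeaddrinfo(res);
	return 0;
}
```

If ::1 prints before 127.0.0.1, a server listening only on 127.0.0.1 pays the refused-connection cost on every connection attempt.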
On 2022-01-27 14:03:51 -0800, Andres Freund wrote: > In my msys install a normal regress run takes 57s, 027_stream_regress.pl takes > 194s. > > That means every single psql started by 027_stream_regress.pl's pg_regress > takes 2s. Which of course adds up... Oh, forgot: After adding --host to the pg_regress invocation 027_stream_regress.pl takes 75s (from 194s before).
On 1/27/22 15:47, Andres Freund wrote: > Hi, > > On 2022-01-27 15:27:17 -0500, Andrew Dunstan wrote: >> fairywren is not happy with the recovery tests still. > Any more details? I'll go back and get some. > > >> I have noticed on a different setup that this test adds 11 minutes to the >> runtime of the recovery tests, effectively doubling it. The doubling is >> roughly true on faster setups, too > Does a normal regress run take roughly that long? Or is the problem that the > 027_stream_regress.pl ends up defaulting to shared_buffers=1MB, causing lots > of unnecessary IO? On crake (slowish fedora 34), a normal check run took 95s, and this test took 114s. On my windows test instance where I noticed this (w10, msys2/ucrt), check took 516s and this test took 685s. > > >> . At least I would like a simple >> way to disable the test. > One thing we could do to speed up the overall runtime would be to move > 027_stream_regress.pl to something numbered earlier. Combined with > PROVE_FLAGS=-j2 that could at least run them in parallel with the rest of the > recovery tests. > > Seems like a bandaid. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Fri, Jan 28, 2022 at 11:03 AM Andres Freund <andres@anarazel.de> wrote: > That means every single psql started by 027_stream_regress.pl's pg_regress > takes 2s. Which of course adds up... That is very surprising, thanks. Will fix. I've been experimenting with reusing psql sessions and backends for queries in TAP tests, since some Windows animals seem to take a significant fraction of a second *per query* due to forking and startup costs. ~100ms or whatever is nothing compared to that ~2000ms silliness, but it still adds up over thousands of queries. I'll post an experimental patch soon, but this discussion has given me the idea that pg_regress might ideally be able to reuse processes too, at least sometimes...
Hi,

On 2022-01-27 17:16:17 -0500, Andrew Dunstan wrote:
> On crake (slowish fedora 34), a normal check run took 95s, and this test
> took 114s.

That's roughly what I see on msys after the fix.

> On my windows test instance where I noticed this (w10,
> msys2/ucrt), check took 516s and this test took 685s.

Hm. That's both excruciatingly slow. Way way slower than what I see here,
also w10, msys2/ucrt. Any chance the test instance has windows defender
running, without a directory exclusion? I saw that trash performance to a
near standstill.

Does it get better with the attached patch?

I was confused why this didn't fail fatally on CI, which uses
PG_TEST_USE_UNIX_SOCKETS. I think the reason is that pg_regress' use of
PGHOST is busted, btw. It says it'll use PGHOST if --host isn't specified,
but it doesn't work:

	 * When testing an existing install, we honor existing environment
	 * variables, except if they're overridden by command line options.
	 */
	if (hostname != NULL)
	{
		setenv("PGHOST", hostname, 1);
		unsetenv("PGHOSTADDR");
	}

but hostname is initialized even in the existing-install case:

#if !defined(HAVE_UNIX_SOCKETS)
	use_unix_sockets = false;
#elif defined(WIN32)

	/*
	 * We don't use Unix-domain sockets on Windows by default, even if the
	 * build supports them.  (See comment at remove_temp() for a reason.)
	 * Override at your own risk.
	 */
	use_unix_sockets = getenv("PG_TEST_USE_UNIX_SOCKETS") ? true : false;
#else
	use_unix_sockets = true;
#endif

	if (!use_unix_sockets)
		hostname = "localhost";

Greetings,

Andres Freund
Attachment
On 1/27/22 15:47, Andres Freund wrote: > Hi, > > On 2022-01-27 15:27:17 -0500, Andrew Dunstan wrote: >> fairywren is not happy with the recovery tests still. > Any more details? (Not actually fairywren, but equivalent) It's hung at src/test/recovery/t/009_twophase.pl line 84: $psql_rc = $cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_1'"); This is an Amazon EC2 WS2019 instance, of type t3.large i.e. 8Gb of memory (not the same machine I reported test times from). Perhaps I need to test on another instance. Note though that when I tested with a ucrt64 build, including use of the ucrt64 perl/prove, the recovery test passed on an equivalent instance, so that's probably another reason to switch fairywren to using the ucrt64 environment. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Hi,

On 2022-01-27 14:36:32 -0800, Andres Freund wrote:
> > On my windows test instance where I noticed this (w10,
> > msys2/ucrt), check took 516s and this test took 685s.
>
> Hm. That's both excruciatingly slow. Way way slower than what I see here,
> also w10, msys2/ucrt. Any chance the test instance has windows defender
> running, without a directory exclusion? I saw that trash performance to a
> near standstill.

Could you post the regression test output with the timings? Unless it's AV, I
don't see why a windows VM with a moderate amount of memory should take that
long.

Do the test times get less bad if you use PG_TEST_USE_UNIX_SOCKETS=1
PG_REGRESS_SOCK_DIR: "c:/some-dir/"?

I see there are reports that the connection-timeout problem can be a lot worse
on windows, because several applications, e.g. docker, add additional names
for localhost. Are there any non-commented entries in
C:\Windows\System32\drivers\etc\hosts?

> Does it get better with the attached patch?

I pushed something like it now - seemed to be no reason to wait, given it
makes things less slow on my VM.

Greetings,

Andres Freund
Hi, On 2022-01-27 17:51:52 -0500, Andrew Dunstan wrote: > (Not actually fairywren, but equivalent) It's hung at > src/test/recovery/t/009_twophase.pl line 84: > > > $psql_rc = $cur_primary->psql('postgres', "COMMIT PREPARED > 'xact_009_1'"); That very likely is the socket-shutdown bug that lead to: commit 64b2c6507e5714b5c688b9c5cc551fbedb7b3b58 Author: Tom Lane <tgl@sss.pgh.pa.us> Date: 2022-01-25 12:17:40 -0500 Revert "graceful shutdown" changes for Windows, in back branches only. This reverts commits 6051857fc and ed52c3707, but only in the back branches. Further testing has shown that while those changes do fix some things, they also break others; in particular, it looks like walreceivers fail to detect walsender-initiated connection close reliably if the walsender shuts down this way. We'll keep trying to improve matters in HEAD, but it now seems unwise to push these changes into stable releases. Discussion: https://postgr.es/m/CA+hUKG+OeoETZQ=Qw5Ub5h3tmwQhBmDA=nuNO3KG=zWfUypFAw@mail.gmail.com If you apply that commit, does the problem go away? That's why I'd suggested to revert them in https://postgr.es/m/20220125023609.5ohu3nslxgoygihl%40alap3.anarazel.de > This is an Amazon EC2 WS2019 instance, of type t3.large i.e. 8Gb of > memory (not the same machine I reported test times from). Perhaps I need > to test on another instance. Note though that when I tested with a > ucrt64 build, including use of the ucrt64 perl/prove, the recovery test > passed on an equivalent instance, so that's probably another reason to > switch fairywren to using the ucrt64 environment. Without the revert I do get through the tests some of the time - imo likely that the hang isn't related to the specific msys/mingw environment. Greetings, Andres Freund
On Fri, Jan 28, 2022 at 12:03 PM Andres Freund <andres@anarazel.de> wrote: > Revert "graceful shutdown" changes for Windows, in back branches only. FTR I'm actively working on a fix for that one for master now (see that other thread where the POC survived Alexander's torture testing).
On 1/27/22 18:24, Thomas Munro wrote: > On Fri, Jan 28, 2022 at 12:03 PM Andres Freund <andres@anarazel.de> wrote: >> Revert "graceful shutdown" changes for Windows, in back branches only. > FTR I'm actively working on a fix for that one for master now (see > that other thread where the POC survived Alexander's torture testing). OK, good. A further data point on that: I am not seeing a recovery test hang or commit_ts test failure on real W10 machines, including jacana. I am only getting them on WS2019 VMs e.g. drongo/fairywren. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Sat, Jan 22, 2022 at 6:00 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, Jan 22, 2022 at 8:48 AM Andres Freund <andres@anarazel.de> wrote: > > Unfortunately we don't quite seem there yet: > > And another way to fail: > > pg_dump: error: query failed: ERROR: canceling statement due to > conflict with recovery > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-01-22%2003%3A06%3A42 > > Probably needs hot_standby_feedback on. Will adjust this soon. Seen again today on prairiedog. Erm, scratch that idea, HS feedback interferes with test results. I guess max_standby_streaming_delay should be increased to 'forever', like in the attached, since pg_dump runs for a very long time on prairiedog: 2022-02-01 04:47:59.294 EST [3670:15] 027_stream_regress.pl LOG: statement: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ ONLY ... 2022-02-01 04:49:09.881 EST [3683:2585] 027_stream_regress.pl ERROR: canceling statement due to conflict with recovery
Attachment
Hi, On 2022-02-02 13:59:56 +1300, Thomas Munro wrote: > Seen again today on prairiedog. Erm, scratch that idea, HS feedback > interferes with test results. It'd not be sufficient anyway, I think. E.g. autovacuum truncating a table would not be prevented by hs_f I think? > I guess max_standby_streaming_delay > should be increased to 'forever', like in the attached Seems reasonable. > , since pg_dump runs for a very long time on prairiedog: > 2022-02-01 04:47:59.294 EST [3670:15] 027_stream_regress.pl LOG: > statement: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ ONLY > ... > 2022-02-01 04:49:09.881 EST [3683:2585] 027_stream_regress.pl ERROR: > canceling statement due to conflict with recovery That, uh, seems slow. Is it perhaps waiting for a lock? Seems Cluster.pm::init() should add at least log_lock_waits... Greetings, Andres Freund
On Wed, Feb 2, 2022 at 2:14 PM Andres Freund <andres@anarazel.de> wrote: > On 2022-02-02 13:59:56 +1300, Thomas Munro wrote: > > 2022-02-01 04:47:59.294 EST [3670:15] 027_stream_regress.pl LOG: > > statement: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ ONLY > > ... > > 2022-02-01 04:49:09.881 EST [3683:2585] 027_stream_regress.pl ERROR: > > canceling statement due to conflict with recovery > > That, uh, seems slow. Is it perhaps waiting for a lock? Seems > Cluster.pm::init() should add at least log_lock_waits... I quoted the wrong lines, let me try that again this time for the same session, the one with pid 3683: 2022-02-01 04:48:38.352 EST [3683:15] 027_stream_regress.pl LOG: statement: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ ONLY ... 2022-02-01 04:49:09.881 EST [3683:2585] 027_stream_regress.pl ERROR: canceling statement due to conflict with recovery It looks like it's processing statements fairly consistently slowly through the whole period. Each non-trivial statement takes a bit under ~10ms, so it would make sense if by the time we've processed ~2.5k lines we've clocked up 30 seconds and a VACUUM replay whacks us.
Thomas Munro <thomas.munro@gmail.com> writes: > It looks like it's processing statements fairly consistently slowly > through the whole period. Each non-trivial statement takes a bit > under ~10ms, so it would make sense if by the time we've processed > ~2.5k lines we've clocked up 30 seconds and a VACUUM replay whacks us. This test is set up to time out after 30 seconds? We've long had an unofficial baseline that no timeouts under 180 seconds should be used in the buildfarm. regards, tom lane
Hi,

On February 1, 2022 6:11:24 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Thomas Munro <thomas.munro@gmail.com> writes:
>> It looks like it's processing statements fairly consistently slowly
>> through the whole period. Each non-trivial statement takes a bit
>> under ~10ms, so it would make sense if by the time we've processed
>> ~2.5k lines we've clocked up 30 seconds and a VACUUM replay whacks us.
>
>This test is set up to time out after 30 seconds? We've long had
>an unofficial baseline that no timeouts under 180 seconds should
>be used in the buildfarm.

30s is the default value of the streaming replay conflict timeout. After that
the startup process cancelled the session running pg_dump. So it's not an
intentional timeout in the test.

It's not surprising that pg_dump takes 30s on that old a machine. But more
than 2min still surprised me. Is that really to be expected?

- Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes:
> It's not surprising that pg_dump takes 30s on that old a machine. But more
> than 2min still surprised me. Is that really to be expected?

In the previous buildfarm run, that dump took just under 31s:

2022-01-31 14:21:10.358 EST [19325:1] [unknown] LOG: connection received: host=[local]
2022-01-31 14:21:10.367 EST [19325:2] [unknown] LOG: connection authorized: user=buildfarm database=regression application_name=027_stream_regress.pl
...
2022-01-31 14:21:41.139 EST [19325:2663] 027_stream_regress.pl LOG: disconnection: session time: 0:00:30.782 user=buildfarm database=regression host=[local]

In the failing run, we have:

2022-02-01 04:48:37.757 EST [3683:1] [unknown] LOG: connection received: host=[local]
2022-02-01 04:48:37.767 EST [3683:2] [unknown] LOG: connection authorized: user=buildfarm database=regression application_name=027_stream_regress.pl
...
2022-02-01 04:49:09.719 EST [3683:2584] 027_stream_regress.pl LOG: statement: COPY public.tenk1 (unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4) TO stdout;
2022-02-01 04:49:09.881 EST [3683:2585] 027_stream_regress.pl ERROR: canceling statement due to conflict with recovery
2022-02-01 04:49:09.881 EST [3683:2586] 027_stream_regress.pl DETAIL: User query might have needed to see row versions that must be removed.
2022-02-01 04:49:09.881 EST [3683:2587] 027_stream_regress.pl STATEMENT: COPY public.tenk1 (unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4) TO stdout;
2022-02-01 04:49:09.889 EST [3685:1] [unknown] LOG: connection received: host=[local]
2022-02-01 04:49:09.905 EST [3683:2588] 027_stream_regress.pl LOG: could not send data to client: Broken pipe
2022-02-01 04:49:09.905 EST [3683:2589] 027_stream_regress.pl ERROR: canceling statement due to conflict with recovery
2022-02-01 04:49:09.905 EST [3683:2590] 027_stream_regress.pl DETAIL: User query might have needed to see row versions that must be removed.
2022-02-01 04:49:09.906 EST [3683:2591] 027_stream_regress.pl FATAL: connection to client lost
2022-02-01 04:49:09.935 EST [3683:2592] 027_stream_regress.pl LOG: disconnection: session time: 0:00:32.179 user=buildfarm database=regression host=[local]

That's only a little over 30s. Where are you getting 2m from?

regards, tom lane
Thomas Munro <thomas.munro@gmail.com> writes: > Seen again today on prairiedog. Erm, scratch that idea, HS feedback > interferes with test results. I guess max_standby_streaming_delay > should be increased to 'forever', like in the attached, since pg_dump > runs for a very long time on prairiedog: FWIW, I'd vote for keeping a finite timeout, but making it say ten minutes. If the thing gets stuck for some reason, you don't really want the test waiting forever. (Some buildfarm animals have overall-test-time limits, but I think it's not the default, and the behavior when that gets hit is pretty unfriendly anyway -- you don't get any report of the run at all.) regards, tom lane
On Wed, Feb 2, 2022 at 3:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > Seen again today on prairiedog. Erm, scratch that idea, HS feedback > > interferes with test results. I guess max_standby_streaming_delay > > should be increased to 'forever', like in the attached, since pg_dump > > runs for a very long time on prairiedog: > > FWIW, I'd vote for keeping a finite timeout, but making it say > ten minutes. If the thing gets stuck for some reason, you don't > really want the test waiting forever. (Some buildfarm animals > have overall-test-time limits, but I think it's not the default, > and the behavior when that gets hit is pretty unfriendly anyway > -- you don't get any report of the run at all.) Ok, I've set it to 10 minutes. Thanks.
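For reference, the pushed change amounts to something like the following in the standby's configuration (the exact conf line is a sketch; the thread only states the value became 10 minutes):

```
# standby used by 027_stream_regress.pl
max_standby_streaming_delay = 600s   # was previously left at the 30s default
```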
Another failure under 027_stream_regress.pl: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2022-03-16%2005%3A58%3A05 vacuum ... FAILED 3463 ms I'll try to come up with the perl needed to see the regression.diffs next time...
On Sun, Mar 20, 2022 at 5:20 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Another failure under 027_stream_regress.pl: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2022-03-16%2005%3A58%3A05 > > vacuum ... FAILED 3463 ms > > I'll try to come up with the perl needed to see the regression.diffs > next time... Here's my proposed change to achieve that. Here's an example of where it shows up if it fails (from my deliberately sabotaged CI run https://cirrus-ci.com/build/6730380228165632 where I was verifying that it also works on Windows): Unix: https://api.cirrus-ci.com/v1/artifact/task/5421419923243008/log/src/test/recovery/tmp_check/log/regress_log_027_stream_regress Windows: https://api.cirrus-ci.com/v1/artifact/task/4717732481466368/log/src/test/recovery/tmp_check/log/regress_log_027_stream_regress
Attachment
On 3/20/22 05:36, Thomas Munro wrote: > On Sun, Mar 20, 2022 at 5:20 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> Another failure under 027_stream_regress.pl: >> >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2022-03-16%2005%3A58%3A05 >> >> vacuum ... FAILED 3463 ms >> >> I'll try to come up with the perl needed to see the regression.diffs >> next time... > Here's my proposed change to achieve that. I think that's OK. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Mon, Mar 21, 2022 at 2:34 AM Andrew Dunstan <andrew@dunslane.net> wrote: > On 3/20/22 05:36, Thomas Munro wrote: > > On Sun, Mar 20, 2022 at 5:20 PM Thomas Munro <thomas.munro@gmail.com> wrote: > >> I'll try to come up with the perl needed to see the regression.diffs > >> next time... > > Here's my proposed change to achieve that. > > I think that's OK. Thanks for looking! Pushed.
Hi,

On Mon, Mar 21, 2022 at 5:45 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Mon, Mar 21, 2022 at 2:34 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> > On 3/20/22 05:36, Thomas Munro wrote:
> > > On Sun, Mar 20, 2022 at 5:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > >> I'll try to come up with the perl needed to see the regression.diffs
> > >> next time...
> > > Here's my proposed change to achieve that.
> >
> > I think that's OK.
>
> Thanks for looking! Pushed.

FYI idiacanthus failed 027_stream_regress.pl:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2022-03-22%2001%3A58%3A04

The log shows:

=== dumping /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/src/test/recovery/tmp_check/regression.diffs ===
diff -U3 /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql/src/test/regress/expected/vacuum.out /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/src/test/recovery/tmp_check/results/vacuum.out
--- /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql/src/test/regress/expected/vacuum.out	2021-07-01 19:00:01.936659446 +0200
+++ /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/src/test/recovery/tmp_check/results/vacuum.out	2022-03-22 03:28:09.813377179 +0100
@@ -181,7 +181,7 @@
 SELECT pg_relation_size('vac_truncate_test') = 0;
  ?column?
 ----------
- t
+ f
 (1 row)

 VACUUM (TRUNCATE FALSE, FULL TRUE) vac_truncate_test;
=== EOF ===
not ok 2 - regression tests pass

Regards,

-- 
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Mar 22, 2022 at 4:31 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > SELECT pg_relation_size('vac_truncate_test') = 0; > ?column? > ---------- > - t > + f Thanks. Ahh, déjà vu... this probably needs the same treatment as b700f96c and 3414099c provided for the reloptions test. Well, at least the first one. Here's a patch like that.
Attachment
On Mon, Mar 21, 2022 at 8:59 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Thanks. Ahh, déjà vu... this probably needs the same treatment as > b700f96c and 3414099c provided for the reloptions test. Well, at > least the first one. Here's a patch like that. If you want to know whether or not the buildfarm will have problems due to VACUUM failing to get a cleanup lock randomly, then I suggest that you use an approach like the one from my patch here: https://postgr.es/m/CAH2-WzkiB-qcsBmWrpzP0nxvrQExoUts1d7TYShg_DrkOHeg4Q@mail.gmail.com I recently tried it again myself. With the gizmo in place the tests fail in exactly the same way you've had problems with on the buildfarm. On the first try, even. -- Peter Geoghegan
On Fri, Mar 25, 2022 at 4:03 PM Peter Geoghegan <pg@bowt.ie> wrote: > If you want to know whether or not the buildfarm will have problems > due to VACUUM failing to get a cleanup lock randomly, then I suggest > that you use an approach like the one from my patch here: > > https://postgr.es/m/CAH2-WzkiB-qcsBmWrpzP0nxvrQExoUts1d7TYShg_DrkOHeg4Q@mail.gmail.com > > I recently tried it again myself. With the gizmo in place the tests > fail in exactly the same way you've had problems with on the > buildfarm. On the first try, even. Interesting. IIUC your chaos gizmo shows that particular vacuum test still failing on master, but that wouldn't happen in real life because since 383f2221 it's a temp table. Your gizmo should probably detect temp rels, as your comment says. I was sort of thinking that perhaps if DISABLE_PAGE_SKIPPING is eventually made to do what its name sounds like it does, we could remove TEMP from that test and it'd still pass with the gizmo...
On Thu, Mar 24, 2022 at 8:56 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Interesting. IIUC your chaos gizmo shows that particular vacuum test > still failing on master, but that wouldn't happen in real life because > since 383f2221 it's a temp table. Your gizmo should probably detect > temp rels, as your comment says. I was sort of thinking that perhaps > if DISABLE_PAGE_SKIPPING is eventually made to do what its name sounds > like it does, we could remove TEMP from that test and it'd still pass > with the gizmo... Why not just use VACUUM FREEZE? That should work, because it won't settle for a cleanup lock on any page with an XID < OldestXmin. And even if there were only LP_DEAD items on a page, that wouldn't matter either, because we don't need a cleanup lock to get rid of those anymore. And we consistently do all the same steps for rel truncation in the no-cleanup-lock path (lazy_scan_noprune) now. I think that DISABLE_PAGE_SKIPPING isn't appropriate for this kind of thing. It mostly just makes VACUUM not trust the visibility map, which isn't going to help. While DISABLE_PAGE_SKIPPING also forces aggressive mode, that isn't going to help either, unless you somehow also make sure that FreezeLimit is OldestXmin (e.g. by setting vacuum_freeze_min_age to 0). VACUUM FREEZE (without DISABLE_PAGE_SKIPPING) seems like it would do everything you want, without using a temp table. At least on the master branch. -- Peter Geoghegan
Hi, On 2022-03-24 21:06:21 -0700, Peter Geoghegan wrote: > On Thu, Mar 24, 2022 at 8:56 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > Interesting. IIUC your chaos gizmo shows that particular vacuum test > > still failing on master, but that wouldn't happen in real life because > > since 383f2221 it's a temp table. Your gizmo should probably detect > > temp rels, as your comment says. I was sort of thinking that perhaps > > if DISABLE_PAGE_SKIPPING is eventually made to do what its name sounds > > like it does, we could remove TEMP from that test and it'd still pass > > with the gizmo... > > Why not just use VACUUM FREEZE? That should work, because it won't > settle for a cleanup lock on any page with an XID < OldestXmin. And > even if there were only LP_DEAD items on a page, that wouldn't matter > either, because we don't need a cleanup lock to get rid of those > anymore. And we consistently do all the same steps for rel truncation > in the no-cleanup-lock path (lazy_scan_noprune) now. > > I think that DISABLE_PAGE_SKIPPING isn't appropriate for this kind of > thing. It mostly just makes VACUUM not trust the visibility map, which > isn't going to help. While DISABLE_PAGE_SKIPPING also forces > aggressive mode, that isn't going to help either, unless you somehow > also make sure that FreezeLimit is OldestXmin (e.g. by setting > vacuum_freeze_min_age to 0). > > VACUUM FREEZE (without DISABLE_PAGE_SKIPPING) seems like it would do > everything you want, without using a temp table. At least on the > master branch. We tried that in a prior case: https://postgr.es/m/20220120052404.sonrhq3f3qgplpzj%40alap3.anarazel.de I don't know if the same danger applies here though. Greetings, Andres Freund
On Fri, Mar 25, 2022 at 5:16 PM Andres Freund <andres@anarazel.de> wrote: > On 2022-03-24 21:06:21 -0700, Peter Geoghegan wrote: > > VACUUM FREEZE (without DISABLE_PAGE_SKIPPING) seems like it would do > > everything you want, without using a temp table. At least on the > > master branch. > > We tried that in a prior case: > https://postgr.es/m/20220120052404.sonrhq3f3qgplpzj%40alap3.anarazel.de Yeah, or really, it was Michael that tried that in commit fe246d1c, and then we tried more things with 3414099c and b700f96c. It's a bit of a belt-and-braces setup admittedly...
On Thu, Mar 24, 2022 at 9:16 PM Andres Freund <andres@anarazel.de> wrote: > > VACUUM FREEZE (without DISABLE_PAGE_SKIPPING) seems like it would do > > everything you want, without using a temp table. At least on the > > master branch. > > We tried that in a prior case: > https://postgr.es/m/20220120052404.sonrhq3f3qgplpzj%40alap3.anarazel.de Oh, yeah. If some other backend is holding back OldestXmin, and you can't find a way of dealing with that, then you'll need a temp table. (Mind you, that trick only works on recent versions too.) -- Peter Geoghegan
cfbot found another source of nondeterminism in the regression tests, due to the smaller shared_buffers used in this TAP test: https://cirrus-ci.com/task/4611828654276608 https://api.cirrus-ci.com/v1/artifact/task/4611828654276608/log/src/test/recovery/tmp_check/regression.diffs Turned out that we had already diagnosed that once before, when tiny build farm animal chipmunk reported the same, but we didn't commit a fix: https://www.postgresql.org/message-id/flat/CA%2BhUKGLTK6ZuEkpeJ05-MEmvmgZveCh%2B_w013m7%2ByKWFSmRcDA%40mail.gmail.com
Thomas Munro <thomas.munro@gmail.com> writes: > cfbot found another source of nondeterminism in the regression tests, > due to the smaller shared_buffers used in this TAP test: This failure seems related but not identical: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=myna&dt=2022-04-02%2004%3A00%3A26 portals.out is expecting that the "foo25ns" cursor will read starting at the beginning of tenk1, but it's starting somewhere else, which presumably is a syncscan effect. I think the fundamental instability here is that this TAP test is setting shared_buffers small enough to allow the syncscan logic to kick in where it does not in normal testing. Maybe we should just disable syncscan in this test script? regards, tom lane
I wrote: > I think the fundamental instability here is that this TAP test is > setting shared_buffers small enough to allow the syncscan logic > to kick in where it does not in normal testing. Maybe we should > just disable syncscan in this test script? Did that, we'll see how much it helps. regards, tom lane
On Thu, Dec 09, 2021 at 12:10:23PM +1300, Thomas Munro wrote:
> This adds 2 whole minutes to the recovery check, when running with the
> Windows serial-only scripts on Cirrus CI (using Andres's CI patches).
> For Linux it adds ~20 seconds to the total of -j8 check-world.
> Hopefully that's time well spent, because it adds test coverage for
> all the redo routines, and hopefully soon we won't have to run 'em in
> series on Windows.

Should 027-stream-regress be renamed to something that starts earlier?
Off-list earlier this year, Andres referred to 000.

-- 
Justin
On Thu, Aug 4, 2022 at 3:30 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Thu, Dec 09, 2021 at 12:10:23PM +1300, Thomas Munro wrote: > > This adds 2 whole minutes to the recovery check, when running with the > > Windows serial-only scripts on Cirrus CI (using Andres's CI patches). > > For Linux it adds ~20 seconds to the total of -j8 check-world. > > Hopefully that's time well spent, because it adds test coverage for > > all the redo routines, and hopefully soon we won't have to run 'em in > > series on Windows. > > Should 027-stream-regress be renamed to something that starts earlier ? > Off-list earlier this year, Andres referred to 000. Do you have any data on improved times from doing that? I have wondered about moving it into 001_stream_rep.pl.
On Thu, Aug 04, 2022 at 09:24:24AM +1200, Thomas Munro wrote:
> On Thu, Aug 4, 2022 at 3:30 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Thu, Dec 09, 2021 at 12:10:23PM +1300, Thomas Munro wrote:
> > > This adds 2 whole minutes to the recovery check, when running with the
> > > Windows serial-only scripts on Cirrus CI (using Andres's CI patches).
> > > For Linux it adds ~20 seconds to the total of -j8 check-world.
> > > Hopefully that's time well spent, because it adds test coverage for
> > > all the redo routines, and hopefully soon we won't have to run 'em in
> > > series on Windows.
> >
> > Should 027-stream-regress be renamed to something that starts earlier?
> > Off-list earlier this year, Andres referred to 000.

See also:
https://www.postgresql.org/message-id/20220213220709.vjz5rziuhfdpqxrg@alap3.anarazel.de

> Do you have any data on improved times from doing that?
>
> I have wondered about moving it into 001_stream_rep.pl.

The immediate motive for raising the question is due to working on your
cygwin patch (where I've set PROVE_FLAGS=-j3). The last invocation I have
opened ends like:

[20:46:47.577] [13:46:47] t/026_overwrite_contrecord.pl ........ ok 10264 ms ( 0.02 usr 0.02 sys + 11.25 cusr 35.95 csys = 47.24 CPU)
[20:47:08.087] [13:47:08] t/028_pitr_timelines.pl .............. ok 13153 ms ( 0.00 usr 0.00 sys + 4.03 cusr 14.79 csys = 18.82 CPU)
[20:47:08.999] [13:47:09] t/029_stats_restart.pl ............... ok 12631 ms ( 0.00 usr 0.02 sys + 7.40 cusr 23.30 csys = 30.71 CPU)
[20:47:34.353] [13:47:34] t/031_recovery_conflict.pl ........... ok 11337 ms ( 0.00 usr 0.00 sys + 3.84 cusr 11.82 csys = 15.66 CPU)
[20:47:35.070] [13:47:35] t/030_stats_cleanup_replica.pl ....... ok 14054 ms ( 0.02 usr 0.00 sys + 7.64 cusr 25.02 csys = 32.68 CPU)
[20:48:04.887] [13:48:04] t/032_relfilenode_reuse.pl ........... ok 12755 ms ( 0.00 usr 0.00 sys + 3.36 cusr 11.57 csys = 14.93 CPU)
[20:48:42.055] [13:48:42] t/033_replay_tsp_drops.pl ............ ok 43529 ms ( 0.00 usr 0.00 sys + 12.29 cusr 41.43 csys = 53.71 CPU)
[20:50:02.770] [13:50:02] t/027_stream_regress.pl .............. ok 198408 ms ( 0.02 usr 0.06 sys + 44.92 cusr 142.42 csys = 187.42 CPU)
[20:50:02.771] [13:50:02]
[20:50:02.771] All tests successful.
[20:50:02.771] Files=33, Tests=411, 402 wallclock secs ( 0.16 usr 0.27 sys + 138.03 cusr 441.56 csys = 580.01 CPU)

If 027 had been started sooner, this test might have finished up to 78 sec
earlier. If lots of tests are added in the future, maybe it won't matter,
but it seems like it does now.

As I understand, checks are usually parallelized by "make -j" and not by
"prove". In that case, starting a slow test later doesn't matter. But it'd
be better for anyone who runs tap tests manually, and (I think) for meson.

As a one-off test on localhost:

time make check -C src/test/recovery => 11m42,790s
time make check -C src/test/recovery PROVE_FLAGS=-j2 => 7m56,315s

After renaming it to 001:

time make check -C src/test/recovery => 11m33,887s (~same)
time make check -C src/test/recovery PROVE_FLAGS=-j2 => 6m59,969s

I don't know how it affects the buildfarm (but I think that's not optimized
primarily for speed anyway).

-- 
Justin