Thread: slowest tap tests - split or accelerate?
Hi,

cfbot now runs most tests on windows, the windows task is by far the slowest, and the task limited most in concurrency [2]. Running tap tests is the biggest part of that. This is a bigger issue on windows because we don't have infrastructure (yet) to run tests in parallel.

There are a few tests which stand out in their slowness, which seem worth addressing even if we tackle test parallelism on windows at some point. I often find them to be the slowest tests on linux too.

Picking a random successful cfbot run [1] I see the following tap tests taking more than 20 seconds:

67188 ms pg_basebackup   t/010_pg_basebackup.pl
59710 ms recovery        t/001_stream_rep.pl
57542 ms pg_rewind       t/001_basic.pl
56179 ms subscription    t/001_rep_changes.pl
42146 ms pgbench         t/001_pgbench_with_server.pl
38264 ms recovery        t/018_wal_optimize.pl
33642 ms subscription    t/013_partition.pl
29129 ms pg_dump         t/002_pg_dump.pl
25751 ms pg_verifybackup t/002_algorithm.pl
20628 ms subscription    t/011_generated.pl

It would be good if we could make those tests faster, or if not easily possible, at least split those tests into smaller tap tests. Splitting a longer test into smaller ones is preferable even if they take the same time, because we can use prove-level concurrency on windows to gain some test parallelism. As a nice side-effect it also makes it quicker to run a split test in isolation during development.

Greetings,

Andres Freund

[1] https://cirrus-ci.com/task/5207126145499136
[2] https://cirrus-ci.org/faq/#are-there-any-limits
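As a concrete illustration of the prove-level concurrency mentioned above: prove parallelizes across test files, not within one, which is why splitting helps. A sketch of how it can be requested (the directory and -j value are arbitrary examples):

  # run one directory's TAP tests with four parallel prove workers
  make -C src/bin/pg_basebackup check PROVE_FLAGS='-j4'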
Hi,

On 2021-12-31 11:25:28 -0800, Andres Freund wrote:
> cfbot now runs most tests on windows, the windows task is by far the slowest,
> and the task limited most in concurrency [2]. Running tap tests is the
> biggest part of that. This is a bigger issue on windows because we don't have
> infrastructure (yet) to run tests in parallel.
>
> There are a few tests which stand out in their slowness, which seem worth
> addressing even if we tackle test parallelism on windows at some point. I
> often find them to be the slowest tests on linux too.
>
> Picking a random successful cfbot run [1] I see the following tap tests taking
> more than 20 seconds:
>
> 67188 ms pg_basebackup   t/010_pg_basebackup.pl
> 25751 ms pg_verifybackup t/002_algorithm.pl

The reason these in particular are slow is that they do a lot of pg_basebackups without -cfast and/or --no-sync. The lack of -cfast in particular is responsible for a significant proportion of the test time. The only reason this didn't cause the tests to take many minutes is that spread checkpoints only throttle when writing out a buffer, and there aren't that many dirty buffers...

Attached is a patch changing the parameters in all the instances I found. Testing on a local instance, it about halves the runtime of t/010_pg_basebackup.pl on linux and windows (but there's still a 2x time difference between the two); the improvement is smaller when running the tests concurrently on CI.

It might be worth having one explicit use of -cspread. Perhaps combined with an explicit checkpoint beforehand?

Greetings,

Andres Freund
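A minimal sketch of the kind of change the attached patch makes in the TAP tests; the node variable, backup path, and test name here are illustrative, not copied from the patch:

  # request an immediate checkpoint and skip fsync - the test only cares
  # about the contents of the resulting backup directory
  $node->command_ok(
      [ 'pg_basebackup', '-D', "$tempdir/backup", '-cfast', '--no-sync' ],
      'pg_basebackup with fast checkpoint and no fsync');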
Attachment
On Mon, Jan 17, 2022 at 1:41 PM Andres Freund <andres@anarazel.de> wrote:
> The reason these in particular are slow is that they do a lot of
> pg_basebackups without -cfast and/or --no-sync. The lack of -cfast
> in particular is responsible for a significant proportion of the test
> time. The only reason this didn't cause the tests to take many minutes is
> that spread checkpoints only throttle when writing out a buffer, and there
> aren't that many dirty buffers...

Adding -cfast to 002_algorithm.pl seems totally reasonable. I'm not sure what else can realistically be done to speed it up without losing the point of the test. And it's basically just a single loop, so splitting it up doesn't seem to make a lot of sense either.

pg_basebackup's 010_pg_basebackup.pl looks like it could be split up, though. That one, at least to me, looks like people have just kept adding semi-related things into the same test file.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2022-01-17 14:05:17 -0500, Robert Haas wrote:
> On Mon, Jan 17, 2022 at 1:41 PM Andres Freund <andres@anarazel.de> wrote:
> > The reason these in particular are slow is that they do a lot of
> > pg_basebackups without -cfast and/or --no-sync. The lack of -cfast
> > in particular is responsible for a significant proportion of the test
> > time. The only reason this didn't cause the tests to take many minutes is
> > that spread checkpoints only throttle when writing out a buffer, and there
> > aren't that many dirty buffers...
>
> Adding -cfast to 002_algorithm.pl seems totally reasonable. I'm not
> sure what else can realistically be done to speed it up without losing
> the point of the test. And it's basically just a single loop, so
> splitting it up doesn't seem to make a lot of sense either.

It's also not that slow compared to other tests after the -cfast addition. However, I'm a bit surprised at how long the individual pg_verifybackup calls take on windows - about as long as the pg_basebackup call itself.

# Running: pg_basebackup -D C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/sha224 --manifest-checksums sha224 --no-sync -cfast
# timing: [4.798 + 0.704s]: completed
# Running: pg_verifybackup -e C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/sha224
backup successfully verified
# timing: [5.507 + 0.697s]: completed

Interestingly, with crc32c, this is not so:

# Running: pg_basebackup -D C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/crc32c --manifest-checksums crc32c --no-sync -cfast
# timing: [3.500 + 0.688s]: completed
ok 5 - backup ok with algorithm "crc32c"
ok 6 - crc32c is mentioned many times in the manifest
# Running: pg_verifybackup -e C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/crc32c
backup successfully verified
# timing: [4.194 + 0.197s]: completed

I wonder if there's something explaining why pg_verifybackup is greatly slowed down by sha224 but not crc32c, but the server's runtime only differs by ~20ms? It seems incongruous that pg_basebackup, with all the complexity of needing to communicate with the server, transferring the backup over network, and *also* computing checksums, takes as long as the pg_verifybackup invocation?

> pg_basebackup's 010_pg_basebackup.pl looks like it could be split up,
> though. That one, at least to me, looks like people have just kept
> adding semi-related things into the same test file.

Yea. It's generally interesting how much time initdb takes in these tests. It's about 1.1s on my linux workstation, and 2.1s on windows. I've occasionally pondered caching initdb results and reusing them across tests - just the locking around it seems a bit nasty, but perhaps that could be done as part of the tmp_install step. Of course, it'd need to deal with different options etc...

Greetings,

Andres Freund
On Mon, Jan 17, 2022 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> I wonder if there's something explaining why pg_verifybackup is greatly slowed
> down by sha224 but not crc32c, but the server's runtime only differs by ~20ms?
> It seems incongruous that pg_basebackup, with all the complexity of needing to
> communicate with the server, transferring the backup over network, and *also*
> computing checksums, takes as long as the pg_verifybackup invocation?

I guess there must be something explaining it, but I don't know what it could be. The client and the server are each running the checksum algorithm over the same data. If that's not the same speed then .... I don't get it. Unless, somehow, they're using different implementations of it?

> I've occasionally pondered caching initdb results and reusing them across
> tests - just the locking around it seems a bit nasty, but perhaps that could
> be done as part of the tmp_install step. Of course, it'd need to deal with
> different options etc...

It's a thought, but it does seem like a bit of a pain to implement.

--
Robert Haas
EDB: http://www.enterprisedb.com
Andres Freund <andres@anarazel.de> writes:
> I've occasionally pondered caching initdb results and reusing them across
> tests - just the locking around it seems a bit nasty, but perhaps that could
> be done as part of the tmp_install step. Of course, it'd need to deal with
> different options etc...

I'd actually built a prototype to do that, based on making a reference cluster and then "cp -a"'ing it instead of re-running initdb. I gave up when I found that on slower, disk-bound machines it was hardly any faster.

Thinking about it now, I wonder why not just re-use one cluster for many tests, only dropping and re-creating the database in which the testing happens.

regards, tom lane
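A rough sketch of that idea in TAP-test terms (hypothetical; no such shared-cluster infrastructure exists in the tree):

  # keep one long-lived cluster; each test only recreates its own database
  $node->safe_psql('postgres', 'DROP DATABASE IF EXISTS regress_db');
  $node->safe_psql('postgres', 'CREATE DATABASE regress_db');
  # ... run the test against regress_db, leaving the cluster itself untouched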
Hi,

On 2022-01-17 15:13:57 -0500, Robert Haas wrote:
> I guess there must be something explaining it, but I don't know what
> it could be. The client and the server are each running the checksum
> algorithm over the same data. If that's not the same speed then .... I
> don't get it. Unless, somehow, they're using different implementations
> of it?

I think that actually might be the issue. On linux a test of pg_verifybackup was much faster than on windows (as in 10x). But if I disable openssl, it's only 2x. On the windows instance I *do* have openssl enabled. But I suspect something is off and the windows buildsystem ends up with our hand-rolled implementation on the client side, but not the server side. Which'd explain the times I'm seeing: we have a fast CRC implementation, but the rest is pretty darn unoptimized.

Greetings,

Andres Freund
On Mon, Jan 17, 2022 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> > pg_basebackup's 010_pg_basebackup.pl looks like it could be split up,
> > though. That one, at least to me, looks like people have just kept
> > adding semi-related things into the same test file.
>
> Yea.

Here's a patch that splits up that file. Essentially the first half of the file is concerned with testing that a backup ends up in the state it expects, while the second half is concerned with checking that various options to pg_basebackup work. So I split it that way, plus I moved some of the really basic stuff to a completely separate file with a very brief runtime.

The test results are interesting. Unpatched:

[12:33:33] t/010_pg_basebackup.pl ... ok    16161 ms ( 0.02 usr  0.00 sys +  2.07 cusr  7.80 csys =  9.89 CPU)
[12:33:49] t/020_pg_receivewal.pl ... ok     4115 ms ( 0.00 usr  0.00 sys +  0.89 cusr  1.73 csys =  2.62 CPU)
[12:33:53] t/030_pg_recvlogical.pl .. ok     1857 ms ( 0.01 usr  0.01 sys +  0.63 cusr  0.73 csys =  1.38 CPU)
[12:33:55]
All tests successful.
Files=3, Tests=177, 22 wallclock secs ( 0.04 usr  0.02 sys +  3.59 cusr 10.26 csys = 13.91 CPU)

Patched:

[12:32:03] t/010_pg_basebackup_basic.pl ...... ok      192 ms ( 0.01 usr  0.00 sys +  0.10 cusr  0.05 csys =  0.16 CPU)
[12:32:03] t/011_pg_basebackup_integrity.pl .. ok     5530 ms ( 0.00 usr  0.00 sys +  0.87 cusr  2.51 csys =  3.38 CPU)
[12:32:09] t/012_pg_basebackup_options.pl .... ok    13117 ms ( 0.00 usr  0.00 sys +  1.87 cusr  6.31 csys =  8.18 CPU)
[12:32:22] t/020_pg_receivewal.pl ............ ok     4314 ms ( 0.01 usr  0.00 sys +  0.97 cusr  1.77 csys =  2.75 CPU)
[12:32:26] t/030_pg_recvlogical.pl ........... ok     1908 ms ( 0.00 usr  0.00 sys +  0.64 cusr  0.77 csys =  1.41 CPU)
[12:32:28]
All tests successful.
Files=5, Tests=177, 25 wallclock secs ( 0.04 usr  0.02 sys +  4.45 cusr 11.41 csys = 15.92 CPU)

Sadly, we've gained about 2.5 seconds of runtime as the price of splitting the test. Arguably the options part could be split up a lot more finely than this, but that would drive up the runtime even more, basically because we'd need more initdbs. So I don't know whether it's better to leave things as they are, split them this much, or split them more. I think this amount of splitting might be justified simply in the interests of clarity, but I'm reluctant to go further unless we get some nifty initdb-caching system.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
Hi,

On 2022-01-18 12:49:16 -0500, Robert Haas wrote:
> Here's a patch that splits up that file.

Ah, nice! The split seems sensible to me.

> Sadly, we've gained about 2.5 seconds of runtime as the price of
> splitting the test. Arguably the options part could be split up a lot
> more finely than this, but that would drive up the runtime even more,
> basically because we'd need more initdbs. So I don't know whether it's
> better to leave things as they are, split them this much, or split
> them more. I think this amount of splitting might be justified simply
> in the interests of clarity, but I'm reluctant to go further unless we
> get some nifty initdb-caching system.

Hm. From the buildfarm / CF perspective it might still be a win, because the different pieces can run concurrently. But it's not great :(.

Maybe we really should do at least the most simplistic caching for initdbs, by doing one initdb as part of the creation of temp_install. Then Cluster::init would need logic to only use that if $params{extra} is empty.

Greetings,

Andres Freund
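A minimal sketch of what that check in Cluster::init (PostgresNode::init at the time) could look like; the environment variable name and the use of RecursiveCopy are assumptions for illustration, not committed code:

  # use the pre-built data directory only for a plain default initdb
  if (!defined $params{extra}
      && defined $ENV{INITDB_TEMPLATE}
      && -d $ENV{INITDB_TEMPLATE})
  {
      # cheap path: copy the template produced while building temp_install
      rmdir $pgdata;
      RecursiveCopy::copypath($ENV{INITDB_TEMPLATE}, $pgdata);
      chmod 0700, $pgdata;
  }
  else
  {
      # any non-default option still requires a real initdb run
      TestLib::system_or_bail('initdb', '-D', $pgdata, '-A', 'trust', '-N',
          @{ $params{extra} });
  }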
Hi,

On 2022-01-18 13:40:40 -0800, Andres Freund wrote:
> Maybe we really should do at least the most simplistic caching for initdbs, by
> doing one initdb as part of the creation of temp_install. Then Cluster::init
> would need logic to only use that if $params{extra} is empty.

I hacked this together. And the wins are bigger than I thought. On my workstation, with plenty of cpu and storage bandwidth, according to /usr/bin/time, check-world NO_TEMP_INSTALL=1 goes from

321.56user 74.00system 2:26.22elapsed 270%CPU (0avgtext+0avgdata 93768maxresident)k
24inputs+32781336outputs (2254major+8717121minor)pagefaults 0swaps

to

86.62user 57.10system 1:57.83elapsed 121%CPU (0avgtext+0avgdata 93752maxresident)k
8inputs+32683408outputs (1360major+6672618minor)pagefaults 0swaps

The difference in elapsed and system time is pretty good, but the user time difference is quite staggering.

This doesn't yet actually address the case of the basebackup tests, because that test specifies a non-default option, preventing the use of the template initdb. But the effects are already big enough that I thought it's worth sharing.

On CI for windows this reduces the time for the subscription tests from 3:24 to 2:39. There's some run-to-run variation, but it's a pretty clear signal...

Greetings,

Andres Freund
Attachment
Andres Freund <andres@anarazel.de> writes:
> On 2022-01-18 13:40:40 -0800, Andres Freund wrote:
>> Maybe we really should do at least the most simplistic caching for initdbs, by
>> doing one initdb as part of the creation of temp_install. Then Cluster::init
>> would need logic to only use that if $params{extra} is empty.

> I hacked this together. And the wins are bigger than I thought.

Me too ;-). As I remarked earlier, I'd tried this once before and gave up because it didn't seem to be winning much. But that was before we had so many initdb's triggered by TAP tests, I think.

I tried this patch on florican's host, which seems mostly disk-bound when doing check-world. It barely gets any win from parallelism:

$ time make -s check-world -j1 >/dev/null
     3809.60 real       584.44 user       282.23 sys
$ time make -s check-world -j2 >/dev/null
     3789.90 real       610.60 user       289.60 sys

Adding v2-0001-hack-use-template-initdb-in-tap-tests.patch:

$ time make -s check-world -j1 >/dev/null
     3193.46 real       221.32 user       226.11 sys
$ time make -s check-world -j2 >/dev/null
     3211.19 real       224.31 user       230.07 sys

(Note that all four runs have the "fsync = on" removed from 008_fsm_truncation.pl.)

So this looks like it'll be a nice win for low-end hardware, too.

regards, tom lane
Hi,

On 2022-01-19 11:54:01 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-01-18 13:40:40 -0800, Andres Freund wrote:
> >> Maybe we really should do at least the most simplistic caching for initdbs, by
> >> doing one initdb as part of the creation of temp_install. Then Cluster::init
> >> would need logic to only use that if $params{extra} is empty.
>
> > I hacked this together. And the wins are bigger than I thought.
>
> Me too ;-). As I remarked earlier, I'd tried this once before and
> gave up because it didn't seem to be winning much. But that was
> before we had so many initdb's triggered by TAP tests, I think.

What approach did you use? Do you have a better idea than generating tmp_install/initdb_template?

For a bit I wondered whether initdb should do this internally instead. But it seemed more work than I wanted to tackle.

The bit in the patch about generating initdb_template in Install.pm definitely needs to be made conditional, but I don't precisely know on what. The buildfarm just calls it as perl install.pl "$installdir".

> So this looks like it'll be a nice win for low-end hardware, too.

Nice!

> (Note that all four runs have the "fsync = on" removed from
> 008_fsm_truncation.pl.)

I assume you're planning on committing that?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-01-19 11:54:01 -0500, Tom Lane wrote:
>> Me too ;-). As I remarked earlier, I'd tried this once before and
>> gave up because it didn't seem to be winning much. But that was
>> before we had so many initdb's triggered by TAP tests, I think.

> What approach did you use? Do you have a better idea than generating
> tmp_install/initdb_template?

No, it was largely the same as what you have here, I think. I dug up my WIP patch and attach it below, just in case there's any ideas worth borrowing.

>> (Note that all four runs have the "fsync = on" removed from
>> 008_fsm_truncation.pl.)

> I assume you're planning on committing that?

Yeah, will do that shortly.

regards, tom lane

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index dc8a89af8e..a28a03ac72 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -332,6 +332,7 @@ ifeq ($(MAKELEVEL),0)
 	rm -rf '$(abs_top_builddir)'/tmp_install
 	$(MKDIR_P) '$(abs_top_builddir)'/tmp_install/log
 	$(MAKE) -C '$(top_builddir)' DESTDIR='$(abs_top_builddir)'/tmp_install install >'$(abs_top_builddir)'/tmp_install/log/install.log 2>&1
+	'$(abs_top_builddir)/tmp_install$(bindir)/initdb' -D '$(abs_top_builddir)/tmp_install/proto_pgdata' --no-clean --no-sync -A trust >'$(abs_top_builddir)'/tmp_install/log/initdb.log 2>&1
 endif
 	$(if $(EXTRA_INSTALL),for extra in $(EXTRA_INSTALL); do $(MAKE) -C '$(top_builddir)'/$$extra DESTDIR='$(abs_top_builddir)'/tmp_install install >>'$(abs_top_builddir)'/tmp_install/log/install.log || exit; done)
 endif
@@ -355,7 +356,7 @@ $(if $(filter $(PORTNAME),darwin),DYLD_LIBRARY_PATH,$(if $(filter $(PORTNAME),aix),LIBPATH,LD_LIBRARY_PATH))
 endef

 define with_temp_install
-PATH="$(abs_top_builddir)/tmp_install$(bindir):$$PATH" $(call add_to_path,$(ld_library_path_var),$(abs_top_builddir)/tmp_install$(libdir))
+PATH="$(abs_top_builddir)/tmp_install$(bindir):$$PATH" $(call add_to_path,$(ld_library_path_var),$(abs_top_builddir)/tmp_install$(libdir)) PG_PROTO_DATADIR='$(abs_top_builddir)/tmp_install/proto_pgdata'
 endef

 ifeq ($(enable_tap_tests),yes)
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index f4fa600951..22c1a726c7 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -404,8 +404,23 @@ sub init
 	mkdir $self->backup_dir;
 	mkdir $self->archive_dir;

-	TestLib::system_or_bail('initdb', '-D', $pgdata, '-A', 'trust', '-N',
-		@{ $params{extra} });
+	# If we're using default initdb parameters, and the top-level "make check"
+	# created a prototype data directory for us, just copy that.
+	if (!defined($params{extra}) &&
+		defined($ENV{PG_PROTO_DATADIR}) &&
+		-d $ENV{PG_PROTO_DATADIR})
+	{
+		rmdir($pgdata);
+		RecursiveCopy::copypath($ENV{PG_PROTO_DATADIR}, $pgdata);
+		chmod(0700, $pgdata);
+	}
+	else
+	{
+		TestLib::system_or_bail('initdb', '-D', $pgdata,
+			'--no-clean', '--no-sync', '-A', 'trust',
+			@{ $params{extra} });
+	}
+
 	TestLib::system_or_bail($ENV{PG_REGRESS}, '--config-auth', $pgdata);

 	open my $conf, '>>', "$pgdata/postgresql.conf";
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index abb742b1ed..04d7fb910b 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2214,6 +2214,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc)

 	if (temp_instance)
 	{
+		char	   *pg_proto_datadir;
+		struct stat ppst;
 		FILE	   *pg_conf;
 		const char *env_wait;
 		int			wait_seconds;
@@ -2243,20 +2245,43 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc)
 		if (!directory_exists(buf))
 			make_directory(buf);

-		/* initdb */
-		header(_("initializing database system"));
-		snprintf(buf, sizeof(buf),
-				 "\"%s%sinitdb\" -D \"%s/data\" --no-clean --no-sync%s%s > \"%s/log/initdb.log\" 2>&1",
-				 bindir ? bindir : "",
-				 bindir ? "/" : "",
-				 temp_instance,
-				 debug ? " --debug" : "",
-				 nolocale ? " --no-locale" : "",
-				 outputdir);
-		if (system(buf))
+		/* see if we should run initdb or just copy a prototype datadir */
+		pg_proto_datadir = getenv("PG_PROTO_DATADIR");
+		if (!nolocale &&
+			pg_proto_datadir &&
+			stat(pg_proto_datadir, &ppst) == 0 &&
+			S_ISDIR(ppst.st_mode))
 		{
-			fprintf(stderr, _("\n%s: initdb failed\nExamine %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, outputdir, buf);
-			exit(2);
+			/* copy prototype */
+			header(_("copying prototype data directory"));
+			snprintf(buf, sizeof(buf),
+					 "cp -a \"%s\" \"%s/data\" > \"%s/log/initdb.log\" 2>&1",
+					 pg_proto_datadir,
+					 temp_instance,
+					 outputdir);
+			if (system(buf))
+			{
+				fprintf(stderr, _("\n%s: cp failed\nExamine %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, outputdir, buf);
+				exit(2);
+			}
+		}
+		else
+		{
+			/* run initdb */
+			header(_("initializing database system"));
+			snprintf(buf, sizeof(buf),
+					 "\"%s%sinitdb\" -D \"%s/data\" --no-clean --no-sync%s%s > \"%s/log/initdb.log\" 2>&1",
+					 bindir ? bindir : "",
+					 bindir ? "/" : "",
+					 temp_instance,
+					 debug ? " --debug" : "",
+					 nolocale ? " --no-locale" : "",
+					 outputdir);
+			if (system(buf))
+			{
+				fprintf(stderr, _("\n%s: initdb failed\nExamine %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, outputdir, buf);
+				exit(2);
+			}
 		}

 		/*
Hi,

On 2022-01-19 12:14:21 -0500, Tom Lane wrote:
> No, it was largely the same as what you have here, I think. I dug
> up my WIP patch and attach it below, just in case there's any ideas
> worth borrowing.

Heh, it does look quite similar.

> +					 "cp -a \"%s\" \"%s/data\" > \"%s/log/initdb.log\" 2>&1",
> +					 pg_proto_datadir,
> +					 temp_instance,
> +					 outputdir);
> +			if (system(buf))
> +			{
> +				fprintf(stderr, _("\n%s: cp failed\nExamine %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, outputdir, buf);
> +				exit(2);
> +			}

Both ours have this. Unfortunately on windows cp doesn't natively exist. Although git does provide it. I tried a few things that appear to be natively available (time is best of three executions):

gnu cp from git, cp -a tmp_install\initdb_template t\
670ms

xcopy.exe /E /Q tmp_install\initdb_template t\
638ms

robocopy /e /NFL /NDL tmp_install\initdb_template t\
575ms

So I guess we could use robocopy? That's shipped as part of windows starting in windows 10... xcopy has been there for longer, so I might just default to that.

Greetings,

Andres Freund
Hi,

On 2022-01-19 09:42:31 -0800, Andres Freund wrote:
> Both ours have this. Unfortunately on windows cp doesn't natively
> exist. Although git does provide it. I tried a few things that appear to be
> natively available (time is best of three executions):
>
> gnu cp from git, cp -a tmp_install\initdb_template t\
> 670ms
>
> xcopy.exe /E /Q tmp_install\initdb_template t\
> 638ms

This errors out if there's any forward slashes in paths, thinking it's a flag. Seems out.

> robocopy /e /NFL /NDL tmp_install\initdb_template t\
> 575ms
>
> So I guess we could use robocopy? That's shipped as part of windows starting in
> windows 10... xcopy has been there for longer, so I might just default to that.

It's part of the OS back to at least windows 2016. I've found some random links on the web [1] saying that it's included: "This command is available in Vista and Windows 7 by default. For Windows XP and Server 2003 this tool can be downloaded as part of Server 2003 Windows Resource Kit tools."

Given that our oldest supported msvc version only runs on Windows 7 upwards [2], I think we should be good?

Alternatively we could lift copydir() to src/common? But that seems like a bit more work than I want to put in.

For a second I was thinking that using something like cp --reflink=auto could make a lot of sense for machines like florican, removing most of the IO from a "templated initdb". But it looks like freebsd doesn't have that, and it'd be a pain to figure out whether cp has --reflink.

Greetings,

Andres Freund

[1] https://www.windows-commandline.com/download-robocopy/
[2] https://docs.microsoft.com/en-us/visualstudio/releases/2013/vs2013-sysrequirements-vs
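For what it's worth, one way to sidestep the "does this cp know --reflink" question is to just try it and fall back; a shell sketch, not something proposed for the tree:

  # try a copy-on-write clone first; fall back to a plain recursive copy
  cp -a --reflink=auto "$template" "$pgdata" 2>/dev/null || cp -a "$template" "$pgdata"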
On 1/19/22 21:18, Andres Freund wrote:
> Hi,
>
> On 2022-01-19 09:42:31 -0800, Andres Freund wrote:
>> Both ours have this. Unfortunately on windows cp doesn't natively
>> exist. Although git does provide it. I tried a few things that appear to be
>> natively available (time is best of three executions):
>>
>> gnu cp from git, cp -a tmp_install\initdb_template t\
>> 670ms
>>
>> xcopy.exe /E /Q tmp_install\initdb_template t\
>> 638ms
> This errors out if there's any forward slashes in paths, thinking it's a
> flag. Seems out.
>
>> robocopy /e /NFL /NDL tmp_install\initdb_template t\
>> 575ms
>>
>> So I guess we could use robocopy? That's shipped as part of windows starting in
>> windows 10... xcopy has been there for longer, so I might just default to that.
> It's part of the OS back to at least windows 2016. I've found some random
> links on the web [1] saying that it's included: "This command is available in
> Vista and Windows 7 by default. For Windows XP and Server 2003 this tool can
> be downloaded as part of Server 2003 Windows Resource Kit tools."
>
> Given that our oldest supported msvc version only runs on Windows 7 upwards
> [2], I think we should be good?
>
> Alternatively we could lift copydir() to src/common? But that seems like a bit
> more work than I want to put in.
>
> For a second I was thinking that using something like cp --reflink=auto
> could make a lot of sense for machines like florican, removing most of the IO
> from a "templated initdb". But it looks like freebsd doesn't have that, and
> it'd be a pain to figure out whether cp has --reflink.

FYI, the buildfarm code has this. It doesn't need backslashed paths, you just need to quote the paths, which you should probably do anyway:

sub copydir
{
	my ($from, $to, $logfile) = @_;
	my ($cp, $rd);
	if ($PGBuild::conf{using_msvc})
	{
		$cp = "robocopy /nfl /ndl /np /e /sec ";
		$rd = qq{/LOG+:"$logfile" >nul};
	}
	else
	{
		$cp = "cp -r";
		$rd = qq{> "$logfile"};
	}
	system(qq{$cp "$from" "$to" $rd 2>&1});
	## no critic (RequireLocalizedPunctuationVars)
	$? = 0 if ($cp =~ /robocopy/ && $? >> 8 == 1);
	return;
}

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hi,

On 2022-01-19 18:18:59 -0800, Andres Freund wrote:
> > robocopy /e /NFL /NDL tmp_install\initdb_template t\
> > 575ms
> >
> > So I guess we could use robocopy? That's shipped as part of windows starting in
> > windows 10... xcopy has been there for longer, so I might just default to that.
>
> It's part of the OS back to at least windows 2016. I've found some random
> links on the web [1] saying that it's included: "This command is available in
> Vista and Windows 7 by default. For Windows XP and Server 2003 this tool can
> be downloaded as part of Server 2003 Windows Resource Kit tools."
>
> Given that our oldest supported msvc version only runs on Windows 7 upwards
> [2], I think we should be good?

One thing I'm not sure about is where to perform the creation of the "template" for the msvc scripts. The prototype upthread created it unconditionally in Install.pm, but that's clearly not right.

The buildfarm currently creates the temporary installation using a generic perl install.pl "$installdir" and then uses NO_TEMP_INSTALL. I don't really have a better idea than to introduce a dedicated vcregress.pl command to create the temporary installation? :(

Greetings,

Andres Freund
Hi,

We have some issues with CI on macos and windows being too expensive (more on that soon in a separate email), which reminded me of this thread (with original title: [1]).

I've attached a somewhat cleaned up version of the patch to cache initdb across runs. The results are still fairly impressive in my opinion.

One thing I do not like, but don't have a good idea for how to improve, is that there's a bunch of duplicated logic in pg_regress.c and Cluster.pm. I've tried to move that into initdb.c itself, but that ends up pretty ugly, because we need to be a lot more careful about checking whether options are compatible etc. I've also thought about just putting this into a separate perl script, but right now we still allow basic regression tests without perl being available. So I concluded that for now just having the copies is the best answer.

Times for running all tests under meson, on my workstation (20 cores / 40 threads):

cassert build -O2:

Before:
real    0m44.638s
user    7m58.780s
sys     2m48.773s

After:
real    0m38.938s
user    2m37.615s
sys     2m0.570s

cassert build -O0:

Before:
real    1m11.290s
user    13m9.817s
sys     2m54.946s

After:
real    1m2.959s
user    3m5.835s
sys     1m59.887s

non-cassert build:

Before:
real    0m34.579s
user    5m30.418s
sys     2m40.507s

After:
real    0m27.710s
user    2m20.644s
sys     1m55.770s

On CI this reduces the test times substantially:

Freebsd                        8:51 -> 5:35
Debian w/ asan, autoconf       6:43 -> 4:55
Debian w/ alignmentsan, ubsan  4:02 -> 2:33
macos                          5:07 -> 4:29
windows                       10:21 -> 9:49

This is ignoring a bit of run-to-run variance, but the trend is obvious enough that it's not worth worrying about that.

Greetings,

Andres Freund

[1] https://postgr.es/m/20220120021859.3zpsfqn4z7ob7afz%40alap3.anarazel.de
Attachment
Andres Freund <andres@anarazel.de> writes:
> Times for running all tests under meson, on my workstation (20 cores / 40
> threads):

> cassert build -O2:

> Before:
> real    0m44.638s
> user    7m58.780s
> sys     2m48.773s

> After:
> real    0m38.938s
> user    2m37.615s
> sys     2m0.570s

Impressive results. Even though your bottom-line time doesn't change that much, the big reduction in CPU time should translate to a nice speedup on slower buildfarm animals.

(Disclaimer: I've not read the patch.)

regards, tom lane
Hi,

On 2023-08-05 16:58:38 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Times for running all tests under meson, on my workstation (20 cores / 40
> > threads):
>
> > cassert build -O2:
>
> > Before:
> > real    0m44.638s
> > user    7m58.780s
> > sys     2m48.773s
>
> > After:
> > real    0m38.938s
> > user    2m37.615s
> > sys     2m0.570s
>
> Impressive results. Even though your bottom-line time doesn't change that
> much

Unfortunately we have a few tests that take quite a while - for those the initdb removal doesn't make that much of a difference. Particularly because this machine has enough CPUs to not be fully busy except for the first few seconds... E.g. for a run with the patch applied:

258/265 postgresql:pg_basebackup / pg_basebackup/010_pg_basebackup   OK  16.58s    187 subtests passed
259/265 postgresql:subscription / subscription/100_bugs              OK   6.69s     12 subtests passed
260/265 postgresql:regress / regress/regress                         OK  24.95s    215 subtests passed
261/265 postgresql:ssl / ssl/001_ssltests                            OK   7.97s    205 subtests passed
262/265 postgresql:pg_dump / pg_dump/002_pg_dump                     OK  19.65s  11262 subtests passed
263/265 postgresql:recovery / recovery/027_stream_regress            OK  29.34s      6 subtests passed
264/265 postgresql:isolation / isolation/isolation                   OK  33.94s    112 subtests passed
265/265 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade            OK  38.22s     18 subtests passed

The pg_upgrade test is faster in isolation (29s), but not that much. The overall runtime is reduced due to the reduced "competing" cpu usage, but other than that...

Looking at where the time is spent when running the pg_upgrade test on its own:

grep -E '^\[' testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade | sed -E -e 's/.*\(([0-9.]+)s\)(.*)/\1 \2/g' | sort -n -r

cassert:
13.094  ok 5 - regression tests pass
 6.147  ok 14 - run of pg_upgrade for new instance
 2.340  ok 6 - dump before running pg_upgrade
 1.638  ok 17 - dump after running pg_upgrade
 1.375  ok 12 - run of pg_upgrade --check for new instance
 0.798  ok 1 - check locales in original cluster
 0.371  ok 9 - invalid database causes failure status (got 1 vs expected 1)
 0.149  ok 7 - run of pg_upgrade --check for new instance with incorrect binary path
 0.131  ok 16 - check that locales in new cluster match original cluster

optimized:
 8.372  ok 5 - regression tests pass
 3.641  ok 14 - run of pg_upgrade for new instance
 1.371  ok 12 - run of pg_upgrade --check for new instance
 1.104  ok 6 - dump before running pg_upgrade
 0.636  ok 17 - dump after running pg_upgrade
 0.594  ok 1 - check locales in original cluster
 0.359  ok 9 - invalid database causes failure status (got 1 vs expected 1)
 0.148  ok 7 - run of pg_upgrade --check for new instance with incorrect binary path
 0.127  ok 16 - check that locales in new cluster match original cluster

The time for "dump before running pg_upgrade" is misleadingly high - there's no output between starting initdb and the dump, so the timing includes initdb and a bunch of other work. But it's still not fast (1.637s) after.

A small factor is that the initdb times are not insignificant, because the template initdb can't be used due to a bunch of parameters passed to initdb :)

> the big reduction in CPU time should translate to a nice speedup on slower
> buildfarm animals.

Yea. It's a particularly large win when using valgrind. Under valgrind, a very large portion of the time for many tests is just spent doing initdb... So I am hoping to see some nice gains for skink.

Greetings,

Andres Freund
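To illustrate the last point: the pg_upgrade test sets up its clusters with non-default initdb options, and any entry in "extra" makes Cluster::init fall back to a real initdb run. The options below are made up for illustration, not taken from the actual test:

  # any extra initdb switch disables the template copy, so this pays full price
  my $oldnode = PostgreSQL::Test::Cluster->new('old_node');
  $oldnode->init(extra => [ '--locale=C', '--encoding=UTF8' ]);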
> On 5 Aug 2023, at 21:56, Andres Freund <andres@anarazel.de> wrote:

> We have some issues with CI on macos and windows being too expensive (more on
> that soon in a separate email), which reminded me of this thread (with
> original title: [1]).
>
> I've attached a somewhat cleaned up version of the patch to cache initdb
> across runs. The results are still fairly impressive in my opinion.
>
> One thing I do not like, but don't have a good idea for how to improve, is
> that there's a bunch of duplicated logic in pg_regress.c and Cluster.pm. I've
> tried to move that into initdb.c itself, but that ends up pretty ugly, because
> we need to be a lot more careful about checking whether options are compatible
> etc. I've also thought about just putting this into a separate perl script,
> but right now we still allow basic regression tests without perl being
> available. So I concluded that for now just having the copies is the best
> answer.

I had a look at this today and have been running a lot of tests with it without finding anything that breaks. The duplicated code is unfortunate, but after playing around with some options I agree that it's likely the best option. While looking I did venture down the rabbithole of making it support extra params as well, but I don't think moving the goalposts there is doing us any favors; it's clearly chasing diminishing returns.

My only small gripe is that I keep thinking about template databases for CREATE DATABASE when reading the error messages in this patch, which is clearly not related to what this does.

+ note("initializing database system by copying initdb template");

I personally would've used cache instead of template in the user facing parts to keep concepts separated, but that's personal taste.

All in all, I think this is committable as is.

--
Daniel Gustafsson
Hi,

On 2023-08-22 23:47:24 +0200, Daniel Gustafsson wrote:
> I had a look at this today and have been running a lot of tests with it without
> finding anything that breaks.

Thanks!

> The duplicated code is unfortunate, but after playing around with some
> options I agree that it's likely the best option.

Good and bad to hear :)

> While looking I did venture down the rabbithole of making it support extra
> params as well, but I don't think moving the goalposts there is doing us any
> favors; it's clearly chasing diminishing returns.

Agreed. I also went down that rabbithole, but it quickly gets a lot more code and complexity - and there just aren't that many tests using non-default options.

> My only small gripe is that I keep thinking about template databases for CREATE
> DATABASE when reading the error messages in this patch, which is clearly not
> related to what this does.
>
> + note("initializing database system by copying initdb template");
>
> I personally would've used cache instead of template in the user facing parts
> to keep concepts separated, but that's personal taste.

I am going back and forth on that one (as one can notice with $subject). It doesn't quite seem like a cache, as it's not "created" on demand and only usable when exactly the same parameters are used repeatedly. But template is overloaded, as you say...

> All in all, I think this is committable as is.

Cool. Planning to do that tomorrow. We can easily extend / adjust this later, it just affects testing infrastructure.

Greetings,

Andres Freund
> On 23 Aug 2023, at 03:17, Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-22 23:47:24 +0200, Daniel Gustafsson wrote:

>> My only small gripe is that I keep thinking about template databases for CREATE
>> DATABASE when reading the error messages in this patch, which is clearly not
>> related to what this does.
>>
>> + note("initializing database system by copying initdb template");
>>
>> I personally would've used cache instead of template in the user facing parts
>> to keep concepts separated, but that's personal taste.
>
> I am going back and forth on that one (as one can notice with $subject). It
> doesn't quite seem like a cache, as it's not "created" on demand and only
> usable when exactly the same parameters are used repeatedly. But template is
> overloaded, as you say...

That's a fair point, cache is not a good word to describe a stored copy of something prefabricated. Let's go with template; we can always refine in-tree if a better wording comes along.

--
Daniel Gustafsson
Hi,

On 2023-08-23 10:10:31 +0200, Daniel Gustafsson wrote:
> > On 23 Aug 2023, at 03:17, Andres Freund <andres@anarazel.de> wrote:
> > On 2023-08-22 23:47:24 +0200, Daniel Gustafsson wrote:
>
> >> My only small gripe is that I keep thinking about template databases for CREATE
> >> DATABASE when reading the error messages in this patch, which is clearly not
> >> related to what this does.
> >>
> >> + note("initializing database system by copying initdb template");
> >>
> >> I personally would've used cache instead of template in the user facing parts
> >> to keep concepts separated, but that's personal taste.
> >
> > I am going back and forth on that one (as one can notice with $subject). It
> > doesn't quite seem like a cache, as it's not "created" on demand and only
> > usable when exactly the same parameters are used repeatedly. But template is
> > overloaded, as you say...
>
> That's a fair point, cache is not a good word to describe a stored copy of
> something prefabricated. Let's go with template; we can always refine in-tree
> if a better wording comes along.

Cool. Pushed that way. Only change I made is to redirect the output of cp (and/or robocopy) in pg_regress, similar to how that was done for initdb proper.

Let's see what the buildfarm says - it's not inconceivable that it'll show some issues.

Greetings,

Andres Freund
On Fri, Aug 25, 2023 at 10:10 AM Andres Freund <andres@anarazel.de> wrote:
> Let's see what the buildfarm says - it's not inconceivable that it'll show
> some issues.

Apparently Solaris doesn't like "cp -a", per animal "margay". I think "cp -RPp" should be enough everywhere?

https://docs.oracle.com/cd/E88353_01/html/E37839/cp-1.html
https://pubs.opengroup.org/onlinepubs/9699919799.2013edition/utilities/cp.html
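Presumably the attached patch just swaps the flags in the copy command pg_regress builds, along these lines (a sketch modeled on the earlier WIP diff; variable names are assumptions rather than the committed code):

  /* use only POSIX-specified cp options so Solaris's cp is happy too */
  snprintf(buf, sizeof(buf),
           "cp -RPp \"%s\" \"%s/data\" > \"%s/log/initdb.log\" 2>&1",
           template_datadir, temp_instance, outputdir);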
Attachment
> On 25 Aug 2023, at 07:50, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Aug 25, 2023 at 10:10 AM Andres Freund <andres@anarazel.de> wrote:
>> Let's see what the buildfarm says - it's not inconceivable that it'll show
>> some issues.
>
> Apparently Solaris doesn't like "cp -a", per animal "margay". I think
> "cp -RPp" should be enough everywhere?

Agreed, AFAICT that should work equally well on all supported platforms.

--
Daniel Gustafsson
Hi,

On 2023-08-25 09:00:24 +0200, Daniel Gustafsson wrote:
> > On 25 Aug 2023, at 07:50, Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > On Fri, Aug 25, 2023 at 10:10 AM Andres Freund <andres@anarazel.de> wrote:
> >> Let's see what the buildfarm says - it's not inconceivable that it'll show
> >> some issues.
> >
> > Apparently Solaris doesn't like "cp -a", per animal "margay". I think
> > "cp -RPp" should be enough everywhere?

Thanks for noticing the issue and submitting the patch.

> Agreed, AFAICT that should work equally well on all supported platforms.

Also agreed. Unsurprisingly, CI didn't find anything on the tested platforms.

Pushed.

Greetings,

Andres Freund
On Thu, Aug 24, 2023 at 03:10:00PM -0700, Andres Freund wrote:
> Cool. Pushed that way.

I just noticed the tests running about 30% faster on my machine due to this. Thanks!

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, 25 Aug 2023 at 00:16, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-08-23 10:10:31 +0200, Daniel Gustafsson wrote:
> > > On 23 Aug 2023, at 03:17, Andres Freund <andres@anarazel.de> wrote:
> > > On 2023-08-22 23:47:24 +0200, Daniel Gustafsson wrote:
> >
> > >> My only small gripe is that I keep thinking about template databases for CREATE
> > >> DATABASE when reading the error messages in this patch, which is clearly not
> > >> related to what this does.
> > >>
> > >> + note("initializing database system by copying initdb template");
> > >>
> > >> I personally would've used cache instead of template in the user facing parts
> > >> to keep concepts separated, but that's personal taste.
> > >
> > > I am going back and forth on that one (as one can notice with $subject). It
> > > doesn't quite seem like a cache, as it's not "created" on demand and only
> > > usable when exactly the same parameters are used repeatedly. But template is
> > > overloaded, as you say...
> >
> > That's a fair point, cache is not a good word to describe a stored copy of
> > something prefabricated. Let's go with template; we can always refine in-tree
> > if a better wording comes along.
>
> Cool. Pushed that way. Only change I made is to redirect the output of cp
> (and/or robocopy) in pg_regress, similar to how that was done for initdb
> proper.

While working on some things that are prone to breaking initdb, I noticed that this template isn't generated with --no-clean, while pg_regress does do that. This meant `make check` didn't have any meaningful debuggable output when I broke the initdb process, which is undesirable.

Attached is a patch that fixes this for both make and meson, by adding --no-clean to the initdb template.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
Attachment
> On 7 Dec 2023, at 14:50, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

> Attached is a patch that fixes this for both make and meson, by adding
> --no-clean to the initdb template.

Makes sense. While in there I think we should rename -N to the long option --no-sync to make it easier to grep for and make the buildfiles more self-documenting.

--
Daniel Gustafsson
On Thu, 7 Dec 2023 at 15:06, Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 7 Dec 2023, at 14:50, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> > Attached is a patch that fixes this for both make and meson, by adding
> > --no-clean to the initdb template.
>
> Makes sense. While in there I think we should rename -N to the long option
> --no-sync to make it easier to grep for and make the buildfiles more
> self-documenting.

Then that'd be the attached patch, which also includes --auth instead of -A, for the same reason as -N vs --no-sync.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
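Spelled out that way, the template initdb invocation in the build files ends up looking roughly like this (a sketch; the exact target path and redirections in the real make and meson rules may differ):

  # long options make the intent greppable: --auth instead of -A, --no-sync instead of -N
  initdb --auth trust --no-sync --no-clean -D "$tmp_install/initdb-template"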
Attachment
> On 7 Dec 2023, at 15:27, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

> Then that'd be the attached patch, which also includes --auth instead
> of -A, for the same reason as -N vs --no-sync.

Applied to master, thanks!

--
Daniel Gustafsson