Thread: why is pg_upgrade's regression run so slow?
As part of its 002_pgupgrade.pl, the pg_upgrade tests do a rerun of the normal regression tests. That in itself is sad, but what concerns me here is why it's so much slower than the regular run? This is apparent everywhere (e.g. on crake the standard run takes about 30 to 90 s, but pg_upgrade's run takes 5 minutes or more).

On Windows it's catastrophic, and has only gone unnoticed because the buildfarm client wasn't counting a timeout as a failure. That was an error on my part, and I have switched a few of my machines to code that checks more robustly for failure of meson tests - specifically, by looking for the absence of test.success rather than the presence of test.fail. That means that drongo and fairywren are getting timeout errors. e.g. on the latest run on fairywren, the regular regression run took 226s, but pg_upgrade's run of what should be the same set of tests took 2418s.

What the heck is going on here? Is it because there are concurrent tests running? That doesn't seem enough to make the tests run more than 10 times as long. I have a strong suspicion this is exacerbated by "debug_parallel_query = regress", especially since the tests run much faster on REL_17_STABLE where I am not setting that, but that can't be the whole explanation, since that setting should apply to both sets of tests.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
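The test.success / test.fail files mentioned above are the per-test marker files that the meson test wrapper leaves behind. A minimal shell sketch of the more robust check - the real buildfarm client code is Perl, and the testrun/*/*/ layout is an assumption about a typical meson build tree:

    # rough sketch: treat a test as failed unless its test.success marker exists,
    # rather than only failing when a test.fail file is present
    cd build
    for d in testrun/*/*/; do
        [ -e "${d}test.success" ] || echo "FAILED (no test.success): $d"
    done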
Andrew Dunstan <andrew@dunslane.net> writes:
> As part of its 002_pgupgrade.pl, the pg_upgrade tests do a rerun of the
> normal regression tests. That in itself is sad, but what concerns me
> here is why it's so much slower than the regular run? This is apparent
> everywhere (e.g. on crake the standard run takes about 30 to 90 s, but
> pg_upgrade's run takes 5 minutes or more).

Just to add some more fuel to the fire: I do *not* observe such an effect on my own animals. For instance, sifaka's last run shows 00:09 for install-check-C and the same (within rounding error) for the regression test step in 002_pgupgrade; on the much slower mamba, the last run took 07:33 for install-check-C and 07:40 for the 002_pgupgrade regression test step.

I'm still using the makefile-based build, and I'm not using debug_parallel_query, but it doesn't make a lot of sense to me that either of those things should affect this.

Looking at crake's last run, there are other oddities: why does the "check" step take 00:24 while "install-check-en_US.utf8" (which should be doing strictly less work) takes 01:00? Again, there are not similar discrepancies on my animals. Are you sure there's not background load on the machine?

			regards, tom lane
On 2024-07-27 Sa 10:20 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> As part of its 002_pgupgrade.pl, the pg_upgrade tests do a rerun of the
>> normal regression tests. That in itself is sad, but what concerns me
>> here is why it's so much slower than the regular run? This is apparent
>> everywhere (e.g. on crake the standard run takes about 30 to 90 s, but
>> pg_upgrade's run takes 5 minutes or more).
> Just to add some more fuel to the fire: I do *not* observe such an
> effect on my own animals. For instance, sifaka's last run shows
> 00:09 for install-check-C and the same (within rounding error)
> for the regression test step in 002_pgupgrade; on the much slower
> mamba, the last run took 07:33 for install-check-C and 07:40 for the
> 002_pgupgrade regression test step.
>
> I'm still using the makefile-based build, and I'm not using
> debug_parallel_query, but it doesn't make a lot of sense to me
> that either of those things should affect this.
>
> Looking at crake's last run, there are other oddities: why does
> the "check" step take 00:24 while "install-check-en_US.utf8" (which
> should be doing strictly less work) takes 01:00? Again, there are
> not similar discrepancies on my animals. Are you sure there's not
> background load on the machine?
>

Quite sure. Running crake and koel is all it does. It's Fedora 40 running on bare metal, a Lenovo Y70 with an Intel Core i7-4720HQ CPU @ 2.60GHz and a Samsung SSD.

The culprit appears to be meson. When I tested running crake with "using_meson => 0" I got results in line with yours. The regression test times were consistent, including the installcheck tests.

That's especially ugly on Windows, as we don't have any alternative way of running there, and the results are so much more catastrophic.

"debug_parallel_query" didn't seem to matter.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2024-07-27 Sa 10:20 AM, Tom Lane wrote:
>> Just to add some more fuel to the fire: I do *not* observe such an
>> effect on my own animals.

> The culprit appears to be meson. When I tested running crake with
> "using_meson => 0" I got results in line with yours.

Interesting. Maybe meson is over-aggressively trying to run these test suites in parallel?

			regards, tom lane
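One rough way to check that hypothesis would be to compare a default run against a fully serialized one - a sketch relying on meson's --num-processes flag (the MESON_TESTTHREADS environment variable controls the same thing):

    # from the meson build directory
    meson test                      # default: tests run concurrently
    meson test --num-processes 1    # everything strictly one at a time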
On Sun, Jul 28, 2024 at 10:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Interesting. Maybe meson is over-aggressively trying to run these
> test suites in parallel?

Hypothesis: NTFS might not be as good at linking/unlinking lots of files concurrently due to forced synchronous I/O, causing queuing? That's what [1] was about for FreeBSD CI. I couldn't immediately see how to make a RAM disk on Windows, but at the time I had a hunch that that OS was struggling in the same way.

Could some tuning help? Disable 8dot3name (a thing that creates a ye-olde-MSDOS-compatible second directory entry for every file), adjust disablelastaccess (something like noatime), disable the USN journal (a kind of secondary journal of all file system operations that is used to drive the change notification API that we don't care about), disable write cache flush so that any synchronous I/O operations don't wait for that (at the risk of corruption on power loss, but maybe it's OK on a device dedicated to temporary workspace)? This is just from some quick googling, but perhaps someone who actually knows how to drive Windows locally and use the performance monitoring tools could tell us what it's actually waiting on...

I noticed there is a new thing called Dev Drive[2] on Windows 11, which claims to be faster for developer workloads, and there are graphs[3] showing various projects' test suites going faster. It's ReFS, a COW file system. From some quick googling, the CopyFile() system does a fast clone, and that should affect the robocopy command in Cluster.pm (note: Unixoid cp in there also uses COW cloning on at least xfs, zfs, probably apfs too). So I would be interested to know if that goes faster ... or slower. I'm also interested in how it reacts to the POSIX-semantics mode[4]; that might affect whether we can ever pull the trigger on that idea.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BFXLcEg1dyTqJjDiNQ8pGom4KrJj4wF38C90thti9dVA%40mail.gmail.com
[2] https://learn.microsoft.com/en-us/windows/dev-drive/
[3] https://devblogs.microsoft.com/visualstudio/devdrive/
[4] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com
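For anyone wanting to try those tunings, the knobs look roughly like this - a sketch from an elevated PowerShell prompt, with the exact syntax recalled from memory rather than verified, so double-check before running:

    # stop creating 8.3 short names on all volumes
    fsutil behavior set disable8dot3 1
    # roughly NTFS's equivalent of noatime
    fsutil behavior set disablelastaccess 1
    # drop the USN change journal on the volume holding the build tree (d: here)
    fsutil usn deletejournal /d d:
    # write-cache flushing is a per-device setting, via Device Manager ->
    # disk Properties -> Policies -> "Turn off Windows write-cache buffer flushing"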
On 2024-07-27 Sa 6:48 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 2024-07-27 Sa 10:20 AM, Tom Lane wrote:
>>> Just to add some more fuel to the fire: I do *not* observe such an
>>> effect on my own animals.
>> The culprit appears to be meson. When I tested running crake with
>> "using_meson => 0" I got results in line with yours.
> Interesting. Maybe meson is over-aggressively trying to run these
> test suites in parallel?

Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on HEAD for fairywren and drongo - fairywren has just gone green, and I expect drongo will when it reports in a few hours.

I'm at a loss for an explanation.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hello Andrew,

28.07.2024 22:43, Andrew Dunstan wrote:
>
> Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on
> HEAD for fairywren and drongo - fairywren has just gone green, and I
> expect drongo will when it reports in a few hours.
>
> I'm at a loss for an explanation.
>

I'm observing the same here (using a Windows 10 VM).

With no TEMP_CONFIG set, `meson test` gives me these numbers:
test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 72.34s

test: postgresql:regress / regress/regress
duration: 41.98s

But with debug_parallel_query=regress in TEMP_CONFIG, I see:
test: postgresql:regress / regress/regress
duration: 50.08s

test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 398.45s

With log_min_messages=DEBUG2 added to TEMP_CONFIG, I can see how many parallel workers were started during the test:
...\postgresql\build>grep "starting background worker process" -r testrun/pg_upgrade | wc -l
17532

And another interesting fact is that TEMP_CONFIG is apparently ignored by `meson test regress/regress`. For example, with temp.config containing invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but `meson test regress/regress` passes just fine.

Best regards,
Alexander
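A rough sketch of the reproduction, written here in Unix-shell form for brevity (the session above was on Windows, and the temp.config path is only an example):

    # put the extra settings in a file and point TEMP_CONFIG at it,
    # then run both suites from the meson build directory
    echo "debug_parallel_query = regress"  > /tmp/temp.config
    echo "log_min_messages = DEBUG2"      >> /tmp/temp.config
    export TEMP_CONFIG=/tmp/temp.config
    meson test regress/regress pg_upgrade/002_pg_upgrade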
On 2024-07-29 Mo 4:00 AM, Alexander Lakhin wrote:
> Hello Andrew,
>
> 28.07.2024 22:43, Andrew Dunstan wrote:
>>
>> Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on
>> HEAD for fairywren and drongo - fairywren has just gone green, and I
>> expect drongo will when it reports in a few hours.
>>
>> I'm at a loss for an explanation.
>>
>
> I'm observing the same here (using a Windows 10 VM).
>
> With no TEMP_CONFIG set, `meson test` gives me these numbers:
> test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
> duration: 72.34s
>
> test: postgresql:regress / regress/regress
> duration: 41.98s
>
> But with debug_parallel_query=regress in TEMP_CONFIG, I see:
> test: postgresql:regress / regress/regress
> duration: 50.08s
>
> test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
> duration: 398.45s
>
> With log_min_messages=DEBUG2 added to TEMP_CONFIG, I can see how many
> parallel workers were started during the test:
> ...\postgresql\build>grep "starting background worker process" -r
> testrun/pg_upgrade | wc -l
> 17532
>
> And another interesting fact is that TEMP_CONFIG is apparently ignored by
> `meson test regress/regress`. For example, with temp.config containing
> invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
> `meson test regress/regress` passes just fine.
>

Well, that last fact explains the discrepancy I originally complained about. How the heck did that happen? It looks like we just ignored its use in Makefile.global.in :-(

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 2024-08-19 Mo 8:00 AM, Alexander Lakhin wrote:
> Hello Andrew,
>
> 29.07.2024 13:54, Andrew Dunstan wrote:
>>
>> On 2024-07-29 Mo 4:00 AM, Alexander Lakhin wrote:
>>>
>>> And another interesting fact is that TEMP_CONFIG is apparently
>>> ignored by `meson test regress/regress`. For example, with temp.config
>>> containing invalid settings, `meson test pg_upgrade/002_pg_upgrade`
>>> fails, but `meson test regress/regress` passes just fine.
>>>
>>
>> Well, that last fact explains the discrepancy I originally complained
>> about. How the heck did that happen? It looks like we just ignored
>> its use in Makefile.global.in :-(
>
> Please look at the attached patch (partially based on ideas from [1]) for
> meson.build, that aligns it with `make` in regard to use of TEMP_CONFIG.
>
> Maybe it could be implemented via a separate meson option, but that would
> also require modifying at least the buildfarm client...
>
> [1] https://www.postgresql.org/message-id/CAN55FZ304Kp%2B510-iU5-Nx6hh32ny9jgGn%2BOB5uqPExEMK1gQQ%40mail.gmail.com
>

I think this is the way to go. The patch LGTM.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com