Thread: why is pg_upgrade's regression run so slow?

why is pg_upgrade's regression run so slow?

From
Andrew Dunstan
Date:
As part of its 002_pgupgrade.pl, the pg_upgrade tests do a rerun of the 
normal regression tests. That in itself is sad, but what concerns me 
here is why it's so much slower than the regular run? This is apparent 
everywhere (e.g. on crake the standard run takes about 30 to 90 s, but 
pg_upgrade's run takes 5 minutes or more). On Windows, it's 
catastrophic, and only hasn't been noticed because the buildfarm client 
wasn't counting a timeout as a failure. That was an error on my part and 
I have switched a few of my machines to code that checks more robustly 
for failure of meson tests - specifically by looking for the absence of 
test.success rather than the presence of test.fail. That means that 
drongo and fairywren are getting timeout errors. e.g. on the latest run 
on fairywren, the regular regression run took 226s, but pg_upgrade's run 
of what should be the same set of tests took 2418s. What the heck is 
going on here? Is it because there are the concurrent tests running? 
That doesn't seem enough to make the tests run more than 10 times as long.

I have a strong suspicion this is exacerbated by "debug_parallel_query = 
regress", especially since the tests run much faster on REL_17_STABLE 
where I am not setting that, but that can't be the whole explanation, 
since that setting should apply to both sets of tests.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: why is pg_upgrade's regression run so slow?

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> As part of its 002_pgupgrade.pl, the pg_upgrade tests do a rerun of the 
> normal regression tests. That in itself is sad, but what concerns me 
> here is why it's so much slower than the regular run? This is apparent 
> everywhere (e.g. on crake the standard run takes about 30 to 90 s, but 
> pg_upgrade's run takes 5 minutes or more).

Just to add some more fuel to the fire: I do *not* observe such an
effect on my own animals.  For instance, sifaka's last run shows
00:09 for install-check-C and the same (within rounding error)
for the regression test step in 002_pgupgrade; on the much slower
mamba, the last run took 07:33 for install-check-C and 07:40 for the
002_pgupgrade regression test step.

I'm still using the makefile-based build, and I'm not using
debug_parallel_query, but it doesn't make a lot of sense to me
that either of those things should affect this.

Looking at crake's last run, there are other oddities: why does
the "check" step take 00:24 while "install-check-en_US.utf8" (which
should be doing strictly less work) takes 01:00?  Again, there are
not similar discrepancies on my animals.  Are you sure there's not
background load on the machine?

            regards, tom lane



Re: why is pg_upgrade's regression run so slow?

From
Andrew Dunstan
Date:
On 2024-07-27 Sa 10:20 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> As part of its 002_pgupgrade.pl, the pg_upgrade tests do a rerun of the
>> normal regression tests. That in itself is sad, but what concerns me
>> here is why it's so much slower than the regular run? This is apparent
>> everywhere (e.g. on crake the standard run takes about 30 to 90 s, but
>> pg_upgrade's run takes 5 minutes or more).
> Just to add some more fuel to the fire: I do *not* observe such an
> effect on my own animals.  For instance, sifaka's last run shows
> 00:09 for install-check-C and the same (within rounding error)
> for the regression test step in 002_pgupgrade; on the much slower
> mamba, the last run took 07:33 for install-check-C and 07:40 for the
> 002_pgupgrade regression test step.
>
> I'm still using the makefile-based build, and I'm not using
> debug_parallel_query, but it doesn't make a lot of sense to me
> that either of those things should affect this.
>
> Looking at crake's last run, there are other oddities: why does
> the "check" step take 00:24 while "install-check-en_US.utf8" (which
> should be doing strictly less work) takes 01:00?  Again, there are
> not similar discrepancies on my animals.  Are you sure there's not
> background load on the machine?
>
>             


Quite sure. Running crake and koel all it does. It's Fedora 40 running 
on bare metal, a Lenovo Y70 with an Intel Core i7-4720HQ CPU @ 2.60GHz 
and a Samsung SSD.

The culprit appears to be meson. When I tested running crake with 
"using_meson => 0" I got results in line with yours. The regression test 
times were consistent, including the installcheck tests.  That's 
especially ugly on Windows as we don't have any alternative way of 
running, and the results are so much more catastrophic. 
"debug_parallel_query" didn't seem to matter.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: why is pg_upgrade's regression run so slow?

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2024-07-27 Sa 10:20 AM, Tom Lane wrote:
>> Just to add some more fuel to the fire: I do *not* observe such an
>> effect on my own animals.

> The culprit appears to be meson. When I tested running crake with 
> "using_meson => 0" I got results in line with yours.

Interesting.  Maybe meson is over-aggressively trying to run these
test suites in parallel?

            regards, tom lane



Re: why is pg_upgrade's regression run so slow?

From
Thomas Munro
Date:
On Sun, Jul 28, 2024 at 10:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Interesting.  Maybe meson is over-aggressively trying to run these
> test suites in parallel?

Hypothesis: NTFS might not be as good at linking/unlinking lots of
files concurrently due to forced synchronous I/O, causing queuing?

That's what [1] was about for FreeBSD CI.  I couldn't immediately see
how to make a RAM disks on Windows but at the time I had a hunch that
that OS was struggling in the same way.

Could some tuning help?  Disable 8dot3name (a thing that creates a
ye-olde-MSDOS-compatible second directory entry for every file),
adjust disablelastaccess (something like noatime), disable USN journal
(a kind of secondary journal of all file system operations that is
used to drive the change notification API that we don't care about),
disable write cache flush so that any synchronous I/O operations don't
wait for that (at the risk of corruption on power loss, but maybe it's
OK on a device dedicated to temporary workspace)?  This is just from
some quick googling, but perhaps someone who actually knows how to
drive Windows locally and use the performance monitoring tools could
tell us what it's actually waiting on...

I noticed there is a new thing called Dev Drive[2] on Windows 11,
which claims to be faster for developer workloads and there are
graphs[3] showing various projects' test suites going faster.  It's
ReFS, a COW file system.  From some quick googling, the CopyFile()
system does a fast clone, and that should affect the robocopy command
in Cluster.pm (note: Unixoid cp in there also uses COW cloning on at
least xfs, zfs, probably apfs too).  So I would be interested to know
if that goes faster ... or slower.  I'm also interested in how it
reacts to the POSIX-semantics mode[4]; that might affect whether we
can ever pull the trigger on that idea.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BFXLcEg1dyTqJjDiNQ8pGom4KrJj4wF38C90thti9dVA%40mail.gmail.com
[2] https://learn.microsoft.com/en-us/windows/dev-drive/
[3] https://devblogs.microsoft.com/visualstudio/devdrive/
[4]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com



Re: why is pg_upgrade's regression run so slow?

From
Andrew Dunstan
Date:
On 2024-07-27 Sa 6:48 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 2024-07-27 Sa 10:20 AM, Tom Lane wrote:
>>> Just to add some more fuel to the fire: I do *not* observe such an
>>> effect on my own animals.
>> The culprit appears to be meson. When I tested running crake with
>> "using_meson => 0" I got results in line with yours.
> Interesting.  Maybe meson is over-aggressively trying to run these
> test suites in parallel?



Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on 
HEAD for fairywren and drongo -  fairwren has just gone green, and I 
expect drongo will when it reports in a few hours.

I'm at a loss for an explanation.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: why is pg_upgrade's regression run so slow?

From
Alexander Lakhin
Date:
Hello Andrew,

28.07.2024 22:43, Andrew Dunstan wrote:
>
> Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on HEAD for fairywren and drongo -  fairwren has 
> just gone green, and I expect drongo will when it reports in a few hours.
>
> I'm at a loss for an explanation.
>

I'm observing the same here (using a Windows 10 VM).

With no TEMP_CONFIG set, `meson test` gives me these numbers:
test:         postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration:     72.34s

test:         postgresql:regress / regress/regress
duration:     41.98s

But with debug_parallel_query=regress in TEMP_CONFIG, I see:
test:         postgresql:regress / regress/regress
duration:     50.08s

test:         postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration:     398.45s

With log_min_messages=DEBUG2 added to TEMP_CONFIG, I can see how many
parallel workers were started during the test:
...\postgresql\build>grep "starting background worker process" -r testrun/pg_upgrade | wc -l
17532

And another interesting fact is that TEMP_CONFIG is apparently ignored by
`meson test regress/regress`. For example, with temp.config containing
invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
`meson test regress/regress` passes just fine.

Best regards,
Alexander



Re: why is pg_upgrade's regression run so slow?

From
Andrew Dunstan
Date:
On 2024-07-29 Mo 4:00 AM, Alexander Lakhin wrote:
> Hello Andrew,
>
> 28.07.2024 22:43, Andrew Dunstan wrote:
>>
>> Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on 
>> HEAD for fairywren and drongo -  fairwren has just gone green, and I 
>> expect drongo will when it reports in a few hours.
>>
>> I'm at a loss for an explanation.
>>
>
> I'm observing the same here (using a Windows 10 VM).
>
> With no TEMP_CONFIG set, `meson test` gives me these numbers:
> test:         postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
> duration:     72.34s
>
> test:         postgresql:regress / regress/regress
> duration:     41.98s
>
> But with debug_parallel_query=regress in TEMP_CONFIG, I see:
> test:         postgresql:regress / regress/regress
> duration:     50.08s
>
> test:         postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
> duration:     398.45s
>
> With log_min_messages=DEBUG2 added to TEMP_CONFIG, I can see how many
> parallel workers were started during the test:
> ...\postgresql\build>grep "starting background worker process" -r 
> testrun/pg_upgrade | wc -l
> 17532
>
> And another interesting fact is that TEMP_CONFIG is apparently ignored by
> `meson test regress/regress`. For example, with temp.config containing
> invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
> `meson test regress/regress` passes just fine.
>
>

Well, that last fact explains the discrepancy I originally complained 
about. How the heck did that happen? It looks like we just ignored its 
use in Makefile.global.in :-(


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com