Re: Test to dump and restore objects left behind by regression - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Test to dump and restore objects left behind by regression
Date
Msg-id 4nuo6wzmidxv7q5qayywazp6iqk7b66psyznjvij3s3jgqoq5p@kxk76xvgwsna
Whole thread Raw
In response to Re: Test to dump and restore objects left behind by regression  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: Test to dump and restore objects left behind by regression
List pgsql-hackers
Hi,

On 2025-04-03 19:14:02 +0200, Alvaro Herrera wrote:
> On 2025-Apr-03, Andres Freund wrote:
>
> > I've increased the timeout even further, but I can't say that I am happy about
> > the slowest test getting even slower. Adding test time in the serially slowest
> > test is way worse than adding the same time in a concurrent test.
>
> Yeah.  We discussed strategies to shorten the runtime, but the agreement
> upthread was that we'd look for more elaborate ways to do that
> afterwards.

I think it's not good to just say "we'll maybe somehow fix it in the future",
particularly if the solution is by no means agreed to.  If this were a test
that wasn't already the bottleneck for test cycles, it'd would be a different
story, but it is.

I'm already unhappy that 002_pg_upgrade got noticeably slower due to the stats
dump changes. This made it even worse.


> As I mentioned, I can see adding something like PG_TEST_EXCLUDE that we
> could use to suppress this test on slow hosts.  Would that work for you?

Not particularly well. For one I don't actually think it's good to exclude it
from something like valgrind testing. But it's also something that wouldn't
really be usable locally, I think, given that we'd be expected to keep the
test working.  As outlined below, it really affects the test-hack-test loop
times.


> (We also discussed the fact that this was part of 002_pg_upgrade.pl
> instead of being elsewhere.  The reason is that this depends on the
> regression tests having run, and this is the only TAP test that does
> that.   Well, this one and 027_stream_regress.pl which is even slower.)

FWIW, for me 027 is actually considerably faster. In an cassert -O0 build (my
normal development env, I find even -Og too problematic for debugging):

pg_upgrade/002_pg_upgrade    96.61s
recovery/027_stream_regress  66.04s

After
  git revert 8806e4e8deb1e21715e031e17181d904825a410e abe56227b2e213755dd3e194c530f5f06467bd7c
172259afb563d35001410dc6daad78b250924038

pg_upgrade/002_pg_upgrade    75.09s

Slower by 29%, far from the 3s increased time I saw mentioned somewhere.


And this really affects the overall test time:

All tests before:
    real    1m38.173s
    user    5m52.500s
    sys    4m23.574s

All tests after:
    real    2m0.397s
    user    5m53.625s
    sys    4m30.518s

The CPU time increase is rather minimal, but the wall clock time increase is
22%.

17:
    real    1m14.822s
    user    4m2.630s
    sys    3m22.384s

We regressed wall clock time *60%* from 17->18. Some test cycle increase is
reasonable and can largely be compensated with hardware, but this cycle we're
growing way faster than hardware gets faster.  I don't think that's
sustainable.


On valgrind it's also is very much true that 002_pg_upgrade is slower:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2025-04-04%2006%3A53%3A47&stg=recovery-check

276/333 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade                               OK             10337.79s   22
subtestspassed
 
233/333 postgresql:recovery / recovery/027_stream_regress                               OK              7642.90s   9
subtestspassed
 


To look at the times in a bit more detail:

[09:19:31.371](6146.167s) ok 5 - regression tests pass
...
# Running: pg_dump -Fd -j2 --no-sync -d port=31947 host=/tmp/HlLiWBwv6T dbname='regression' --create -f
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/tmp_test_lFqZ/regression.dump
[09:26:35.178](201.199s) ok 6 - pg_dump on source instance
# Running: pg_restore --create -j2 -d postgres
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/tmp_test_lFqZ/regression.dump
[09:35:53.079](557.901s) ok 7 - pg_restore to destination instance
# Running: pg_dump --no-sync -d port=31947 host=/tmp/HlLiWBwv6T dbname='regression' -f
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/tmp_test_lFqZ/src_dump.sql
# Running: pg_dump --no-sync -d port=31949 host=/tmp/HlLiWBwv6T dbname='regression' -f
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/tmp_test_lFqZ/dest_dump.sql
[09:43:18.485](445.407s) ok 8 - dump outputs from original and restored regression databases match
# Running: pg_dumpall --no-sync --dbname port=31947 host=/tmp/HlLiWBwv6T dbname='postgres' --file
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/tmp_test_lFqZ/dump1.sql
[09:43:45.895](27.410s) ok 9 - dump before running pg_upgrade

I think the time for "pg_dump on source instance" is somewhat misleading, it
includes initdb and starting the server.

But it's pretty obvious that the newly added steps cost quite a bit of time.


> > I suspect that the test will go a bit faster if log_statement weren't forced
> > on, printing that many log lines, with context, does make valgrind slower,
> > IME. But Cluster.pm forces it to on, and I suspect that putting a global
> > log_statement=false into TEMP_CONFIG would have it's own disadvantages.
>
> I'm sure we can make this change as well somehow, overridding the
> setting just 002_pg_upgrade.pl, as attached.  I don't think it's
> relevant for this particular test.  The log files go from 21 MB to
> 2.4 MB.  It's not nothing ...

That is a nice improvement. I have to run a few errands, can check how that
affects valgrind times of a dump and restore of the regression db.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Replace IN VALUES with ANY in WHERE clauses during optimization
Next
From: Peter Geoghegan
Date:
Subject: Re: Replace IN VALUES with ANY in WHERE clauses during optimization