Re: A test for replay of regression tests - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: A test for replay of regression tests
Date
Msg-id 20220804011300.GJ19644@telsasoft.com
Whole thread Raw
In response to Re: A test for replay of regression tests  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Thu, Aug 04, 2022 at 09:24:24AM +1200, Thomas Munro wrote:
> On Thu, Aug 4, 2022 at 3:30 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Thu, Dec 09, 2021 at 12:10:23PM +1300, Thomas Munro wrote:
> > > This adds 2 whole minutes to the recovery check, when running with the
> > > Windows serial-only scripts on Cirrus CI (using Andres's CI patches).
> > > For Linux it adds ~20 seconds to the total of -j8 check-world.
> > > Hopefully that's time well spent, because it adds test coverage for
> > > all the redo routines, and hopefully soon we won't have to run 'em in
> > > series on Windows.
> >
> > Should 027-stream-regress be renamed to something that starts earlier ?
> > Off-list earlier this year, Andres referred to 000.

See also: 
https://www.postgresql.org/message-id/20220213220709.vjz5rziuhfdpqxrg@alap3.anarazel.de

> Do you have any data on improved times from doing that?
> 
> I have wondered about moving it into 001_stream_rep.pl.

The immediate motive for raising the question is due to working on your cygwin
patch (where I've set PROVE_FLAGS=-j3).  The last invocation I have opened ends
like:

[20:46:47.577] [13:46:47] t/026_overwrite_contrecord.pl ........ ok    10264 ms ( 0.02 usr  0.02 sys + 11.25 cusr 35.95
csys= 47.24 CPU)
 
[20:47:08.087] [13:47:08] t/028_pitr_timelines.pl .............. ok    13153 ms ( 0.00 usr  0.00 sys +  4.03 cusr 14.79
csys= 18.82 CPU)
 
[20:47:08.999] [13:47:09] t/029_stats_restart.pl ............... ok    12631 ms ( 0.00 usr  0.02 sys +  7.40 cusr 23.30
csys= 30.71 CPU)
 
[20:47:34.353] [13:47:34] t/031_recovery_conflict.pl ........... ok    11337 ms ( 0.00 usr  0.00 sys +  3.84 cusr 11.82
csys= 15.66 CPU)
 
[20:47:35.070] [13:47:35] t/030_stats_cleanup_replica.pl ....... ok    14054 ms ( 0.02 usr  0.00 sys +  7.64 cusr 25.02
csys= 32.68 CPU)
 
[20:48:04.887] [13:48:04] t/032_relfilenode_reuse.pl ........... ok    12755 ms ( 0.00 usr  0.00 sys +  3.36 cusr 11.57
csys= 14.93 CPU)
 
[20:48:42.055] [13:48:42] t/033_replay_tsp_drops.pl ............ ok    43529 ms ( 0.00 usr  0.00 sys + 12.29 cusr 41.43
csys= 53.71 CPU)
 
[20:50:02.770] [13:50:02] t/027_stream_regress.pl .............. ok   198408 ms ( 0.02 usr  0.06 sys + 44.92 cusr
142.42csys = 187.42 CPU)
 
[20:50:02.771] [13:50:02]
[20:50:02.771] All tests successful.
[20:50:02.771] Files=33, Tests=411, 402 wallclock secs ( 0.16 usr  0.27 sys + 138.03 cusr 441.56 csys = 580.01 CPU)

If 027 had been started sooner, this test might have finished up to 78sec
earlier.  If lots of tests are added in the future, maybe it won't matter, but
it seems like it does now.

As I understand, checks are usually parallelized by "make -j" and not by
"prove".  In that case, starting a slow test later doesn't matter.  But it'd be
better for anyone who runs tap tests manually, and (I think) for meson.

As a one-off test on localhost:
time make check -C src/test/recovery
=> 11m42,790s
time make check -C src/test/recovery PROVE_FLAGS=-j2
=> 7m56,315s

After renaming it to 001:
time make check -C src/test/recovery
=> 11m33,887s (~same)
time make check -C src/test/recovery PROVE_FLAGS=-j2
=> 6m59,969s

I don't know how it affect the buildfarm (but I think that's not optimized
primarily for speed anyway).

-- 
Justin



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [PATCH] postgresql.conf.sample comment alignment.
Next
From: Andres Freund
Date:
Subject: Re: Cleaning up historical portability baggage