Re: standby recovery fails (tablespace related) (tentative patch and discussion) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Date
Msg-id CA+hUKGKU3p1=eSE7gQHaH=GxPLsVYCRnDdUaHcu_aV2r4GwdPA@mail.gmail.com
Whole thread Raw
In response to Re: standby recovery fails (tablespace related) (tentative patch and discussion)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: standby recovery fails (tablespace related) (tentative patch and discussion)
List pgsql-hackers
On Sun, Jul 31, 2022 at 4:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, quite aside from stability, is it really necessary for this test to
> be so freakin' slow?  florican for instance reports
>
> [12:43:38] t/025_stuck_on_old_timeline.pl ....... ok    49010 ms ( 0.00 usr  0.00 sys +  3.64 cusr  2.49 csys =  6.13
CPU)
> [12:44:12] t/026_overwrite_contrecord.pl ........ ok    34751 ms ( 0.01 usr  0.00 sys +  3.14 cusr  1.76 csys =  4.91
CPU)
> [12:49:00] t/027_stream_regress.pl .............. ok   287278 ms ( 0.00 usr  0.00 sys +  9.66 cusr  6.95 csys = 16.60
CPU)
> [12:50:04] t/028_pitr_timelines.pl .............. ok    64543 ms ( 0.00 usr  0.00 sys +  3.59 cusr  3.20 csys =  6.78
CPU)
> [12:50:17] t/029_stats_restart.pl ............... ok    12505 ms ( 0.02 usr  0.00 sys +  3.16 cusr  1.40 csys =  4.57
CPU)
> [12:50:51] t/030_stats_cleanup_replica.pl ....... ok    33933 ms ( 0.01 usr  0.01 sys +  3.55 cusr  2.46 csys =  6.03
CPU)
> [12:51:25] t/031_recovery_conflict.pl ........... ok    34249 ms ( 0.00 usr  0.00 sys +  3.37 cusr  2.20 csys =  5.57
CPU)
> [12:52:09] t/032_relfilenode_reuse.pl ........... ok    44274 ms ( 0.01 usr  0.00 sys +  3.21 cusr  2.05 csys =  5.27
CPU)
> [12:54:07] t/033_replay_tsp_drops.pl ............ ok   117840 ms ( 0.01 usr  0.00 sys +  8.72 cusr  5.41 csys = 14.14
CPU)
>
> 027 is so bloated because it runs the core regression tests YA time,
> which I'm not very happy about either; but that's no excuse for
> every new test to contribute an additional couple of minutes.

Complaints about 027 noted, I'm thinking about what we could do about that.

As for 033, I worried that it might be the new ProcSignalBarrier stuff
around tablespaces, but thankfully the DEBUG logging I added there
recently shows those all completing in single digit milliseconds.  I
also confirmed there are no unexpected fsync'd being produced here.

That is quite a lot of CPU, but it's a huge amount of total runtime.
It runs in 5-8 seconds on various modern systems, 19 seconds on my
Linux RPi4, and 50 seconds on my Celeron-powered NAS box with spinning
disks.

I noticed this is a 32 bit FBSD system.  Is it running on UFS, perhaps
on slow storage?  Are soft updates enabled (visible as options in
output of "mount")?  Without soft updates, a lot more file system ops
perform synchronous I/O, which really slows down our tests.  In
general, UFS isn't as good as modern file systems at avoiding I/O for
short-lived files, and we set up and tear down a lot of them in our
testing.  Another thing that makes a difference is to use a filesystem
with 8KB block size.  This has been a subject of investigation for
speeding up CI (see src/tools/ci/gcp_freebsd_repartition.sh), but
several mysteries remain unsolved...



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Reducing the maintenance overhead of test_oat_hooks
Next
From: Tom Lane
Date:
Subject: Regression coverage for contrib/tcn