Re: Recent 027_streaming_regress.pl hangs - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: Recent 027_streaming_regress.pl hangs
Date
Msg-id f61976a9-b045-42c0-9a5a-c142c2142641@dunslane.net
In response to Re: Recent 027_streaming_regress.pl hangs  (Andrew Dunstan <andrew@dunslane.net>)
Responses Re: Recent 027_streaming_regress.pl hangs
List pgsql-hackers


On 2024-07-25 Th 6:33 PM, Andrew Dunstan wrote:


On 2024-07-25 Th 5:14 PM, Tom Lane wrote:
I wrote:
I'm confused by crake's buildfarm logs.  AFAICS it is not running
recovery-check at all in most of the runs; at least there is no
mention of that step, for example here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-25%2013%3A27%3A02
Oh, I see it: the log file that is called recovery-check in a
failing run is called misc-check if successful.  That seems
mighty bizarre, and it's not how my own animals behave.
Something weird about the meson code path, perhaps?


Yes, it was discussed some time ago. As suggested by Andres, we run the meson test suites more or less all together (in effect like "make check-world", but without the main regression suite, which is run on its own first), rather than in the separate (and serialized) way we do with the configure/make tests. That results in a significant speedup. If the tests fail, we report the failure as happening at the first failure we encounter, which is possibly less than ideal, but I haven't got a better idea.
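
To make the ordering concrete, here is a rough sketch of the flow described above. It is not the actual buildfarm client code; the suite name "regress" and the build directory "build" are assumptions used purely for illustration.

use strict;
use warnings;

# Step 1: the main regression suite runs on its own first.
system('meson', 'test', '-C', 'build', '--suite', 'regress') == 0
    or die "main regression suite failed\n";

# Step 2: everything else runs together, in parallel -- in effect like
# "make check-world" minus the core regression tests.  On failure, the
# report names the first failing test encountered.
system('meson', 'test', '-C', 'build', '--no-suite', 'regress') == 0
    or die "parallel test suites (misc-check) failed\n";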


Anyway, in this successful run:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2018%3A57%3A02&stg=misc-check

here are some salient test timings:
  1/297 postgresql:pg_upgrade / pg_upgrade/001_basic                                    OK                0.18s   9 subtests passed
  2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots                            OK               15.95s   12 subtests passed
  3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription                             OK               16.29s   14 subtests passed
 17/297 postgresql:isolation / isolation/isolation                                      OK               71.60s   119 subtests passed
 41/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade                               OK              169.13s   18 subtests passed
140/297 postgresql:initdb / initdb/001_initdb                                           OK               41.34s   52 subtests passed
170/297 postgresql:recovery / recovery/027_stream_regress                               OK              469.49s   9 subtests passed

while in the next, failing run

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2020%3A18%3A05&stg=recovery-check

the same tests took:
  1/297 postgresql:pg_upgrade / pg_upgrade/001_basic                                    OK                0.22s   9 subtests passed
  2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots                            OK               56.62s   12 subtests passed
  3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription                             OK               71.92s   14 subtests passed
 21/297 postgresql:isolation / isolation/isolation                                      OK              299.12s   119 subtests passed
 31/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade                               OK              344.42s   18 subtests passed
159/297 postgresql:initdb / initdb/001_initdb                                           OK              344.46s   52 subtests passed
162/297 postgresql:recovery / recovery/027_stream_regress                               ERROR           840.84s   exit status 29

Based on this, it seems fairly likely that crake is simply timing out
as a consequence of intermittent heavy background activity.



The latest failure is this:


Waiting for replication conn standby_1's replay_lsn to pass 2/88E4E260 on primary
[16:40:29.481](208.545s) # poll_query_until timed out executing this query:
# SELECT '2/88E4E260' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
timed out waiting for catchup at /home/andrew/bf/root/HEAD/pgsql/src/test/recovery/t/027_stream_regress.pl line 103.
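
For anyone not staring at the test itself: what times out there is essentially the following wait. This is a simplified sketch of the wait_for_catchup/poll_query_until logic used around line 103 of 027_stream_regress.pl, not the literal test code; the node setup shown here and the 'write' LSN mode are assumptions for illustration.

use strict;
use warnings;
use PostgreSQL::Test::Cluster;

# Assumed setup for illustration; the real test builds these nodes earlier
# and runs the core regression suite against the primary before this wait.
my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(allows_streaming => 1);
$node_primary->start;
$node_primary->backup('my_backup');
my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
$node_standby_1->init_from_backup($node_primary, 'my_backup', has_streaming => 1);
$node_standby_1->start;

# The wait that fails above: poll pg_stat_replication on the primary until
# the standby's replay_lsn has passed the target LSN, giving up after the
# poll_query_until timeout (controlled by PG_TEST_TIMEOUT_DEFAULT).
my $target_lsn = $node_primary->lsn('write');    # LSN mode is an assumption
$node_primary->poll_query_until('postgres', qq{
    SELECT '$target_lsn' <= replay_lsn AND state = 'streaming'
    FROM pg_catalog.pg_stat_replication
    WHERE application_name IN ('standby_1', 'walreceiver')
}) or die "timed out waiting for catchup";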


Maybe it's a case where the system is overloaded, I dunno, but I wouldn't bet my house on it: pretty much nothing else runs on this machine.

I have added a mild throttle to the buildfarm config so it doesn't try to run every branch at once. Maybe I need to bring down the number of meson jobs as well? But I suspect there's something deeper. Prior to the failure of this test 10 days ago, it hadn't failed in a very long time. The OS was upgraded a month ago. Eight or so days ago I changed PG_TEST_EXTRA. I can't think of anything else.
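
For what it's worth, one way to experiment with lower test concurrency without touching the client config (a sketch only, not the buildfarm's actual knob) is the MESON_TESTTHREADS environment variable, which meson test honors when --num-processes isn't given; the value 4 and the 'build' directory here are arbitrary.

use strict;
use warnings;

# Cap meson's test-level parallelism via the environment; 4 is just an
# example value and 'build' is an assumed build directory.
$ENV{MESON_TESTTHREADS} = 4;
system('meson', 'test', '-C', 'build') == 0
    or warn "meson test reported failures\n";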





There seem to be a bunch of recent failures, and not just on crake, and not just on HEAD: <https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=14&member=&stage=recoveryCheck&filter=Submit>


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
