Re: Recent 027_streaming_regress.pl hangs - Mailing list pgsql-hackers
From | Andrew Dunstan |
---|---|
Subject | Re: Recent 027_streaming_regress.pl hangs |
Date | |
Msg-id | f61976a9-b045-42c0-9a5a-c142c2142641@dunslane.net Whole thread Raw |
In response to | Re: Recent 027_streaming_regress.pl hangs (Andrew Dunstan <andrew@dunslane.net>) |
Responses |
Re: Recent 027_streaming_regress.pl hangs
|
List | pgsql-hackers |
On 2024-07-25 Th 5:14 PM, Tom Lane wrote:I wrote:I'm confused by crake's buildfarm logs. AFAICS it is not running recovery-check at all in most of the runs; at least there is no mention of that step, for example here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-25%2013%3A27%3A02Oh, I see it: the log file that is called recovery-check in a failing run is called misc-check if successful. That seems mighty bizarre, and it's not how my own animals behave. Something weird about the meson code path, perhaps?
Yes, it was discussed some time ago. As suggested by Andres, we run the meson test suite more or less all together (in effect like "make checkworld" but without the main regression suite, which is run on its own first), rather than in the separate (and serialized) way we do with the configure/make tests. That results in significant speedup. If the tests fail we report the failure as happening at the first failure we encounter, which is possibly less than ideal, but I haven't got a better idea.
Anyway, in this successful run: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2018%3A57%3A02&stg=misc-check here are some salient test timings: 1/297 postgresql:pg_upgrade / pg_upgrade/001_basic OK 0.18s 9 subtests passed 2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots OK 15.95s 12 subtests passed 3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription OK 16.29s 14 subtests passed 17/297 postgresql:isolation / isolation/isolation OK 71.60s 119 subtests passed 41/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade OK 169.13s 18 subtests passed 140/297 postgresql:initdb / initdb/001_initdb OK 41.34s 52 subtests passed 170/297 postgresql:recovery / recovery/027_stream_regress OK 469.49s 9 subtests passed while in the next, failing run https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2020%3A18%3A05&stg=recovery-check the same tests took: 1/297 postgresql:pg_upgrade / pg_upgrade/001_basic OK 0.22s 9 subtests passed 2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots OK 56.62s 12 subtests passed 3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription OK 71.92s 14 subtests passed 21/297 postgresql:isolation / isolation/isolation OK 299.12s 119 subtests passed 31/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade OK 344.42s 18 subtests passed 159/297 postgresql:initdb / initdb/001_initdb OK 344.46s 52 subtests passed 162/297 postgresql:recovery / recovery/027_stream_regress ERROR 840.84s exit status 29 Based on this, it seems fairly likely that crake is simply timing out as a consequence of intermittent heavy background activity.
The latest failure is this:
Waiting for replication conn standby_1's replay_lsn to pass 2/88E4E260 on primary [16:40:29.481](208.545s) # poll_query_until timed out executing this query: # SELECT '2/88E4E260' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('standby_1', 'walreceiver') # expecting this output: # t # last actual query output: # f # with stderr: timed out waiting for catchup at /home/andrew/bf/root/HEAD/pgsql/src/test/recovery/t/027_stream_regress.pl line 103. Maybe it's a case where the system is overloaded, I dunno. I wouldn't bet my house on it. Pretty much nothing else runs on this machine.I have added a mild throttle to the buildfarm config so it doesn't try to run every branch at once. Maybe I also need to bring down the number or meson jobs too? But I suspect there's something deeper. Prior to the failure of this test 10 days ago it hadn't failed in a very long time. The OS was upgraded a month ago. Eight or so days ago I changed PG_TEST_EXTRA. I can't think of anything else.
There seem to be a bunch of recent failures, and not just on crake, and not just on HEAD: <https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=14&member=&stage=recoveryCheck&filter=Submit>
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
pgsql-hackers by date: