Re: Recent 027_streaming_regress.pl hangs - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Recent 027_streaming_regress.pl hangs
Date:
Msg-id: 20240321004145.obrbakt4ugzcsphy@awork3.anarazel.de
In response to: Re: Recent 027_streaming_regress.pl hangs (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Hi,

On 2024-03-14 16:56:39 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> >> Could it be that the timeout (360 sec?) is just not enough for the test
> >> under the current (changed due to switch to meson) conditions?
>
> > But you're right that under meson the test takes a lot longer, I guess
> > due to increased concurrency:
>
> What it seems to be is highly variable.  Looking at calliphoridae's
> last half dozen successful runs, I see reported times for
> 027_stream_regress anywhere from 183 to 704 seconds.  I wonder what
> else is happening on that machine.

There are a lot of other animals on the same machine, but it's rarely fully loaded (with either CPU or IO).

I don't think the test just being slow is the issue here. E.g., in the last failing iteration

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-20%2022%3A03%3A15

the tests completed at:

2024-03-20 22:07:50.239 UTC [3937667][client backend][22/3255:0] LOG: statement: DROP ROLE regress_tablespace_user2;
2024-03-20 22:07:50.251 UTC [3937667][client backend][:0] LOG: disconnection: session time: 0:00:12.431 user=bf database=regression host=[local]

and we waited to replicate for quite a while:

2024-03-20 22:14:01.904 UTC [56343][client backend][6/1925:0] LOG: connection authorized: user=bf database=postgres application_name=027_stream_regress.pl
2024-03-20 22:14:01.930 UTC [56343][client backend][6/1926:0] LOG: statement: SELECT '0/15BA21B0' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name IN ('standby_1', 'walreceiver')
2024-03-20 22:14:01.958 UTC [56343][client backend][:0] LOG: disconnection: session time: 0:00:00.063 user=bf database=postgres host=[local]
2024-03-20 22:14:02.083 UTC [3729516][postmaster][:0] LOG: received immediate shutdown request
2024-03-20 22:14:04.970 UTC [3729516][postmaster][:0] LOG: database system is shut down

There was no activity for 7 minutes. System statistics show relatively low CPU and IO load for the period from 22:00 to 22:10. I suspect we have some more fundamental instability on our hands; there have been failures like this going back a while, and on various machines.

I think at the very least we should make Cluster.pm's wait_for_catchup() print some information when it times out - right now it's nigh on undebuggable, because we don't even log what we were waiting for and what the actual replication position was. (A sketch of what that could look like follows below.)

> Also, this is probably not
> helping anything:
>
>     'extra_config' => {
>     ...
>         'fsync = on'

At some point we had practically no test coverage of fsync, so I made my animals use fsync. I think we still have little coverage. I could probably reduce the number of animals using it, though.

Greetings,

Andres Freund
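[Editor's illustration] A minimal sketch of the kind of timeout diagnostics proposed above for wait_for_catchup(), written against PostgreSQL::Test::Cluster's poll_query_until() and safe_psql(); the variable names and exact query shape are assumptions for illustration, not the actual patch:

    # Inside a hypothetical wait_for_catchup($standby_name, $mode, $target_lsn):
    # build the catchup condition, mirroring the query seen in the logs above.
    my $query =
        "SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming' "
      . "FROM pg_catalog.pg_stat_replication "
      . "WHERE application_name = '$standby_name'";

    if (!$self->poll_query_until('postgres', $query))
    {
        # On timeout, report what we were waiting for and where
        # replication actually stood, instead of failing silently.
        my $state = $self->safe_psql('postgres',
            "SELECT application_name, state, sent_lsn, write_lsn, "
          . "flush_lsn, replay_lsn FROM pg_catalog.pg_stat_replication");
        die "timed out waiting for ${mode}_lsn >= $target_lsn on "
          . "'$standby_name'; pg_stat_replication reported:\n$state";
    }

With something like this, a buildfarm failure log would show both the target LSN and the standby's last reported positions, rather than just a timeout.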