
From: Andres Freund
Subject: Re: Recent 027_streaming_regress.pl hangs
Msg-id: 20240321004145.obrbakt4ugzcsphy@awork3.anarazel.de
In response to: Re: Recent 027_streaming_regress.pl hangs (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Hi,

On 2024-03-14 16:56:39 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> >> Could it be that the timeout (360 sec?) is just not enough for the test
> >> under the current (changed due to switch to meson) conditions?
> 
> > But you're right that under meson the test takes a lot longer, I guess
> > due to increased concurrency:
> 
> What it seems to be is highly variable.  Looking at calliphoridae's
> last half dozen successful runs, I see reported times for
> 027_stream_regress anywhere from 183 to 704 seconds.  I wonder what
> else is happening on that machine.

There are a lot of other animals on the same machine; however, it's rarely fully
loaded (in either CPU or IO).

I don't think the test just being slow is the issue here; e.g., in the last
failing iteration
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-20%2022%3A03%3A15

the tests completed:

2024-03-20 22:07:50.239 UTC [3937667][client backend][22/3255:0] LOG:  statement: DROP ROLE regress_tablespace_user2;
2024-03-20 22:07:50.251 UTC [3937667][client backend][:0] LOG:  disconnection: session time: 0:00:12.431 user=bf database=regression host=[local]

and yet we waited for replication to catch up for quite a while:

2024-03-20 22:14:01.904 UTC [56343][client backend][6/1925:0] LOG:  connection authorized: user=bf database=postgres application_name=027_stream_regress.pl
2024-03-20 22:14:01.930 UTC [56343][client backend][6/1926:0] LOG:  statement: SELECT '0/15BA21B0' <= replay_lsn AND state = 'streaming'
             FROM pg_catalog.pg_stat_replication
             WHERE application_name IN ('standby_1', 'walreceiver')
2024-03-20 22:14:01.958 UTC [56343][client backend][:0] LOG:  disconnection: session time: 0:00:00.063 user=bf database=postgres host=[local]
2024-03-20 22:14:02.083 UTC [3729516][postmaster][:0] LOG:  received immediate shutdown request
2024-03-20 22:14:04.970 UTC [3729516][postmaster][:0] LOG:  database system is shut down

There was no activity for about six minutes.
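As an aside, the remaining lag behind a target LSN can be inspected directly
rather than just polling the boolean query above; a minimal sketch, assuming
PostgreSQL::Test::Cluster's safe_psql() and illustrative $node / $target_lsn
variables:

# Hedged sketch: report how far the standby's replay position lags a
# target LSN, in bytes, using pg_wal_lsn_diff().
my $lag = $node->safe_psql('postgres', qq[
	SELECT application_name,
	       pg_wal_lsn_diff('$target_lsn', replay_lsn) AS replay_lag_bytes
	FROM pg_catalog.pg_stat_replication
	WHERE application_name IN ('standby_1', 'walreceiver')]);
print "replication lag: $lag\n";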

System statistics show relatively low CPU and IO load for the period from
22:00 to 22:10.


I suspect we have some more fundamental instability on our hands; there have
been failures like this going back a while, and on various machines.



I think at the very least we should make Cluster.pm's wait_for_catchup() print
some information when it times out - right now it's nigh on undebuggable,
because we don't even log what we were waiting for and what the actual
replication position was.
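A minimal sketch of what that could look like, built on the existing
poll_query_until()/safe_psql() helpers (Carp's croak is already in scope in
Cluster.pm); the query mirrors the one from the log above, and $target_lsn,
$mode, and $standby_name are illustrative:

# Hedged sketch: on timeout, dump both what we were waiting for and the
# actual replication state, instead of failing without context.
my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
	FROM pg_catalog.pg_stat_replication
	WHERE application_name IN ('$standby_name', 'walreceiver')];
if (!$self->poll_query_until('postgres', $query))
{
	my $status = $self->safe_psql('postgres',
		'SELECT application_name, state, sent_lsn, write_lsn,'
		  . ' flush_lsn, replay_lsn FROM pg_catalog.pg_stat_replication');
	croak "timed out waiting for catchup; waited for: $query\n"
	  . "current pg_stat_replication:\n$status";
}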



> Also, this is probably not
> helping anything:
> 
>                    'extra_config' => {
>                                                       ...
>                                                       'fsync = on'

At some point we had practically no test coverage of fsync, so I made my
animals use fsync. I think we still have little coverage.  I could probably
reduce the number of animals using it, though.
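For reference, that is just a postgresql.conf line injected via the buildfarm
client's extra_config; a minimal build-farm.conf fragment, assuming the usual
layout (branch name, or DEFAULT for all branches, mapping to an arrayref of
config lines):

# Hedged sketch of a build-farm.conf fragment; DEFAULT applies to all
# branches, and each line is appended to the test cluster's postgresql.conf.
extra_config => {
	DEFAULT => [
		'fsync = on',    # exercise fsync code paths during tests
	],
},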

Greetings,

Andres Freund


