Re: Improve error reporting in 027_stream_regress test - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Improve error reporting in 027_stream_regress test
Date
Msg-id 869277.1753793389@sss.pgh.pa.us
Whole thread Raw
In response to Re: Improve error reporting in 027_stream_regress test  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Jul 29, 2025 at 12:41:39AM -0400, Tom Lane wrote:
>> There is something strange happening on mamba --- not sure what,
>> but its cycle time for the past week has been a lot more than normal.

> We are getting PQPING_NO_RESPONSE meaning a lack of report activity
> from the postmaster.  An increase in timeout may help, but the host
> seems like it's facing a high workload so it's not really possible to
> come with a perfect number, just an estimation.  How about adding a
> --timeout to pg_isready based on PGCONNECT_TIMEOUT, like in the
> attached?

Yeah, looking closely at the failure logs [1], the standby postmaster
reported pg_isready's connection request in good order at about
08:01:47, but then nothing happened for a few seconds:

2025-07-28 08:01:47.884 EDT [6985:1] [unknown] LOG:  connection received: host=[local]
2025-07-28 08:01:48.261 EDT [6985:2] [unknown] LOG:  connection authenticated: user="buildfarm" method=trust
(/home/buildfarm/bf-data/HEAD/pgsql.build/src/test/recovery/tmp_check/t_027_stream_regress_standby_1_data/pgdata/pg_hba.conf:117)
2025-07-28 08:01:48.261 EDT [6985:3] [unknown] LOG:  connection authorized: user=buildfarm database=postgres
application_name=027_stream_regress.pl
2025-07-28 08:01:51.552 EDT [6985:4] 027_stream_regress.pl LOG:  could not send data to client: Broken pipe
2025-07-28 08:01:51.552 EDT [6985:5] 027_stream_regress.pl FATAL:  connection to client lost
2025-07-28 08:01:51.552 EDT [6985:6] 027_stream_regress.pl LOG:  disconnection: session time: 0:00:03.670
user=buildfarmdatabase=postgres host=[local] 

The client side reported failure at 08:01:50 and a bit, matching
pg_isready's default 3-second timeout:

# Running: pg_isready --host /home/buildfarm/bf-data/tmp/ep8AOH4m7l --port 24780
/home/buildfarm/bf-data/tmp/ep8AOH4m7l:24780 - accepting connections
# Running: pg_isready --host /home/buildfarm/bf-data/tmp/ep8AOH4m7l --port 24781
/home/buildfarm/bf-data/tmp/ep8AOH4m7l:24781 - no response
[08:01:50.899](1505.313s) ok 2 - regression tests pass
[08:01:50.902](0.003s) ok 3 - primary alive after regression test run
[08:01:50.905](0.003s) not ok 4 - standby alive after regression test run
[08:01:50.908](0.003s)
[08:01:50.908](0.000s) #   Failed test 'standby alive after regression test run'
#   at t/027_stream_regress.pl line 104.
[08:01:50.909](0.001s) #          got: '0'
#     expected: '1'

So I agree that this looks a lot like pg_isready timed out.  It's not
too surprising if the machine was heavily loaded at that point, since
the standby would probably still be replaying the last actions of the
regression tests.

+1 for lengthening the timeout, as I see you already did.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-07-28%2007%3A46%3A26



pgsql-hackers by date:

Previous
From: "Euler Taveira"
Date:
Subject: Re: refactor backend type lists
Next
From: "Euler Taveira"
Date:
Subject: Re: event trigger support for PL/Python