Re: Flaky 003_start_stop.pl test - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Flaky 003_start_stop.pl test
Date
Msg-id rm3wawhaebopcncwwadnenypl7mwwqacoqj2u35qi36d5zcms2@wbbcb6t6ukf4
Whole thread Raw
In response to Re: Flaky 003_start_stop.pl test  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
Hi,

On 2025-03-04 09:48:45 +1300, Thomas Munro wrote:
> On Tue, Mar 4, 2025 at 5:48 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-02-24 11:26:56 +0100, Jelte Fennema-Nio wrote:
> > > [1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256
> >
> > A possibly related test failure is:
> >
> > https://cirrus-ci.com/task/5005337046024192?logs=check_world#L259
> 
> They look different to me:

Thanks for checking!


> (1) In Jelte's report for 003_start_stop, we see:
> 
> [03:40:42.347](2.022s) ok 1 - dead-end connection 0
> [03:40:42.348](0.001s) ok 2 - dead-end connection 1
> [03:40:42.349](0.001s) ok 3 - dead-end connection 2
> ... up to 20 ...
> 
> We see in the server log that the first one had pid 31660 and slot 1,
> but it never said "SSLRequest rejected" like all the others.  The Perl
> code must have received "N", or it would not have moved onto opening
> the next connection.  Then just as it is about to try to connect,
> hoping to see "too many clients", that one exits, the "invalid_role"
> connection gets slot 1 instead of the dead-end backend we expected:
> 
> 03:40:42.375 UTC [31645][postmaster] DEBUG:  client backend (PID
> 31660) exited with exit code 0 <-- goodbye, silent backend
> 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
> connection received: host=[local]
> 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
> SSLRequest rejected
> 03:40:42.511 UTC [31645][postmaster] DEBUG:  assigned pm child slot 1
> for backend <-- we get silent backend's slot
> 03:40:42.512 UTC [31645][postmaster] DEBUG:  forked new client
> backend, pid=31690 socket=7
> 03:40:42.542 UTC [31690][not initialized] [[unknown]][:0] LOG:
> connection received: host=[local]
> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
> connection authenticated: user="invalid_user" method=trust
>
(/tmp/cirrus-ci-build/build/testrun/postmaster/003_start_stop/data/t_003_start_stop_main_data/pgdata/pg_hba.conf:117)
> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
> connection authorized: user=invalid_user database=postgres
> application_name=003_start_stop.pl
> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] FATAL:
> role "invalid_user" does not exist

Good catch.

The pattern is similar in other failed tests. There has been a spate of errors
on postgres/postgres' CI, which made me look at this again.


> Could the backend have sent an "N" (NoticeResponse) unrelated to SSL
> negotiation?

I don't think that'd be legal protocol flow from the server side. And, given
that the test uses log_min_messages=debug2, should have been logged in the
server log.

All the recent failures were on FreeBSD, interestingly.


I wonder if we should, at least temporarily, increase the test's log level?

I also suspect that it'd be worthwhile to add a few more if
(Trace_connection_negotiation) messages.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: "Euler Taveira"
Date:
Subject: Re: log_min_messages per backend type
Next
From: Jacob Champion
Date:
Subject: Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible