Re: Flaky 003_start_stop.pl test - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Flaky 003_start_stop.pl test
Date
Msg-id 77d310f3-1ddd-4b2f-829a-fe5b8a14c6de@iki.fi
Whole thread Raw
In response to Re: Flaky 003_start_stop.pl test  (Andres Freund <andres@anarazel.de>)
Responses Re: Flaky 003_start_stop.pl test
List pgsql-hackers
Thanks for the analysis!

On 05/03/2025 02:41, Andres Freund wrote:
> On 2025-03-04 09:48:45 +1300, Thomas Munro wrote:
>> On Tue, Mar 4, 2025 at 5:48 AM Andres Freund <andres@anarazel.de> wrote:
>>> On 2025-02-24 11:26:56 +0100, Jelte Fennema-Nio wrote:
>>>> [1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256
>>>
>>> A possibly related test failure is:
>>>
>>> https://cirrus-ci.com/task/5005337046024192?logs=check_world#L259
>>
>> They look different to me:
> 
> Thanks for checking!
> 
> 
>> (1) In Jelte's report for 003_start_stop, we see:
>>
>> [03:40:42.347](2.022s) ok 1 - dead-end connection 0
>> [03:40:42.348](0.001s) ok 2 - dead-end connection 1
>> [03:40:42.349](0.001s) ok 3 - dead-end connection 2
>> ... up to 20 ...
>>
>> We see in the server log that the first one had pid 31660 and slot 1,
>> but it never said "SSLRequest rejected" like all the others.  The Perl
>> code must have received "N", or it would not have moved onto opening
>> the next connection.  Then just as it is about to try to connect,
>> hoping to see "too many clients", that one exits, the "invalid_role"
>> connection gets slot 1 instead of the dead-end backend we expected:
>>
>> 03:40:42.375 UTC [31645][postmaster] DEBUG:  client backend (PID
>> 31660) exited with exit code 0 <-- goodbye, silent backend
>> 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
>> connection received: host=[local]
>> 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
>> SSLRequest rejected
>> 03:40:42.511 UTC [31645][postmaster] DEBUG:  assigned pm child slot 1
>> for backend <-- we get silent backend's slot
>> 03:40:42.512 UTC [31645][postmaster] DEBUG:  forked new client
>> backend, pid=31690 socket=7
>> 03:40:42.542 UTC [31690][not initialized] [[unknown]][:0] LOG:
>> connection received: host=[local]
>> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
>> connection authenticated: user="invalid_user" method=trust
>>
(/tmp/cirrus-ci-build/build/testrun/postmaster/003_start_stop/data/t_003_start_stop_main_data/pgdata/pg_hba.conf:117)
>> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
>> connection authorized: user=invalid_user database=postgres
>> application_name=003_start_stop.pl
>> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] FATAL:
>> role "invalid_user" does not exist
> 
> Good catch.
> 
> The pattern is similar in other failed tests. There has been a spate of errors
> on postgres/postgres' CI, which made me look at this again.

I see what's going on here. Before the loop that opens all the 
connections with the SSLRequests, the test does this:

> if (!$node->raw_connect_works())
> {
>     plan skip_all => "this test requires working raw_connect()";
> }

That's the connection you see in the log without the SSLRequest.

When the test fails, what happens is that the backend handling that 
connection is slow. It lingers around, even though the perl client has 
already closed the socket. The backend finally exits _just_ when the 
test has opened all the "dead end" connections, and checks that the 
final connection fails with the "FATAL:  sorry, too many clients 
already" error. Because the backend exits at just the right moment, it 
frees up a connection slot which is then used by that final connection. 
The test fails, because it expects all the connection slots to be in use 
and for that final connection to launch a dead-end backend.

The 002_connection_limits.pl test has potential for a similar issue. I 
added a "sleep(1)" to proc_exit to simulate slow backend exit, and the 
test started failing, because the "safe_psql" calls used in the 
initialization was still consuming a connection slot.

I've pushed a fix to both tests, by restarting the server after the 
initialization steps. That's a little heavy-weight, but it's a simple 
way to ensure that all the backends are gone.

- Heikki




pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: libxml2 video about its abandonment
Next
From: Jasper Smit
Date:
Subject: Re: Visibility bug in tuple lock