Re: Flaky 003_start_stop.pl test - Mailing list pgsql-hackers
| From | Heikki Linnakangas |
|---|---|
| Subject | Re: Flaky 003_start_stop.pl test |
| Date | |
| Msg-id | 77d310f3-1ddd-4b2f-829a-fe5b8a14c6de@iki.fi Whole thread Raw |
| In response to | Re: Flaky 003_start_stop.pl test (Andres Freund <andres@anarazel.de>) |
| Responses |
Re: Flaky 003_start_stop.pl test
|
| List | pgsql-hackers |
Thanks for the analysis! On 05/03/2025 02:41, Andres Freund wrote: > On 2025-03-04 09:48:45 +1300, Thomas Munro wrote: >> On Tue, Mar 4, 2025 at 5:48 AM Andres Freund <andres@anarazel.de> wrote: >>> On 2025-02-24 11:26:56 +0100, Jelte Fennema-Nio wrote: >>>> [1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256 >>> >>> A possibly related test failure is: >>> >>> https://cirrus-ci.com/task/5005337046024192?logs=check_world#L259 >> >> They look different to me: > > Thanks for checking! > > >> (1) In Jelte's report for 003_start_stop, we see: >> >> [03:40:42.347](2.022s) ok 1 - dead-end connection 0 >> [03:40:42.348](0.001s) ok 2 - dead-end connection 1 >> [03:40:42.349](0.001s) ok 3 - dead-end connection 2 >> ... up to 20 ... >> >> We see in the server log that the first one had pid 31660 and slot 1, >> but it never said "SSLRequest rejected" like all the others. The Perl >> code must have received "N", or it would not have moved onto opening >> the next connection. Then just as it is about to try to connect, >> hoping to see "too many clients", that one exits, the "invalid_role" >> connection gets slot 1 instead of the dead-end backend we expected: >> >> 03:40:42.375 UTC [31645][postmaster] DEBUG: client backend (PID >> 31660) exited with exit code 0 <-- goodbye, silent backend >> 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG: >> connection received: host=[local] >> 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG: >> SSLRequest rejected >> 03:40:42.511 UTC [31645][postmaster] DEBUG: assigned pm child slot 1 >> for backend <-- we get silent backend's slot >> 03:40:42.512 UTC [31645][postmaster] DEBUG: forked new client >> backend, pid=31690 socket=7 >> 03:40:42.542 UTC [31690][not initialized] [[unknown]][:0] LOG: >> connection received: host=[local] >> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG: >> connection authenticated: user="invalid_user" method=trust >> (/tmp/cirrus-ci-build/build/testrun/postmaster/003_start_stop/data/t_003_start_stop_main_data/pgdata/pg_hba.conf:117) >> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG: >> connection authorized: user=invalid_user database=postgres >> application_name=003_start_stop.pl >> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] FATAL: >> role "invalid_user" does not exist > > Good catch. > > The pattern is similar in other failed tests. There has been a spate of errors > on postgres/postgres' CI, which made me look at this again. I see what's going on here. Before the loop that opens all the connections with the SSLRequests, the test does this: > if (!$node->raw_connect_works()) > { > plan skip_all => "this test requires working raw_connect()"; > } That's the connection you see in the log without the SSLRequest. When the test fails, what happens is that the backend handling that connection is slow. It lingers around, even though the perl client has already closed the socket. The backend finally exits _just_ when the test has opened all the "dead end" connections, and checks that the final connection fails with the "FATAL: sorry, too many clients already" error. Because the backend exits at just the right moment, it frees up a connection slot which is then used by that final connection. The test fails, because it expects all the connection slots to be in use and for that final connection to launch a dead-end backend. The 002_connection_limits.pl test has potential for a similar issue. I added a "sleep(1)" to proc_exit to simulate slow backend exit, and the test started failing, because the "safe_psql" calls used in the initialization was still consuming a connection slot. I've pushed a fix to both tests, by restarting the server after the initialization steps. That's a little heavy-weight, but it's a simple way to ensure that all the backends are gone. - Heikki
pgsql-hackers by date: