Re: Flaky 003_start_stop.pl test - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Flaky 003_start_stop.pl test |
Date | |
Msg-id | CA+hUKGKJSOAdAukP4QTkR3-jFws39+8C197XC-a97dgYr=cdBA@mail.gmail.com Whole thread Raw |
In response to | Re: Flaky 003_start_stop.pl test (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
On Tue, Mar 4, 2025 at 5:48 AM Andres Freund <andres@anarazel.de> wrote: > On 2025-02-24 11:26:56 +0100, Jelte Fennema-Nio wrote: > > [1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256 > > A possibly related test failure is: > > https://cirrus-ci.com/task/5005337046024192?logs=check_world#L259 They look different to me: (1) In Jelte's report for 003_start_stop, we see: [03:40:42.347](2.022s) ok 1 - dead-end connection 0 [03:40:42.348](0.001s) ok 2 - dead-end connection 1 [03:40:42.349](0.001s) ok 3 - dead-end connection 2 ... up to 20 ... We see in the server log that the first one had pid 31660 and slot 1, but it never said "SSLRequest rejected" like all the others. The Perl code must have received "N", or it would not have moved onto opening the next connection. Then just as it is about to try to connect, hoping to see "too many clients", that one exits, the "invalid_role" connection gets slot 1 instead of the dead-end backend we expected: 03:40:42.375 UTC [31645][postmaster] DEBUG: client backend (PID 31660) exited with exit code 0 <-- goodbye, silent backend 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG: connection received: host=[local] 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG: SSLRequest rejected 03:40:42.511 UTC [31645][postmaster] DEBUG: assigned pm child slot 1 for backend <-- we get silent backend's slot 03:40:42.512 UTC [31645][postmaster] DEBUG: forked new client backend, pid=31690 socket=7 03:40:42.542 UTC [31690][not initialized] [[unknown]][:0] LOG: connection received: host=[local] 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG: connection authenticated: user="invalid_user" method=trust (/tmp/cirrus-ci-build/build/testrun/postmaster/003_start_stop/data/t_003_start_stop_main_data/pgdata/pg_hba.conf:117) 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG: connection authorized: user=invalid_user database=postgres application_name=003_start_stop.pl 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] FATAL: role "invalid_user" does not exist Could the backend have sent an "N" (NoticeResponse) unrelated to SSL negotiation? (2) In Andres's report for 002_connection_limits, we see that the server logged "FATAL: remaining connection slots are reserved for roles with the SUPERUSER attribute" but the test saw "server closed the connection unexpectedly", on Windows. That's the socket/exit phenomenon of the simple type where the clients next call is recv()[1][2]. If there are no objections I think I will go ahead and re-commit 6051857fc, as threatened in a8458f50, to master only. And think about how to collect CI/BF stats on this specific failure pattern (which happens in various tests with various FATAL messages) to measure the effects. [1] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#Miscellaneous_tests_fail_on_Windows_due_to_a_connection_closed_before_receiving_a_final_error_message [2] https://www.postgresql.org/message-id/flat/CA%2BhUKGLR10ZqRCvdoRrkQusq75wF5%3DvEetRSs2_u1s%2BFAUosFQ%40mail.gmail.com
pgsql-hackers by date: