Thread: Flaky 003_start_stop.pl test
The test introduced in bb86141 turns out to be flaky in CI[1] with the following error message: [03:40:43.047] stderr: [03:40:43.047] # Failed test 'connect : matches' [03:40:43.047] # at /tmp/cirrus-ci-build/src/test/postmaster/t/003_start_stop.pl line 78. [03:40:43.047] # 'psql: error: connection to server on socket "/tmp/X7O4HUuFUZ/.s.PGSQL.25995" failed: FATAL: role "invalid_user" does not exist' [03:40:43.047] # doesn't match '(?^:FATAL: sorry, too many clients already)' [03:40:43.047] # Looks like you failed 1 test of 25. [03:40:43.047] [03:40:43.047] (test program exited with status code 1) [1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256
Hi, On 2025-02-24 11:26:56 +0100, Jelte Fennema-Nio wrote: > The test introduced in bb86141 turns out to be flaky in CI[1] with the > following error message: > > [03:40:43.047] stderr: > [03:40:43.047] # Failed test 'connect : matches' > [03:40:43.047] # at > /tmp/cirrus-ci-build/src/test/postmaster/t/003_start_stop.pl line 78. > [03:40:43.047] # 'psql: error: connection to server on socket > "/tmp/X7O4HUuFUZ/.s.PGSQL.25995" failed: FATAL: role "invalid_user" > does not exist' > [03:40:43.047] # doesn't match '(?^:FATAL: sorry, too many clients already)' > [03:40:43.047] # Looks like you failed 1 test of 25. > [03:40:43.047] > [03:40:43.047] (test program exited with status code 1) > > [1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256 A possibly related test failure is: https://cirrus-ci.com/task/5005337046024192?logs=check_world#L259 [04:12:26.438] ------------------------------------- 8< ------------------------------------- [04:12:26.438] stderr: [04:12:26.438] # Failed test 'reserved_connections limit: matches' [04:12:26.438] # at C:/cirrus/src/test/postmaster/t/002_connection_limits.pl line 63. [04:12:26.438] # 'psql: error: connection to server on socket "C:/Windows/TEMP/jFcAsvY5h5/.s.PGSQL.21116"failed: server closed the connection unexpectedly [04:12:26.438] # This probably means the server terminated abnormally [04:12:26.438] # before or while processing the request.' [04:12:26.438] # doesn't match '(?^:FATAL: remaining connection slots are reserved for roles with the SUPERUSER attribute)' [04:12:26.438] # Looks like you failed 1 test of 27. [04:12:26.438] [04:12:26.438] (test program exited with status code 1) [04:12:26.438] ------------------------------------------------------------------------------ Greetings, Andres Freund
On Tue, Mar 4, 2025 at 5:48 AM Andres Freund <andres@anarazel.de> wrote: > On 2025-02-24 11:26:56 +0100, Jelte Fennema-Nio wrote: > > [1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256 > > A possibly related test failure is: > > https://cirrus-ci.com/task/5005337046024192?logs=check_world#L259 They look different to me: (1) In Jelte's report for 003_start_stop, we see: [03:40:42.347](2.022s) ok 1 - dead-end connection 0 [03:40:42.348](0.001s) ok 2 - dead-end connection 1 [03:40:42.349](0.001s) ok 3 - dead-end connection 2 ... up to 20 ... We see in the server log that the first one had pid 31660 and slot 1, but it never said "SSLRequest rejected" like all the others. The Perl code must have received "N", or it would not have moved onto opening the next connection. Then just as it is about to try to connect, hoping to see "too many clients", that one exits, the "invalid_role" connection gets slot 1 instead of the dead-end backend we expected: 03:40:42.375 UTC [31645][postmaster] DEBUG: client backend (PID 31660) exited with exit code 0 <-- goodbye, silent backend 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG: connection received: host=[local] 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG: SSLRequest rejected 03:40:42.511 UTC [31645][postmaster] DEBUG: assigned pm child slot 1 for backend <-- we get silent backend's slot 03:40:42.512 UTC [31645][postmaster] DEBUG: forked new client backend, pid=31690 socket=7 03:40:42.542 UTC [31690][not initialized] [[unknown]][:0] LOG: connection received: host=[local] 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG: connection authenticated: user="invalid_user" method=trust (/tmp/cirrus-ci-build/build/testrun/postmaster/003_start_stop/data/t_003_start_stop_main_data/pgdata/pg_hba.conf:117) 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG: connection authorized: user=invalid_user database=postgres application_name=003_start_stop.pl 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] FATAL: role "invalid_user" does not exist Could the backend have sent an "N" (NoticeResponse) unrelated to SSL negotiation? (2) In Andres's report for 002_connection_limits, we see that the server logged "FATAL: remaining connection slots are reserved for roles with the SUPERUSER attribute" but the test saw "server closed the connection unexpectedly", on Windows. That's the socket/exit phenomenon of the simple type where the clients next call is recv()[1][2]. If there are no objections I think I will go ahead and re-commit 6051857fc, as threatened in a8458f50, to master only. And think about how to collect CI/BF stats on this specific failure pattern (which happens in various tests with various FATAL messages) to measure the effects. [1] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#Miscellaneous_tests_fail_on_Windows_due_to_a_connection_closed_before_receiving_a_final_error_message [2] https://www.postgresql.org/message-id/flat/CA%2BhUKGLR10ZqRCvdoRrkQusq75wF5%3DvEetRSs2_u1s%2BFAUosFQ%40mail.gmail.com
Hi, On 2025-03-04 09:48:45 +1300, Thomas Munro wrote: > On Tue, Mar 4, 2025 at 5:48 AM Andres Freund <andres@anarazel.de> wrote: > > On 2025-02-24 11:26:56 +0100, Jelte Fennema-Nio wrote: > > > [1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256 > > > > A possibly related test failure is: > > > > https://cirrus-ci.com/task/5005337046024192?logs=check_world#L259 > > They look different to me: Thanks for checking! > (1) In Jelte's report for 003_start_stop, we see: > > [03:40:42.347](2.022s) ok 1 - dead-end connection 0 > [03:40:42.348](0.001s) ok 2 - dead-end connection 1 > [03:40:42.349](0.001s) ok 3 - dead-end connection 2 > ... up to 20 ... > > We see in the server log that the first one had pid 31660 and slot 1, > but it never said "SSLRequest rejected" like all the others. The Perl > code must have received "N", or it would not have moved onto opening > the next connection. Then just as it is about to try to connect, > hoping to see "too many clients", that one exits, the "invalid_role" > connection gets slot 1 instead of the dead-end backend we expected: > > 03:40:42.375 UTC [31645][postmaster] DEBUG: client backend (PID > 31660) exited with exit code 0 <-- goodbye, silent backend > 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG: > connection received: host=[local] > 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG: > SSLRequest rejected > 03:40:42.511 UTC [31645][postmaster] DEBUG: assigned pm child slot 1 > for backend <-- we get silent backend's slot > 03:40:42.512 UTC [31645][postmaster] DEBUG: forked new client > backend, pid=31690 socket=7 > 03:40:42.542 UTC [31690][not initialized] [[unknown]][:0] LOG: > connection received: host=[local] > 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG: > connection authenticated: user="invalid_user" method=trust > (/tmp/cirrus-ci-build/build/testrun/postmaster/003_start_stop/data/t_003_start_stop_main_data/pgdata/pg_hba.conf:117) > 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG: > connection authorized: user=invalid_user database=postgres > application_name=003_start_stop.pl > 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] FATAL: > role "invalid_user" does not exist Good catch. The pattern is similar in other failed tests. There has been a spate of errors on postgres/postgres' CI, which made me look at this again. > Could the backend have sent an "N" (NoticeResponse) unrelated to SSL > negotiation? I don't think that'd be legal protocol flow from the server side. And, given that the test uses log_min_messages=debug2, should have been logged in the server log. All the recent failures were on FreeBSD, interestingly. I wonder if we should, at least temporarily, increase the test's log level? I also suspect that it'd be worthwhile to add a few more if (Trace_connection_negotiation) messages. Greetings, Andres Freund