Thread: Flaky 003_start_stop.pl test

Flaky 003_start_stop.pl test

From
Jelte Fennema-Nio
Date:
The test introduced in bb86141 turns out to be flaky in CI[1] with the
following error message:

[03:40:43.047] stderr:
[03:40:43.047] # Failed test 'connect : matches'
[03:40:43.047] # at
/tmp/cirrus-ci-build/src/test/postmaster/t/003_start_stop.pl line 78.
[03:40:43.047] # 'psql: error: connection to server on socket
"/tmp/X7O4HUuFUZ/.s.PGSQL.25995" failed: FATAL: role "invalid_user"
does not exist'
[03:40:43.047] # doesn't match '(?^:FATAL: sorry, too many clients already)'
[03:40:43.047] # Looks like you failed 1 test of 25.
[03:40:43.047]
[03:40:43.047] (test program exited with status code 1)

[1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256



Re: Flaky 003_start_stop.pl test

From
Andres Freund
Date:
Hi,

On 2025-02-24 11:26:56 +0100, Jelte Fennema-Nio wrote:
> The test introduced in bb86141 turns out to be flaky in CI[1] with the
> following error message:
>
> [03:40:43.047] stderr:
> [03:40:43.047] # Failed test 'connect : matches'
> [03:40:43.047] # at
> /tmp/cirrus-ci-build/src/test/postmaster/t/003_start_stop.pl line 78.
> [03:40:43.047] # 'psql: error: connection to server on socket
> "/tmp/X7O4HUuFUZ/.s.PGSQL.25995" failed: FATAL: role "invalid_user"
> does not exist'
> [03:40:43.047] # doesn't match '(?^:FATAL: sorry, too many clients already)'
> [03:40:43.047] # Looks like you failed 1 test of 25.
> [03:40:43.047]
> [03:40:43.047] (test program exited with status code 1)
>
> [1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256

A possibly related test failure is:

https://cirrus-ci.com/task/5005337046024192?logs=check_world#L259

[04:12:26.438] ------------------------------------- 8< -------------------------------------
[04:12:26.438] stderr:
[04:12:26.438] #   Failed test 'reserved_connections limit: matches'
[04:12:26.438] #   at C:/cirrus/src/test/postmaster/t/002_connection_limits.pl line 63.
[04:12:26.438] #                   'psql: error: connection to server on socket
"C:/Windows/TEMP/jFcAsvY5h5/.s.PGSQL.21116"failed: server closed the connection unexpectedly
 
[04:12:26.438] #     This probably means the server terminated abnormally
[04:12:26.438] #     before or while processing the request.'
[04:12:26.438] #     doesn't match '(?^:FATAL:  remaining connection slots are reserved for roles with the SUPERUSER
attribute)'
[04:12:26.438] # Looks like you failed 1 test of 27.
[04:12:26.438]
[04:12:26.438] (test program exited with status code 1)
[04:12:26.438] ------------------------------------------------------------------------------

Greetings,

Andres Freund



Re: Flaky 003_start_stop.pl test

From
Thomas Munro
Date:
On Tue, Mar 4, 2025 at 5:48 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-02-24 11:26:56 +0100, Jelte Fennema-Nio wrote:
> > [1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256
>
> A possibly related test failure is:
>
> https://cirrus-ci.com/task/5005337046024192?logs=check_world#L259

They look different to me:

(1) In Jelte's report for 003_start_stop, we see:

[03:40:42.347](2.022s) ok 1 - dead-end connection 0
[03:40:42.348](0.001s) ok 2 - dead-end connection 1
[03:40:42.349](0.001s) ok 3 - dead-end connection 2
... up to 20 ...

We see in the server log that the first one had pid 31660 and slot 1,
but it never said "SSLRequest rejected" like all the others.  The Perl
code must have received "N", or it would not have moved onto opening
the next connection.  Then just as it is about to try to connect,
hoping to see "too many clients", that one exits, the "invalid_role"
connection gets slot 1 instead of the dead-end backend we expected:

03:40:42.375 UTC [31645][postmaster] DEBUG:  client backend (PID
31660) exited with exit code 0 <-- goodbye, silent backend
03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
connection received: host=[local]
03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
SSLRequest rejected
03:40:42.511 UTC [31645][postmaster] DEBUG:  assigned pm child slot 1
for backend <-- we get silent backend's slot
03:40:42.512 UTC [31645][postmaster] DEBUG:  forked new client
backend, pid=31690 socket=7
03:40:42.542 UTC [31690][not initialized] [[unknown]][:0] LOG:
connection received: host=[local]
03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
connection authenticated: user="invalid_user" method=trust
(/tmp/cirrus-ci-build/build/testrun/postmaster/003_start_stop/data/t_003_start_stop_main_data/pgdata/pg_hba.conf:117)
03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
connection authorized: user=invalid_user database=postgres
application_name=003_start_stop.pl
03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] FATAL:
role "invalid_user" does not exist

Could the backend have sent an "N" (NoticeResponse) unrelated to SSL
negotiation?

(2) In Andres's report for 002_connection_limits, we see that the
server logged "FATAL:  remaining connection slots are reserved for
roles with the SUPERUSER attribute" but the test saw "server closed
the connection unexpectedly", on Windows.  That's the socket/exit
phenomenon of the simple type where the clients next call is
recv()[1][2].  If there are no objections I think I will go ahead and
re-commit 6051857fc, as threatened in a8458f50, to master only.  And
think about how to collect CI/BF stats on this specific failure
pattern (which happens in various tests with various FATAL messages)
to measure the effects.

[1]
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#Miscellaneous_tests_fail_on_Windows_due_to_a_connection_closed_before_receiving_a_final_error_message
[2]
https://www.postgresql.org/message-id/flat/CA%2BhUKGLR10ZqRCvdoRrkQusq75wF5%3DvEetRSs2_u1s%2BFAUosFQ%40mail.gmail.com



Re: Flaky 003_start_stop.pl test

From
Andres Freund
Date:
Hi,

On 2025-03-04 09:48:45 +1300, Thomas Munro wrote:
> On Tue, Mar 4, 2025 at 5:48 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-02-24 11:26:56 +0100, Jelte Fennema-Nio wrote:
> > > [1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256
> >
> > A possibly related test failure is:
> >
> > https://cirrus-ci.com/task/5005337046024192?logs=check_world#L259
> 
> They look different to me:

Thanks for checking!


> (1) In Jelte's report for 003_start_stop, we see:
> 
> [03:40:42.347](2.022s) ok 1 - dead-end connection 0
> [03:40:42.348](0.001s) ok 2 - dead-end connection 1
> [03:40:42.349](0.001s) ok 3 - dead-end connection 2
> ... up to 20 ...
> 
> We see in the server log that the first one had pid 31660 and slot 1,
> but it never said "SSLRequest rejected" like all the others.  The Perl
> code must have received "N", or it would not have moved onto opening
> the next connection.  Then just as it is about to try to connect,
> hoping to see "too many clients", that one exits, the "invalid_role"
> connection gets slot 1 instead of the dead-end backend we expected:
> 
> 03:40:42.375 UTC [31645][postmaster] DEBUG:  client backend (PID
> 31660) exited with exit code 0 <-- goodbye, silent backend
> 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
> connection received: host=[local]
> 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
> SSLRequest rejected
> 03:40:42.511 UTC [31645][postmaster] DEBUG:  assigned pm child slot 1
> for backend <-- we get silent backend's slot
> 03:40:42.512 UTC [31645][postmaster] DEBUG:  forked new client
> backend, pid=31690 socket=7
> 03:40:42.542 UTC [31690][not initialized] [[unknown]][:0] LOG:
> connection received: host=[local]
> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
> connection authenticated: user="invalid_user" method=trust
>
(/tmp/cirrus-ci-build/build/testrun/postmaster/003_start_stop/data/t_003_start_stop_main_data/pgdata/pg_hba.conf:117)
> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
> connection authorized: user=invalid_user database=postgres
> application_name=003_start_stop.pl
> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] FATAL:
> role "invalid_user" does not exist

Good catch.

The pattern is similar in other failed tests. There has been a spate of errors
on postgres/postgres' CI, which made me look at this again.


> Could the backend have sent an "N" (NoticeResponse) unrelated to SSL
> negotiation?

I don't think that'd be legal protocol flow from the server side. And, given
that the test uses log_min_messages=debug2, should have been logged in the
server log.

All the recent failures were on FreeBSD, interestingly.


I wonder if we should, at least temporarily, increase the test's log level?

I also suspect that it'd be worthwhile to add a few more if
(Trace_connection_negotiation) messages.

Greetings,

Andres Freund