Re: disabled SSL log_like tests - Mailing list pgsql-hackers

From Tom Lane
Subject Re: disabled SSL log_like tests
Date
Msg-id 2859105.1745015195@sss.pgh.pa.us
Whole thread Raw
In response to Re: disabled SSL log_like tests  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: disabled SSL log_like tests
List pgsql-hackers
I wrote:
> * I was more than slightly surprised to find that there are a bunch of
> other connect_fails callers that are testing log_like or log_unlike
> and thereby risking the same type of race condition.  Some of those
> tests are relatively new and perhaps just haven't failed *yet*,
> but I wonder if we changed something since 2022 that solves this
> problem in a different way?

Nope, apparently not.  The failure depends on the kernel's scheduler,
so unsurprisingly it's quite platform-dependent.  But I can reproduce
it reliably on mamba's host (NetBSD 10/macppc) if I enable the
001_ssltests.pl log_like tests without applying the connect_fails
changes.

The fact that mamba hasn't been failing on the other affected
tests is a bit puzzling.  mamba isn't running kerberos or ldap
or oauth_validator, so the lack of failures there is expected.
authentication/t/001_password.pl appears accidentally not vulnerable:
it's not using log_like in any connect_fails tests.  (It is using
log_unlike, so those tests could be giving silent false negatives.)
But authentication/t/003_peer.pl has 8 test cases that look
vulnerable.  I guess there must be some extra dollop of timing
weirdness in the ssl tests that's not there in 003_peer.pl.

Unfortunately ... it sometimes fails even with the connect_fails
changes, for example

#   Failed test 'intermediate client certificate is missing: log matches'
#   at /home/tgl/pgsql/src/test/ssl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2666.
#                   '2025-04-18 17:59:19.358 EDT [1460] DEBUG:  assigned pm child slot 2 for backend
# 2025-04-18 17:59:19.359 EDT [1460] DEBUG:  forked new client backend, pid=599 socket=8
# 2025-04-18 17:59:19.369 EDT [599] [unknown] LOG:  connection received: host=localhost port=63709
# 2025-04-18 17:59:19.436 EDT [1460] DEBUG:  releasing pm child slot 1
# 2025-04-18 17:59:19.436 EDT [1460] DEBUG:  client backend (PID 25401) exited with exit code 0
# '
#     doesn't match '(?^:Client certificate verification failed at depth 0: unable to get local issuer certificate)'

What I think happened here is that a previous backend hadn't exited
yet when we start the test, and when its report does come out,
connect_fails prematurely stops waiting.  (In the above, evidently
the child process we want to wait for is 599, but we're fooled by
a delayed report for 25401.)  So my v1 patch needs work.

Maybe we can make the test compare the PIDs in the "forked new client
backend" and "client backend exited" log messages.  Stay tuned...

            regards, tom lane



pgsql-hackers by date:

Previous
From: Jacob Champion
Date:
Subject: Re: disabled SSL log_like tests
Next
From: Tom Lane
Date:
Subject: Re: disabled SSL log_like tests