Thread: pgsql: Log details for client certificate failures
Log details for client certificate failures Currently, debugging client certificate verification failures is mostly limited to looking at the TLS alert code on the client side. For simple deployments, sometimes it's enough to see "sslv3 alert certificate revoked" and know exactly what needs to be fixed, but if you add any more complexity (multiple CA layers, misconfigured CA certificates, etc.), trying to debug what happened based on the TLS alert alone can be an exercise in frustration. Luckily, the server has more information about exactly what failed in the chain, and we already have the requisite callback implemented as a stub. We fill that in, collect the data, and pass the constructed error message back to the main code via a static variable. This lets us add our error details directly to the final "could not accept SSL connection" log message, as opposed to issuing intermediate LOGs. It ends up looking like LOG: connection received: host=localhost port=43112 LOG: could not accept SSL connection: certificate verify failed DETAIL: Client certificate verification failed at depth 1: unable to get local issuer certificate. Failed certificate data (unverified): subject "/CN=Test CA for PostgreSQL SSL regression test client certs",serial number 2315134995201656577, issuer "/CN=Test root CA for PostgreSQL SSL regression test suite". The length of the Subject and Issuer strings is limited to prevent malicious client certs from spamming the logs. In case the truncation makes things ambiguous, the certificate's serial number is also logged. Author: Jacob Champion <pchampion@vmware.com> Discussion: https://www.postgresql.org/message-id/flat/d13c4a5787c2a3f83705124f0391e0738c796751.camel@vmware.com Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/3a0e385048adf84d212e010d82db16228e5f4060 Modified Files -------------- src/backend/libpq/be-secure-openssl.c | 112 +++++++++++++++++++++++++++++++++- src/test/ssl/conf/client-long.config | 14 +++++ src/test/ssl/ssl/client-long.crt | 20 ++++++ src/test/ssl/ssl/client-long.key | 27 ++++++++ src/test/ssl/sslfiles.mk | 2 +- src/test/ssl/t/001_ssltests.pl | 40 +++++++++++- src/test/ssl/t/SSL/Backend/OpenSSL.pm | 2 +- 7 files changed, 210 insertions(+), 7 deletions(-)
On Fri, Jul 15, 2022 at 03:35:28PM +0000, Peter Eisentraut wrote: > Log details for client certificate failures This seems to be causing cfbot/linux to fail since yesterday. -- Justin
On 16.07.22 13:37, Justin Pryzby wrote: > On Fri, Jul 15, 2022 at 03:35:28PM +0000, Peter Eisentraut wrote: >> Log details for client certificate failures > > This seems to be causing cfbot/linux to fail since yesterday. Hmm, the Cirrus CI build works for me: https://cirrus-ci.com/task/6520112718020608 Isn't the cfbot using the same thing?
On Sat, Jul 16, 2022 at 01:57:25PM +0200, Peter Eisentraut wrote: > On 16.07.22 13:37, Justin Pryzby wrote: > > On Fri, Jul 15, 2022 at 03:35:28PM +0000, Peter Eisentraut wrote: > > > Log details for client certificate failures > > > > This seems to be causing cfbot/linux to fail since yesterday. > > Hmm, the Cirrus CI build works for me: > https://cirrus-ci.com/task/6520112718020608 > > Isn't the cfbot using the same thing? It's using the same thing, so I suppose the failure is intermittent. It failed for me just now: https://cirrus-ci.com/task/6502388730167296 -- Justin
Justin Pryzby <pryzby@telsasoft.com> writes: > On Sat, Jul 16, 2022 at 01:57:25PM +0200, Peter Eisentraut wrote: >> Isn't the cfbot using the same thing? > It's using the same thing, so I suppose the failure is intermittent. > It failed for me just now: https://cirrus-ci.com/task/6502388730167296 It looks like a timing problem, which you'd probably never reproduce on an unloaded development system. On the presumably-busy CI hosts, I think we're managing to examine the log file before the backend has written the log entry. Can we rejigger the backend so that the log entry is written before we shove the failure response out to the client? Maybe not given the implementation technology. In that case the alternatives seem to be (1) insert a delay in the log_like checks (ugh) or (2) abandon attempting to test this particular behavior. regards, tom lane
> Justin Pryzby <pryzby@telsasoft.com> writes: >> It failed for me just now: https://cirrus-ci.com/task/6502388730167296 Looking at the cfbot, an annoyingly large percentage of current runs are being broken by this. I think we need to do this as a stopgap: > (2) abandon attempting to test this particular behavior. If somebody wants to think up a better answer later, great, but the current situation is completely not tolerable. regards, tom lane
Hi, On 2022-07-16 10:45:29 -0400, Tom Lane wrote: > > Justin Pryzby <pryzby@telsasoft.com> writes: > >> It failed for me just now: https://cirrus-ci.com/task/6502388730167296 > > Looking at the cfbot, an annoyingly large percentage of current runs > are being broken by this. I think we need to do this as a stopgap: > > > (2) abandon attempting to test this particular behavior. > > If somebody wants to think up a better answer later, great, but > the current situation is completely not tolerable. Yea, it's quite annoying. I think it might be worth to not wait for Peter to chime in, but to just add a skip to the test for now? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2022-07-16 10:45:29 -0400, Tom Lane wrote: >> Looking at the cfbot, an annoyingly large percentage of current runs >> are being broken by this. I think we need to do this as a stopgap: >>> (2) abandon attempting to test this particular behavior. > Yea, it's quite annoying. I think it might be worth to not wait for Peter to > chime in, but to just add a skip to the test for now? I was thinking of commenting out the log_like parts of the affected tests. Maybe that's too cute, but at least it'd still run through the code of interest, even if we don't verify the output. Anyway, if Peter doesn't answer pretty soon I'll go do that. regards, tom lane
Hi, On 2022-07-16 17:33:47 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Yea, it's quite annoying. I think it might be worth to not wait for Peter to > > chime in, but to just add a skip to the test for now? > > I was thinking of commenting out the log_like parts of the affected > tests. Maybe that's too cute, but at least it'd still run through > the code of interest, even if we don't verify the output. WFM. > Anyway, if Peter doesn't answer pretty soon I'll go do that. It's about midnight in Germany, so unlikely to come soon... Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2022-07-16 17:33:47 -0400, Tom Lane wrote: >> Anyway, if Peter doesn't answer pretty soon I'll go do that. > It's about midnight in Germany, so unlikely to come soon... Done now. regards, tom lane
On Sat, Jul 16, 2022 at 06:26:57PM -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2022-07-16 17:33:47 -0400, Tom Lane wrote: > >> Anyway, if Peter doesn't answer pretty soon I'll go do that. > > > It's about midnight in Germany, so unlikely to come soon... > > Done now. I think this is still failing https://cirrus-ci.com/task/6142528721256448 You can see the cfbot user's test history here https://cirrus-ci.com/github/postgresql-cfbot/postgresql/
Justin Pryzby <pryzby@telsasoft.com> writes: > I think this is still failing > https://cirrus-ci.com/task/6142528721256448 Sigh ... misleading formatting led me to see one log_like as belonging to a connect_ok test, when it belongs to a connect_fails. Fixed, thanks. regards, tom lane