Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert - Mailing list pgsql-hackers

From Daniel Gustafsson
Subject Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
Date
Msg-id 9B6FE12B-D665-4764-895B-839CF650D543@yesql.se
Whole thread Raw
In response to Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert  (Jacob Champion <jchampion@timescale.com>)
Responses Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
List pgsql-hackers
> On 14 Apr 2023, at 19:34, Jacob Champion <jchampion@timescale.com> wrote:
>
> On Fri, Apr 14, 2023 at 7:20 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>> And again with the attachment.
>
> After some sleep... From inspection I think the final EOF branch could
> be masked by the new branch, if verification has failed but was already
> ignored.
>
> To test that, I tried hanging up on the client partway through the
> server handshake, and I got some strange results. With the patch, using
> sslmode=require and OpenSSL 1.0.1, I see:
>
>    connection to server at "127.0.0.1", port 50859 failed: SSL error:
> certificate verify failed: self signed certificate
>
> Which is wrong -- we shouldn't care about the self-signed failure if
> we're not using verify-*. I was going to suggest a patch like the following:
>
>>                    if (r == -1)
>> -                       libpq_append_conn_error(conn, "SSL SYSCALL error: %s",
>> -                                         SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
>> +                   {
>> +                       /*
>> +                        * If we get an X509 error here without an error in the
>> +                        * socket layer it means that verification failed without
>> +                        * it being a protocol error. A common cause is trying to
>> +                        * a default system CA which is missing or broken.
>> +                        */
>> +                       if (!save_errno && vcode != X509_V_OK)
>> +                           libpq_append_conn_error(conn, "SSL error: certificate verify failed: %s",
>> +                                                   X509_verify_cert_error_string(vcode));
>> +                       else
>> +                           libpq_append_conn_error(conn, "SSL SYSCALL error: %s",
>> +                                             SOCK_STRERROR(save_errno, sebuf, sizeof(sebuf)));
>> +                   }
>>                    else
>>                        libpq_append_conn_error(conn, "SSL SYSCALL error: EOF detected");
>
> But then I tested my case against PG15, and I didn't get the EOF message
> I expected:
>
>    connection to server at "127.0.0.1", port 50283 failed: SSL SYSCALL
> error: Success

This "error: Success" error has been reported to the list numerous times as
misleading, and I'd love to make progress on improving error reporting during
the v17 cycle.

> So it appears that this (hanging up on the client during the handshake)
> is _also_ a case where we could get a SYSCALL error with a zero errno,
> and my patch doesn't actually fix the misleading error message.
>
> That makes me worried, but I don't really have a concrete suggestion to
> make it better, yet. I'm not opposed to pushing this as a fix for the
> tests, but I suspect we'll have to iterate on this more.

So, taking a step back.  We know that libpq error reporting for SSL errors
isn't great, the permutations of sslmodes and OpenSSL versions and the very
fine-grained error handling API of OpenSSL make it hard to generalize well.
That's not what we're trying to solve here.

What we are trying solve is this one case where we know exactly what went
wrong, and we know that the error message as-is will be somewhere between
misleading and utterly bogus.  The committed feature is working as intended,
and the connection is refused as it should when no CA is available, but we know
it's a situation which is quite easy to get oneself into (a typo in an
environment variable can be enough).  So what we can do is pinpoint that
specific case and leave the unknowns to the current error reporting for
consistency with older postgres versions.

The attached checks for the specific known error, and leave all the other cases
to the same logging that we have today.  It relies on the knowledge that system
sslrootcert configs has deferred loading, and will run with verify-full.  So if
we see an X509 failure in loading the local issuer cert here then we know the
the user wanted to use the system CA pool for certificate verification but the
root CA cannot be loaded for some reason.

--
Daniel Gustafsson


Attachment

pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: v16dev: invalid memory alloc request size 8488348128
Next
From: Justin Pryzby
Date:
Subject: Re: v16dev: invalid memory alloc request size 8488348128