Rare SSL failures on eelpout - Mailing list pgsql-hackers

From Thomas Munro
Subject Rare SSL failures on eelpout
Date
Msg-id CAEepm=2n6Nv+5tFfe8YnkUm1fXgvxR0Mm1FoD+QKG-vLNGLyKg@mail.gmail.com
Whole thread Raw
Responses Re: Rare SSL failures on eelpout  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
(Moving this discussion from the pgsql-committers thread "pgsql:
Update ssl test certificates and keys", which is innocent.)

On Wed, Jan 16, 2019 at 10:37 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Jan 4, 2019 at 10:08 AM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > Thanks.  FWIW I've just updated eelpout (a Debian testing BF animal
> > that runs all the extra tests including SSL) to use libssl-dev
> > (instead of libssl1.0-dev), and cleared its accache files.  Let's see
> > if that works...
>
> Since that upgrade (to libssl 1.1.1a-1), there are have been a few
> intermittent failures in the SSL tests on that animal (thanks Tom for
> pointing that out off-list).

I reproduced this manually.  From time to time (less than 1% of the
times I tried), the server hangs up without sending a goodbye message,
thereby causing a failure to match the expected psql error message.
From the kernel's perspective on the psql side, a working-as-expected
case looks like this:

sendto(3, "\27\3\3\2\356\313\334\372\201\353\201h\204\353\240A\4\355\232\340\375\0X\220\326V\374\253\222i\2724"...,
1115, MSG_NOSIGNAL, NULL, 0) = 1115
ppoll([{fd=3, events=POLLOUT|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3,
revents=POLLOUT}])
sendto(3, "\27\3\3\0f\335\313\224\263\256r\371\215\177\273,N\345\342\200\257\r\321\6\323@\316\241\316\17\204\26"...,
107, MSG_NOSIGNAL, NULL, 0) = 107
ppoll([{fd=3, events=POLLIN|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3,
revents=POLLIN}])
recvfrom(3, "\27\3\3\0\23", 5, 0, NULL, NULL) = 5
recvfrom(3, "I\"\t;\3006\276\347\344]7>\2\234m\340]B\241", 19, 0,
NULL, NULL) = 19
close(3)                                = 0
write(2, "psql: SSL error: sslv3 alert cer"..., 49psql: SSL error:
sslv3 alert certificate revoked
) = 49

... and the unexpected failure case looks like this:

sendto(3, "\27\3\3\2\356/\254\307\277\342G?&BB\321\f\314\245\22\246U\337\263;\203f\302s\306\37\276"...,
1115, MSG_NOSIGNAL, NULL, 0) = 1115
ppoll([{fd=3, events=POLLOUT|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3,
revents=POLLOUT|POLLERR|POLLHUP}])
sendto(3, "\27\3\3\0f\177\335\20\305\353\234\306\211#\343\321\336\22111J\312\323F\210\6U\331\264GAr"...,
107, MSG_NOSIGNAL, NULL, 0) = -1 ECONNRESET (Connection reset by peer)
write(2, "psql: server closed the connecti"..., 199psql: server closed
the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
could not send startup packet: Connection reset by peer
) = 199

To the kernel on the server side, a good case looks like this:

[pid 13740] sendto(8,
"\27\3\3\0\23\252\21\227\2\232s\354\21\243\236\207\301\3B\341\253\306k\346",
24, 0, NULL, 0) = 24
[pid 13740] write(2, "2019-01-22 09:23:04.425 UTC [137"...,
1112019-01-22 09:23:04.425 UTC [13740] [unknown] LOG:  could not
accept SSL connection: certificate verify failed
) = 111
[pid 13740] exit_group(0)               = ?
[pid 13740] +++ exited with 0 +++

I didn't manage to trace a bad case on the server side (maybe the
strace interferes with the required timing), but I noticed that
SSL_shutdown()'s return code (which we don't check) is always -1, and
I noticed that SSL_get_error() after that is always SSL_ERROR_SSL.  I
wondered if we might somehow be reaching exit() when the alert
response is still buffered inside OpenSSL, but that doesn't seem right
-- the port has noblock=0 at that stage.

Hmm.  Why is psql doing two sendto() calls without reading a response
in between, when it's possible for the server to exit after the first,
anyway?  Seems like a protocol violation somewhere?

-- 
Thomas Munro
http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Amit Khandekar
Date:
Subject: Re: Pluggable Storage - Andres's take
Next
From: Andrew Gierth
Date:
Subject: Re: Allowing extensions to find out the OIDs of their member objects