Re: Rare SSL failures on eelpout - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Rare SSL failures on eelpout
Date
Msg-id CAEepm=0sHUVZHfz3Bcxaqj3YQwXAX2za_AKtEhZc2gxAomdEDQ@mail.gmail.com
In response to Re: Rare SSL failures on eelpout  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Rare SSL failures on eelpout  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Wed, Jan 23, 2019 at 4:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > Hmm.  Why is psql doing two sendto() calls without reading a response
> > in between, when it's possible for the server to exit after the first,
> > anyway?  Seems like a protocol violation somewhere?
>
> Keep in mind this is all down inside the SSL handshake, so if any
> protocol is being violated, it's theirs not ours.

The sendto() of 1115 bytes is SSL_connect()'s last syscall, just
before it returns 1 to indicate success (even though it wasn't
successful?), without waiting for a further response.  The sendto() of
107 bytes is our start-up packet, which either succeeds and is
followed by reading a "certificate revoked" message from the server,
or fails with ECONNRESET if the socket has already been shut down at
the server end due to the racing exit.
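
To make the two outcomes concrete, here is a minimal sketch over plain TCP on loopback, with no TLS involved. Everything in it is an illustrative assumption rather than what libpq or OpenSSL actually do: the names `fake_server` and `client_attempt` are hypothetical, the 1115-byte and 107-byte payloads are just filler of the sizes mentioned above, and an abortive close (`SO_LINGER` with a zero timeout, POSIX layout) stands in for the server process exiting before the client's next write.

```python
import socket
import struct
import threading
import time


def fake_server(listener, exit_early, proceed):
    """Accept one client, read its last handshake message, then reject it."""
    conn, _ = listener.accept()
    conn.recv(4096)  # the client's final handshake bytes
    if exit_early:
        # Assumption: force an RST so the client sees a reset connection,
        # imitating a server that has already gone away.
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, 0))
        conn.close()
        proceed.set()
    else:
        conn.sendall(b"certificate revoked")
        proceed.set()
        conn.recv(4096)  # wait for the start-up packet before closing
        conn.close()


def client_attempt(exit_early):
    """Send two messages back to back and report what the second one sees."""
    listener = socket.create_server(("127.0.0.1", 0))
    proceed = threading.Event()
    server = threading.Thread(target=fake_server,
                              args=(listener, exit_early, proceed))
    server.start()
    sock = socket.create_connection(listener.getsockname())
    try:
        sock.sendall(b"H" * 1115)  # stands in for SSL_connect()'s last write
        proceed.wait(5)
        time.sleep(0.2)            # give any RST time to arrive
        try:
            sock.sendall(b"S" * 107)  # stands in for the start-up packet
        except ConnectionError as exc:
            return "send failed: " + type(exc).__name__
        return "server said: " + sock.recv(4096).decode()
    finally:
        sock.close()
        server.join()
        listener.close()


if __name__ == "__main__":
    print(client_attempt(exit_early=False))
    print(client_attempt(exit_early=True))
```

With `exit_early=False` the second write succeeds and the client then reads the server's complaint. With `exit_early=True` the second write itself fails; on Linux this typically surfaces as `ConnectionResetError`, while other platforms may report `BrokenPipeError` instead, and in both cases the client never sees the server's reason.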

It seems very strange to me that the error report is deferred until we
send our start-up packet.  It seems like a response that belongs to
the connection attempt, not our later data sending.  Bug in OpenSSL?
Unintended consequence of our switch to blocking IO at that point?

I tried to find out how this looked on OpenSSL 1.0.2, but it looks like
Debian has just removed the older version from the buster distro, and
I'm out of time to hunt this on other OSes today.

> The whole thing reminds me of the recent bug #15598:
>
> https://www.postgresql.org/message-id/87k1iy44fd.fsf%40news-spur.riddles.org.uk

Yeah, if errors get moved to later exchanges but the server might exit
and close its end of the socket before we can manage to initiate a
later exchange, it starts to look just like that.

A less interesting bug is the appearance of three nonsensical "Success"
(glibc) or "No error: 0" (FreeBSD) error messages in the server logs
on systems running OpenSSL 1.1.1, much like the ones in the report
linked here, which I guess might mean EOF:

https://www.postgresql.org/message-id/CAEepm=3cc5wYv=X4Nzy7VOUkdHBiJs9bpLzqtqJWxdDUp5DiPQ@mail.gmail.com
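
As a rough illustration of how such a message can arise, assuming the guess about EOF is right: if a zero-byte read (EOF) is reported through the same path as a real failure while errno is still 0, the formatted text is whatever the C library prints for error number 0. The helper `describe_read_result` and its message wording are hypothetical and are not taken from the server code.

```python
import os


def describe_read_result(nbytes, err):
    """Hypothetical logger that treats any non-positive read as an error."""
    if nbytes > 0:
        return "read %d bytes" % nbytes
    # On EOF nothing has set errno, so err is 0 and strerror(0) yields the
    # platform's "no error" text instead of a useful explanation.
    return "read failed: " + os.strerror(err)


if __name__ == "__main__":
    print(describe_read_result(0, 0))
```

The exact text after "read failed: " depends on the C library, which matches the glibc and FreeBSD variants quoted above.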

-- 
Thomas Munro
http://www.enterprisedb.com

