Re: Rare SSL failures on eelpout - Mailing list pgsql-hackers

From	Thomas Munro
Subject	Re: Rare SSL failures on eelpout
Date	March 4, 2019 03:58:09
Msg-id	CA+hUKG+sHn73iFPDWKd6E-Tn5-Xz39WfuzHCQViRv0a6jvvnrA@mail.gmail.com
In response to	Re: Rare SSL failures on eelpout (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses	Re: Rare SSL failures on eelpout
List	pgsql-hackers

On Wed, Jan 23, 2019 at 11:23 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Jan 23, 2019 at 4:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > The whole thing reminds me of the recent bug #15598:
> >
> > https://www.postgresql.org/message-id/87k1iy44fd.fsf%40news-spur.riddles.org.uk
>
> Yeah, if errors get moved to later exchanges but the server might exit
> and close its end of the socket before we can manage to initiate a
> later exchange, it starts to look just like that.

Based on some clues from Andrew Gierth (in the email referenced above
and also in an off-list chat), I did some experimentation that seemed
to confirm a theory of his that Linux might be taking a shortcut when
both sides are local, bypassing the RST step because it can see both
ends (whereas normally the TCP stack should cause the *next* sendto()
to fail IIUC?).  I think this case is essentially the same as bug
#15598, it's just happening at a different time.

With a simple socket test program I can see that if you send a single
packet after the remote end has closed, and it had already read
everything you had sent up to that point, you get EPIPE.  If there was
some outstanding data from a previous send that it hadn't read yet when
it closed its end, you get ECONNRESET.  This doesn't happen if client and
server are on different machines, or on FreeBSD even on the same
machine, but does happen if client and server are on the same Linux
system (whether using the loopback interface or a real network
interface).  However, after you get ECONNRESET, you can still read the
final data that was sent by the server before it closed, which
presumably contains the error we want to report to the user.  That
suggests that we could perhaps handle ECONNRESET both at startup
packet send time (for certificate rejection, eelpout's case) and at
initial query send (for idle timeout, bug #15598's case) by attempting
to read.  Does that make sense?  I haven't poked into the libpq state
machine stuff to see if that would be easy or hard.
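
To make the "attempt to read after a failed send" idea concrete, here is
a rough sketch in Python.  It is not libpq code and not the test program
mentioned above; send_then_drain() and demo() are hypothetical names made
up for illustration.  demo() uses socket.socketpair() (AF_UNIX on
Unix-like systems) rather than TCP, so it only approximates the
behaviour described, and which errno the failed send reports may differ
by platform.

```python
import errno
import socket


def send_then_drain(sock, payload, bufsize=4096):
    """Try to send payload; if the peer has already gone away, read
    whatever it left behind instead of giving up immediately.

    Returns (errno_name_or_None, bytes_read_after_the_failed_send).
    """
    try:
        sock.sendall(payload)
        return None, b""
    except (BrokenPipeError, ConnectionResetError) as exc:
        failure = errno.errorcode.get(exc.errno, str(exc.errno))

    # The send failed, but data the peer wrote before closing may still
    # be queued locally.  Read until EOF or until the kernel reports an
    # error on the socket.
    chunks = []
    while True:
        try:
            chunk = sock.recv(bufsize)
        except OSError:
            break
        if not chunk:
            break
        chunks.append(chunk)
    return failure, b"".join(chunks)


def demo(leave_unread_data):
    """Hypothetical stand-in for the client/server exchange, using a
    local socketpair instead of a real TCP connection."""
    client, server = socket.socketpair()
    try:
        if leave_unread_data:
            # The "server" never reads this before closing.
            client.sendall(b"startup packet")
        server.sendall(b"FATAL: certificate rejected")
        server.close()
        return send_then_drain(client, b"next message")
    finally:
        server.close()
        client.close()
```

Calling demo(True) or demo(False) returns the name of the error from the
failed send together with the bytes the "server" wrote before closing.
Whether the same approach fits into the libpq state machine is a
separate question.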

PS: looking again at the strace output from earlier, it's kinda funny
that it says revents=POLLOUT|POLLERR|POLLHUP, since that seems to be a
contradiction: if this were poll() and not ppoll() I think it might
violate POSIX which says "[POLLHUP] and POLLOUT are
mutually-exclusive; a stream can never be writable if a hangup has
occurred", but I don't see what we could do differently with that
anyway.

-- 
Thomas Munro
https://enterprisedb.com
