Re: Rare SSL failures on eelpout - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: Rare SSL failures on eelpout
Date:
Msg-id: 30952.1551738944@sss.pgh.pa.us
In response to: Re: Rare SSL failures on eelpout (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: Rare SSL failures on eelpout
           Re: Rare SSL failures on eelpout
           Re: Rare SSL failures on eelpout
List: pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, Mar 5, 2019 at 10:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's all very confusing, but I think there's a nontrivial chance
>> that this is an OpenSSL bug, especially since we haven't been able
>> to replicate it elsewhere.

> Hmm.  Yes, it is strange that we haven't seen it elsewhere, but
> remember that very few animals are running the ssl tests; also it
> requires particular timing to hit.

True.  I've spent some time today running the ssl tests on various
machines here, without any luck reproducing.

> OK, here's something.  I can reproduce it quite easily on this
> machine, and I can fix it like this:

>                          libpq_gettext("could not send startup packet: %s\n"),
>                          SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
>          free(startpacket);
> +        pqHandleSendFailure(conn);
>          goto error_return;
>      }

Yeah.  But I don't like pqHandleSendFailure all that much: it has strong
constraints on what state libpq has to be in, as a consequence of which
it's called from a bunch of ad-hoc places, and can't be called from some
others.  It's kind of accidental that it will work here.

I was toying with the idea of getting rid of that function in favor of
a design like this:

* Handle any socket-write failure at some low level of libpq by
recording the fact that the error occurred (plus whatever data we need
to produce a message about it), and then starting to just discard
output data.

* Eventually, we'll try to read.  Process any available input data as
usual (and, if we get a read error, report that).  When no more input
data is available, if a socket write failure has been recorded, report
that much as if it were an incoming ERROR message, and then shut down
the connection.

This would essentially give us pqHandleSendFailure-like behavior across
the board, which might be enough to fix this problem as well as bug
#15598.  Or not ... as you say, we haven't thoroughly understood either
issue, so it's possible this wouldn't do it.

This would have the effect of prioritizing reports of socket read
failures over those of write failures, which is an interesting
behavioral change, but offhand I can't see a problem with it.

One thing that isn't real clear to me is how much timing sensitivity
there is in "when no more input data is available".  Can we assume that
if we've gotten ECONNRESET or an allied error from a write, then any
data the far end might've wished to send us is already available to
read?  In principle, since TCP allows shutting down either transmission
direction independently, the server could close the read side of its
socket while continuing to transmit data.  In practice, PG servers
don't do that; but could network timing glitches create the same
effect?  Even if it's possible, does it happen enough to worry about?

The reason I'm concerned is that I don't think it'd be bright to ignore
a send error until we see input EOF, which'd be the obvious way to
solve a timing problem if there is one.  If our send error was some
transient glitch and the connection is really still open, then we might
not get EOF, but we won't get a server reply either because no message
went to the server.  You could imagine waiting some small interval of
time before deciding that it's time to report the write failure, but
ugh.

In any case, consuming immediately-available data before reporting the
write error would already be a step forward over what we're doing now,
I think.

Thoughts?

			regards, tom lane
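To make the proposed control flow concrete, here is a minimal standalone
sketch in plain C.  It is not actual libpq code: the names conn_t,
conn_send, and conn_read_or_report are invented for illustration, and a
non-blocking recv() with MSG_DONTWAIT stands in for libpq's own buffered,
non-blocking I/O.  The idea it demonstrates is the one described above:
the first send() failure is recorded, subsequent output is silently
discarded, and the saved error is reported only after any
immediately-available input has been consumed (with a genuine read error
taking precedence).

```c
/*
 * Hypothetical sketch (not libpq code) of the write-failure handling
 * described in the message above.  All names are invented.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

typedef struct
{
    int     sock;               /* connected socket */
    bool    write_failed;       /* has a send() error been recorded? */
    char    write_err_msg[256]; /* saved text for later reporting */
} conn_t;

/* Low-level send: on failure, record the error once and discard the data. */
static ssize_t
conn_send(conn_t *conn, const void *buf, size_t len)
{
    ssize_t     n = send(conn->sock, buf, len, 0);

    if (n < 0)
    {
        if (!conn->write_failed)
        {
            conn->write_failed = true;
            snprintf(conn->write_err_msg, sizeof(conn->write_err_msg),
                     "could not send data to server: %s", strerror(errno));
        }
        return (ssize_t) len;   /* pretend success; output is now discarded */
    }
    return n;
}

/*
 * Read step: hand any already-available server data to the caller first.
 * Only when nothing more can be read do we fall back to the recorded
 * write failure, then shut the connection down.
 */
static int
conn_read_or_report(conn_t *conn, char *buf, size_t len)
{
    ssize_t     n = recv(conn->sock, buf, len, MSG_DONTWAIT);

    if (n > 0)
        return (int) n;         /* server data (e.g. an ERROR message) wins */

    if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
    {
        /* a genuine read error is reported in preference to the write one */
        fprintf(stderr, "could not receive data from server: %s\n",
                strerror(errno));
    }
    else if (conn->write_failed)
    {
        /* no more immediately-available input: surface the saved send error */
        fprintf(stderr, "%s\n", conn->write_err_msg);
    }
    else if (n == 0)
    {
        fprintf(stderr, "server closed the connection unexpectedly\n");
    }
    else
        return 0;               /* nothing to read, nothing recorded: retry later */

    close(conn->sock);          /* report, then shut down the connection */
    conn->sock = -1;
    return -1;
}
```

In libpq itself the recorded state would presumably live in the PGconn
object and the report would go through the normal connection-error
machinery; the point of the sketch is only the ordering of checks:
server data first, read errors next, the remembered write failure last.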