Re: SSL renegotiation and other related woes - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: SSL renegotiation and other related woes |
Date | |
Msg-id | 54D3DA86.6000303@vmware.com Whole thread Raw |
In response to | SSL renegotiation and other related woes (Andres Freund <andres@2ndquadrant.com>) |
Responses |
Re: SSL renegotiation and other related woes
(Heikki Linnakangas <hlinnakangas@vmware.com>)
Re: SSL renegotiation and other related woes (Andres Freund <andres@2ndquadrant.com>) |
List | pgsql-hackers |
On 01/26/2015 12:14 PM, Andres Freund wrote: > Hi, > > When working on getting rid of ImmediateInterruptOK I wanted to verify > that ssl still works correctly. Turned out it didn't. But neither did it > in master. > > Turns out there's two major things we do wrong: > > 1) We ignore the rule that once called and returning > SSL_ERROR_WANTS_(READ|WRITE) SSL_read()/write() have to be called > again with the same parameters. Unfortunately that rule doesn't mean > just that the same parameters have to be passed in, but also that we > can't just constantly switch between _read()/write(). Especially > nonblocking backend code (i.e. walsender) and the whole frontend code > violate this rule. I don't buy that. Googling around, and looking at the man pages, I just don't see anything to support that claim. I believe it is supported to switch between reads and writes. > 2) We start renegotiations in be_tls_write() while in nonblocking mode, > but don't properly retry to handle socket readyness. There's a loop > that retries handshakes twenty times (???), but what actually is > needed is to call SSL_get_error() and ensure that there's actually > data available. Yeah, that's just crazy. Actually, I don't think we need to call SSL_do_handshake() at all. At least on modern OpenSSL versions. OpenSSL internally uses a state machine, and SSL_read() and SSL_write() calls will complete the handshake just as well. > 2) is easy enough to fix, but 1) is pretty hard. Before anybody says > that 1) isn't an important rule: It reliably causes connection aborts > within a couple renegotiations. The higher the latency the higher the > likelihood of aborts. Even locally it doesn't take very long to > abort. Errors usually are something like "SSL connection has been closed > unexpectedly" or "SSL Error: sslv3 alert unexpected message" and a host > of other similar messages. There's a couple reports of those in the > archives and I've seen many more in client logs. Yeah, I can reproduce that. There's clearly something wrong. I believe this is an OpenSSL bug. I traced the "unexpected message" error to this code in OpenSSL's s3_read_bytes() function: > case SSL3_RT_APPLICATION_DATA: > /* > * At this point, we were expecting handshake data, but have > * application data. If the library was running inside ssl3_read() > * (i.e. in_read_app_data is set) and it makes sense to read > * application data at this point (session renegotiation not yet > * started), we will indulge it. > */ > if (s->s3->in_read_app_data && > (s->s3->total_renegotiations != 0) && > (((s->state & SSL_ST_CONNECT) && > (s->state >= SSL3_ST_CW_CLNT_HELLO_A) && > (s->state <= SSL3_ST_CR_SRVR_HELLO_A) > ) || ((s->state & SSL_ST_ACCEPT) && > (s->state <= SSL3_ST_SW_HELLO_REQ_A) && > (s->state >= SSL3_ST_SR_CLNT_HELLO_A) > ) > )) { > s->s3->in_read_app_data = 2; > return (-1); > } else { > al = SSL_AD_UNEXPECTED_MESSAGE; > SSLerr(SSL_F_SSL3_READ_BYTES, SSL_R_UNEXPECTED_RECORD); > goto f_err; > } So that quite clearly throws an error, *unless* the original application call was SSL_read(). As you also concluded, OpenSSL doesn't like it when SSL_write() is called when it's in renegotiation. I think that's OpenSSL's fault and it should "indulge" the application data message even in SSL_write(). Can we work-around that easily? I believe we can. The crucial part is that we mustn't let SSL_write() to process any incoming application data. We can achieve that if we always call SSL_read() to drain such data, before calling SSL_write(). But that's not quite enough; if we're in renegotiation, SSL_write() might still try to read more data from the socket that has arrived after the SSL_read() call. Fortunately, we can easily prevent that by hacking pqsecure_raw_read() to always return EWOULDBLOCK, when it's called through SSL_write(). The attached patch does that. I've been running your pg_receivexlog test case with this for about 15 minutes without any errors now, with ssl_renegotiation_limit=50kB, while before it errored out within seconds. In theory, I guess we should do similar hacks in the server, and always call SSL_read() before SSL_write(), but it seems to work without it. Or maybe not; OpenSSL server and client code is not symmetric, so perhaps it works in server mode without that. PS. The server code started renegotiation when it's 1kB from reaching ssl_renegotiation_limit. I increased that to 32kB, because in testing, I got some "SSL failed to renegotiate connection before limit expired" errors in testing before I did that. PPS. I did all testing of this patch with my sleeping logic simplification patch applied (http://www.postgresql.org/message-id/54D3821E.9060004@vmware.com). It applies without it too, and I don't think it matters, but I thought I'd mention it. - Heikki
Attachment
pgsql-hackers by date: