Re: SSL renegotiation and other related woes - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: SSL renegotiation and other related woes
Date
Msg-id 54D3DA86.6000303@vmware.com
Whole thread Raw
In response to SSL renegotiation and other related woes  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: SSL renegotiation and other related woes  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Re: SSL renegotiation and other related woes  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On 01/26/2015 12:14 PM, Andres Freund wrote:
> Hi,
>
> When working on getting rid of ImmediateInterruptOK I wanted to verify
> that ssl still works correctly. Turned out it didn't. But neither did it
> in master.
>
> Turns out there's two major things we do wrong:
>
> 1) We ignore the rule that once called and returning
>     SSL_ERROR_WANTS_(READ|WRITE) SSL_read()/write() have to be called
>     again with the same parameters. Unfortunately that rule doesn't mean
>     just that the same parameters have to be passed in, but also that we
>     can't just constantly switch between _read()/write(). Especially
>     nonblocking backend code (i.e. walsender) and the whole frontend code
>     violate this rule.

I don't buy that. Googling around, and looking at the man pages, I just
don't see anything to support that claim. I believe it is supported to
switch between reads and writes.

> 2) We start renegotiations in be_tls_write() while in nonblocking mode,
>     but don't properly retry to handle socket readyness. There's a loop
>     that retries handshakes twenty times (???), but what actually is
>     needed is to call SSL_get_error() and ensure that there's actually
>     data available.

Yeah, that's just crazy.

Actually, I don't think we need to call SSL_do_handshake() at all. At
least on modern OpenSSL versions. OpenSSL internally uses a state
machine, and SSL_read() and SSL_write() calls will complete the
handshake just as well.

> 2) is easy enough to fix, but 1) is pretty hard. Before anybody says
> that 1) isn't an important rule: It reliably causes connection aborts
> within a couple renegotiations. The higher the latency the higher the
> likelihood of aborts. Even locally it doesn't take very long to
> abort. Errors usually are something like "SSL connection has been closed
> unexpectedly" or "SSL Error: sslv3 alert unexpected message" and a host
> of other similar messages. There's a couple reports of those in the
> archives and I've seen many more in client logs.

Yeah, I can reproduce that. There's clearly something wrong.

I believe this is an OpenSSL bug. I traced the "unexpected message"
error to this code in OpenSSL's s3_read_bytes() function:

>     case SSL3_RT_APPLICATION_DATA:
>         /*
>          * At this point, we were expecting handshake data, but have
>          * application data.  If the library was running inside ssl3_read()
>          * (i.e. in_read_app_data is set) and it makes sense to read
>          * application data at this point (session renegotiation not yet
>          * started), we will indulge it.
>          */
>         if (s->s3->in_read_app_data &&
>             (s->s3->total_renegotiations != 0) &&
>             (((s->state & SSL_ST_CONNECT) &&
>               (s->state >= SSL3_ST_CW_CLNT_HELLO_A) &&
>               (s->state <= SSL3_ST_CR_SRVR_HELLO_A)
>              ) || ((s->state & SSL_ST_ACCEPT) &&
>                    (s->state <= SSL3_ST_SW_HELLO_REQ_A) &&
>                    (s->state >= SSL3_ST_SR_CLNT_HELLO_A)
>              )
>             )) {
>             s->s3->in_read_app_data = 2;
>             return (-1);
>         } else {
>             al = SSL_AD_UNEXPECTED_MESSAGE;
>         SSLerr(SSL_F_SSL3_READ_BYTES, SSL_R_UNEXPECTED_RECORD);
>             goto f_err;
>         }

So that quite clearly throws an error, *unless* the original application
call was SSL_read(). As you also concluded, OpenSSL doesn't like it when
SSL_write() is called when it's in renegotiation. I think that's
OpenSSL's fault and it should "indulge" the application data message
even in SSL_write().

Can we work-around that easily? I believe we can. The crucial part is
that we mustn't let SSL_write() to process any incoming application
data. We can achieve that if we always call SSL_read() to drain such
data, before calling SSL_write(). But that's not quite enough; if we're
in renegotiation, SSL_write() might still try to read more data from the
socket that has arrived after the SSL_read() call. Fortunately, we can
easily prevent that by hacking pqsecure_raw_read() to always return
EWOULDBLOCK, when it's called through SSL_write().

The attached patch does that. I've been running your pg_receivexlog test
case with this for about 15 minutes without any errors now, with
ssl_renegotiation_limit=50kB, while before it errored out within seconds.

In theory, I guess we should do similar hacks in the server, and always
call SSL_read() before SSL_write(), but it seems to work without it. Or
maybe not; OpenSSL server and client code is not symmetric, so perhaps
it works in server mode without that.

PS. The server code started renegotiation when it's 1kB from reaching
ssl_renegotiation_limit. I increased that to 32kB, because in testing, I
got some "SSL failed to renegotiate connection before limit expired"
errors in testing before I did that.

PPS. I did all testing of this patch with my sleeping logic
simplification patch applied
(http://www.postgresql.org/message-id/54D3821E.9060004@vmware.com). It
applies without it too, and I don't think it matters, but I thought I'd
mention it.

- Heikki


Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Next
From: Robert Haas
Date:
Subject: Re: Redesigning checkpoint_segments