Re: SSL renegotiation and other related woes - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: SSL renegotiation and other related woes
Date
Msg-id 54DE6FAF.6050005@vmware.com
Whole thread Raw
In response to Re: SSL renegotiation and other related woes  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On 02/12/2015 01:41 AM, Andres Freund wrote:
> Hi,
>
> On 2015-02-05 23:03:02 +0200, Heikki Linnakangas wrote:
>> On 01/26/2015 12:14 PM, Andres Freund wrote:
>>> Hi,
>>>
>>> When working on getting rid of ImmediateInterruptOK I wanted to verify
>>> that ssl still works correctly. Turned out it didn't. But neither did it
>>> in master.
>>>
>>> Turns out there's two major things we do wrong:
>>>
>>> 1) We ignore the rule that once called and returning
>>>     SSL_ERROR_WANTS_(READ|WRITE) SSL_read()/write() have to be called
>>>     again with the same parameters. Unfortunately that rule doesn't mean
>>>     just that the same parameters have to be passed in, but also that we
>>>     can't just constantly switch between _read()/write(). Especially
>>>     nonblocking backend code (i.e. walsender) and the whole frontend code
>>>     violate this rule.
>>
>> I don't buy that. Googling around, and looking at the man pages, I just
>> don't see anything to support that claim. I believe it is supported to
>> switch between reads and writes.
>
> There's at least two implementations that seem to have workaround to
> achieve something like that. Apache's mod_ssl and the ssl layer for
> libevent.
>
>>> 2) We start renegotiations in be_tls_write() while in nonblocking mode,
>>>     but don't properly retry to handle socket readyness. There's a loop
>>>     that retries handshakes twenty times (???), but what actually is
>>>     needed is to call SSL_get_error() and ensure that there's actually
>>>     data available.
>>
>> Yeah, that's just crazy.
>>
>> Actually, I don't think we need to call SSL_do_handshake() at all. At least
>> on modern OpenSSL versions. OpenSSL internally uses a state machine, and
>> SSL_read() and SSL_write() calls will complete the handshake just as well.
>
> Some blaming seems to show that that's been the case for a long time.
>
>>> 2) is easy enough to fix, but 1) is pretty hard. Before anybody says
>>> that 1) isn't an important rule: It reliably causes connection aborts
>>> within a couple renegotiations. The higher the latency the higher the
>>> likelihood of aborts. Even locally it doesn't take very long to
>>> abort. Errors usually are something like "SSL connection has been closed
>>> unexpectedly" or "SSL Error: sslv3 alert unexpected message" and a host
>>> of other similar messages. There's a couple reports of those in the
>>> archives and I've seen many more in client logs.
>>
>> Yeah, I can reproduce that. There's clearly something wrong.
>
>> I believe this is an OpenSSL bug. I traced the "unexpected message" error to
>> this code in OpenSSL's s3_read_bytes() function:
>
> Yep, I got to that as well.
>
>>>     case SSL3_RT_APPLICATION_DATA:
>>>         /*
>>>          * At this point, we were expecting handshake data, but have
>>>          * application data.  If the library was running inside ssl3_read()
>>>          * (i.e. in_read_app_data is set) and it makes sense to read
>>>          * application data at this point (session renegotiation not yet
>>>          * started), we will indulge it.
>>>          */
>>>         if (s->s3->in_read_app_data &&
>>>             (s->s3->total_renegotiations != 0) &&
>>>             (((s->state & SSL_ST_CONNECT) &&
>>>               (s->state >= SSL3_ST_CW_CLNT_HELLO_A) &&
>>>               (s->state <= SSL3_ST_CR_SRVR_HELLO_A)
>>>              ) || ((s->state & SSL_ST_ACCEPT) &&
>>>                    (s->state <= SSL3_ST_SW_HELLO_REQ_A) &&
>>>                    (s->state >= SSL3_ST_SR_CLNT_HELLO_A)
>>>              )
>>>             )) {
>>>             s->s3->in_read_app_data = 2;
>>>             return (-1);
>>>         } else {
>>>             al = SSL_AD_UNEXPECTED_MESSAGE;
>>>         SSLerr(SSL_F_SSL3_READ_BYTES, SSL_R_UNEXPECTED_RECORD);
>>>             goto f_err;
>>>         }
>>
>> So that quite clearly throws an error, *unless* the original application
>> call was SSL_read(). As you also concluded, OpenSSL doesn't like it when
>> SSL_write() is called when it's in renegotiation. I think that's OpenSSL's
>> fault and it should "indulge" the application data message even in
>> SSL_write().
>
> That'd be nice, but I think there were multiple reports to them about
> this and there wasn't much of a response. Maybe it's time to try again.
>
>> In theory, I guess we should do similar hacks in the server, and always call
>> SSL_read() before SSL_write(), but it seems to work without it. Or maybe
>> not; OpenSSL server and client code is not symmetric, so perhaps it works in
>> server mode without that.
>
> I think we should do it on the server side - got some server side errors
> when I cranked up the pg_receivexlog's status interval and set the wal
> sender timeout very low.

I've not been able to reproduce any errors on the server side, with my
latest client-only patch. Not sure why. I would assume that the server
has the same problem, but perhaps there are some mitigating factors that
make it impossible or at least less likely there.

I'm planning to commit and backpatch the first two of the attached
patches. The first is essentially the same as what I've posted before,
with some minor cleanup. I realized that the hack can be isolated
completely in fe-secure-openssl.c by putting the "kill-switch" in the
custom BIO_read routine, instead of pqsecure_raw_read(), so I did that.
The second patch makes pqSendSome() call pqReadData() also in
non-blocking mode. Per discussion, that's necessary to avoid getting
stuck, if an application that uses non-blocking mode doesn't call
pqConsumeInput() between pqFlush() calls.

For the sake of completeness, the third one applies the same kill-switch
hack to the server, preventing SSL_write() from reading anything.
However, as it doesn't seem to actually be necessary, and it happens to
be a bit more messy to implement in the backend than in libpq, I'm not
going to commit that one.

- Heikki


Attachment

pgsql-hackers by date:

Previous
From: David Fetter
Date:
Subject: Re: RangeType internal use
Next
From: Robert Haas
Date:
Subject: Re: assessing parallel-safety