Thread: Fix for OpenSSL error queue bug

Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
Attached patch fixes an issue reported by William Felipe Welter about
a year ago [1]. It is loosely based on his original patch.

As Heikki goes into on that thread, the appropriate action seems to be
to constantly reset the error queue, and to make sure that we
ourselves clear the queue consistently. (Note that we might not have
consistently called ERR_get_error() in the event of an OOM within
SSLerrmessage(), for example). I have not changed backend code in the
patch, though; I felt that we had enough control of the general
situation there for it to be unnecessary to lock everything down.

The interface that OpenSSL exposes for all of this is very poorly
thought out. It's not exactly clear how a client of OpenSSL can be a
"good citizen" in handling the error queue. Correctly using the
library is only ever described in terms of a very exact thing that
must happen or must not happen. There is no overarching concept of how
things fit together so that each OpenSSL client doesn't clobber the
other. It's all rather impractical. Clearly, this patch needs careful
review.

[1]
http://www.postgresql.org/message-id/flat/20150224030956.2529.83279@wrigleys.postgresql.org#20150224030956.2529.83279@wrigleys.postgresql.org
--
Peter Geoghegan

Attachment

Re: Fix for OpenSSL error queue bug

From
Peter Eisentraut
Date:
On 2/5/16 5:04 AM, Peter Geoghegan wrote:
> As Heikki goes into on that thread, the appropriate action seems to be
> to constantly reset the error queue, and to make sure that we
> ourselves clear the queue consistently. (Note that we might not have
> consistently called ERR_get_error() in the event of an OOM within
> SSLerrmessage(), for example). I have not changed backend code in the
> patch, though; I felt that we had enough control of the general
> situation there for it to be unnecessary to lock everything down.

I think clearing the error after a call is not necessary.  The API
clearly requires that you should clear the error queue before a call, so
clearing it afterwards does not accomplish anything, except maybe make
broken code work sometimes, for a while.  Also, there is nothing that
says that an error produces exactly one entry in the error queue; it
could be multiple.  Or that errors couldn't arise at random times
between the reset and whatever happens next.

I think this is analogous to clearing errno before a C library call.
You could clear it afterwards as well, to be nice to the next guy, but
the next guy should really take care of that themselves, and we can't
rely on what happens in between anyway.

The places that you identified for change look correct as far as libpq
goes.  I do think that the backend should be updated in the same way,
because it's a) correct, b) easy enough, and c) there could well be
interactions with postgres_fdw, plproxy, plperl, or who knows what.




Re: Fix for OpenSSL error queue bug

From
Robert Haas
Date:
On Thu, Mar 3, 2016 at 7:15 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 2/5/16 5:04 AM, Peter Geoghegan wrote:
>> As Heikki goes into on that thread, the appropriate action seems to be
>> to constantly reset the error queue, and to make sure that we
>> ourselves clear the queue consistently. (Note that we might not have
>> consistently called ERR_get_error() in the event of an OOM within
>> SSLerrmessage(), for example). I have not changed backend code in the
>> patch, though; I felt that we had enough control of the general
>> situation there for it to be unnecessary to lock everything down.
>
> I think clearing the error after a call is not necessary.  The API
> clearly requires that you should clear the error queue before a call, so
> clearing it afterwards does not accomplish anything, except maybe make
> broken code work sometimes, for a while.  Also, there is nothing that
> says that an error produces exactly one entry in the error queue; it
> could be multiple.  Or that errors couldn't arise at random times
> between the reset and whatever happens next.
>
> I think this is analogous to clearing errno before a C library call.
> You could clear it afterwards as well, to be nice to the next guy, but
> the next guy should really take care of that themselves, and we can't
> rely on what happens in between anyway.
>
> The places that you identified for change look correct as far as libpq
> goes.  I do think that the backend should be updated in the same way,
> because it's a) correct, b) easy enough, and c) there could well be
> interactions with postgres_fdw, plproxy, plperl, or who knows what.

So what's the next step here?  Peter G, are you planning to update the
patch based on this review from Peter E?  If not, Peter E, do you want
to update the patch and commit?  If neither, I'm going to mark this
Returned with Feedback in the CF and move on, which seems a bit of a
shame since this appears to be a bona fide bug, but if nobody's
willing to work on it, it ain't gettin' fixed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Thu, Mar 10, 2016 at 2:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> So what's the next step here?  Peter G, are you planning to update the
> patch based on this review from Peter E?  If not, Peter E, do you want
> to update the patch and commit?  If neither, I'm going to mark this
> Returned with Feedback in the CF and move on, which seems a bit of a
> shame since this appears to be a bona fide bug, but if nobody's
> willing to work on it, it ain't gettin' fixed.

Getting to it very soon. Just really busy right this moment.


-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Thu, Mar 10, 2016 at 3:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Getting to it very soon. Just really busy right this moment.

That said, I agree with Peter's remarks about doing this frontend and
backend. So, while I'm not sure, I think we're in agreement on all
issues. I would have no problem with Peter E following through with
final steps + commit as Robert outlined, if that works for him.


-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Peter Eisentraut
Date:
On 3/10/16 6:10 PM, Peter Geoghegan wrote:
> On Thu, Mar 10, 2016 at 3:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Getting to it very soon. Just really busy right this moment.
>
> That said, I agree with Peter's remarks about doing this frontend and
> backend. So, while I'm not sure, I think we're in agreement on all
> issues. I would have no problem with Peter E following through with
> final steps + commit as Robert outlined, if that works for him.

My proposal is the attached patch.



Attachment

Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
Looked at your proposed patch. Will respond to your original mail on the matter.

On Thu, Mar 3, 2016 at 4:15 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> I think clearing the error after a call is not necessary.  The API
> clearly requires that you should clear the error queue before a call, so
> clearing it afterwards does not accomplish anything, except maybe make
> broken code work sometimes, for a while.

Uh, if it's so clear, then why haven't we been doing it all along? The
API doesn't require you to take *any* specific practical measure (for
example, the specific practical measure of resetting the queue before
calling an I/O function). It simply says "this exact thing cannot be
allowed to happen; the consequences are undefined", with nothing in
the way of guidance on what that means in the real world. It's a
shockingly bad API, but that's the reality.

Part of the problem is that various scripting language OpenSSL
wrappers are only very thin wrappers. They effectively pass the buck
on to PHP and Ruby devs. If we cannot get it right, what chance have
they? I've personally experienced a bit uptick in complaints about
this recently. I think there are 3 separate groups within Heroku that
regularly ask me how this patch is doing.

> Also, there is nothing that
> says that an error produces exactly one entry in the error queue; it
> could be multiple.  Or that errors couldn't arise at random times
> between the reset and whatever happens next.

I think that it's kind of implied, since calling ERR_get_error() pops
the stack. But even if that isn't so, it might be worth preventing bad
things from happening to client applications only sometimes.

> I think this is analogous to clearing errno before a C library call.
> You could clear it afterwards as well, to be nice to the next guy, but
> the next guy should really take care of that themselves, and we can't
> rely on what happens in between anyway.

It sounds like you're saying "well, we cannot be expected to bend over
backwards to make broken code work". But that broken code includes
every single version of libpq + OpenSSL currently distributed. Seems
like a very high standard. I'm not saying that that means we
definitely should clear the error queue reliably ourselves, but
doesn't it give you pause? Heikki seemed to think that clearing our
own queue was important when he looked at this a year ago:

http://www.postgresql.org/message-id/54EDD30D.5050107@vmware.com

Again, not conclusive, but I would like to hear a rationale for why
you think it's okay to not consistently clear our own queue for the
benefit of others. Is this informed by a concern about some specific
downside to taking that extra precaution?

Thanks
-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Peter Eisentraut
Date:
On 3/10/16 9:38 PM, Peter Geoghegan wrote:
> Looked at your proposed patch. Will respond to your original mail on the matter.
> 
> On Thu, Mar 3, 2016 at 4:15 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> I think clearing the error after a call is not necessary.  The API
>> clearly requires that you should clear the error queue before a call, so
>> clearing it afterwards does not accomplish anything, except maybe make
>> broken code work sometimes, for a while.
> 
> Uh, if it's so clear, then why haven't we been doing it all along?

The issue only happens if two interleaving trains of execution, one of
which is libpq, use OpenSSL.  Not many applications do that.  And you
also need to provoke the errors in a certain order.  And even then, in
some cases you might just see a false positive error, rather than a
crash.  So it's an edge case.

> Part of the problem is that various scripting language OpenSSL
> wrappers are only very thin wrappers. They effectively pass the buck
> on to PHP and Ruby devs. If we cannot get it right, what chance have
> they? I've personally experienced a bit uptick in complaints about
> this recently. I think there are 3 separate groups within Heroku that
> regularly ask me how this patch is doing.

I think they have been getting away with it for so long for the same
reasons.

Arguably, if everyone followed "my" approach, this should be very easy
to fix everywhere.  Instead, reading through the PHP bug report, they
are coming up with a fairly complex solution for clearing the error
queue afterwards so as to not leave "landmines" for the next guy.  But
their code will (AFAICT) still be wrong because they are not clearing
the error *before* the API calls where it is required per documentation.So "everyone" (sample of 2) is scrambling to
cleanup for the next guy
 
instead of doing the straightforward fix of following the API
documentation and cleaning up before their own calls.

I also see the clean-up-afterwards approach in the Python ssl module.  I
fear there is de facto a second API specification that requires you to
clean up errors after yourself and gives an implicit guarantee that the
error queue is empty whenever you want to make any SSL calls.  I don't
think this actually works in all cases, but maybe if everyone else is
convinced of that (in plain violation of the published OpenSSL
documentation, AFAICT) we need to get on board with that for
interoperability.

>> Also, there is nothing that
>> says that an error produces exactly one entry in the error queue; it
>> could be multiple.  Or that errors couldn't arise at random times
>> between the reset and whatever happens next.
> 
> I think that it's kind of implied, since calling ERR_get_error() pops
> the stack. But even if that isn't so, it might be worth preventing bad
> things from happening to client applications only sometimes.

The lore on the internet suggests that multiple errors could definitely
happen.  So popping one error afterwards isn't going to fix it, it just
moves the edge case around.  At least what we should do is clear the
entire queue afterwards instead of just the first error.

> doesn't it give you pause? Heikki seemed to think that clearing our
> own queue was important when he looked at this a year ago:
> 
> http://www.postgresql.org/message-id/54EDD30D.5050107@vmware.com

I think that message suggests that we should clear the queue before each
call, not after.




Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Thu, Mar 10, 2016 at 7:22 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> Arguably, if everyone followed "my" approach, this should be very easy
> to fix everywhere.

I don't think that there is any clear indication that the OpenSSL
people would share that view. Or my view. Or anything that's sensible
or practical or actionable.

> Instead, reading through the PHP bug report, they
> are coming up with a fairly complex solution for clearing the error
> queue afterwards so as to not leave "landmines" for the next guy.  But
> their code will (AFAICT) still be wrong because they are not clearing
> the error *before* the API calls where it is required per documentation.
>  So "everyone" (sample of 2) is scrambling to clean up for the next guy
> instead of doing the straightforward fix of following the API
> documentation and cleaning up before their own calls.

It will be less wrong, though.

PostgreSQL is the project that doesn't trust a C90 standard library
function to not blithely write passed the bounds of a passed buffer,
all because of a bug in certain versions of Solaris based systems that
was several years old at the time. See commit be8b06c364. My view that
that wasn't really worth worrying about was clearly the minority view
when this was discussed (a minority of 1, and a majority of 4 or 5,
IIRC). I think that this case vastly exceeds that standard for
worrying about other people's broken code; in this case, we ourselves
made the same mistake for years and years.

> I also see the clean-up-afterwards approach in the Python ssl module.  I
> fear there is de facto a second API specification that requires you to
> clean up errors after yourself and gives an implicit guarantee that the
> error queue is empty whenever you want to make any SSL calls.  I don't
> think this actually works in all cases, but maybe if everyone else is
> convinced of that (in plain violation of the published OpenSSL
> documentation, AFAICT) we need to get on board with that for
> interoperability.

I didn't know that Python's ssl module did that. That seems to lend
support to my view, which is that we should similarly clear the
thread's queue lest anyone else be affected. Yes, this approach is
fairly scatter-gun, but frankly that's just the situation we find
ourselves in.

>>> Also, there is nothing that
>>> says that an error produces exactly one entry in the error queue; it
>>> could be multiple.  Or that errors couldn't arise at random times
>>> between the reset and whatever happens next.
>>
>> I think that it's kind of implied, since calling ERR_get_error() pops
>> the stack. But even if that isn't so, it might be worth preventing bad
>> things from happening to client applications only sometimes.
>
> The lore on the internet suggests that multiple errors could definitely
> happen.  So popping one error afterwards isn't going to fix it, it just
> moves the edge case around.

Are you sure, specifically, that an I/O function is known to add more
than one error to the per-thread queue? Obviously there can be more
than one error in the queue. But I haven't seen any specific
indication, either in the docs or in the lore, that more than one
error can be added by a single call to an I/O function such as
SSL_read(). Perhaps you can share where you encountered the lore.

>> doesn't it give you pause? Heikki seemed to think that clearing our
>> own queue was important when he looked at this a year ago:
>>
>> http://www.postgresql.org/message-id/54EDD30D.5050107@vmware.com
>
> I think that message suggests that we should clear the queue before each
> call, not after.

Uh, it very clearly *is* Heikki's view that we should clear the queue
*afterwards*. Certainly, I think Heikki also wanted us to clear the
queue before, so we aren't screwed, "just to be sure", as he puts it
-- but nobody disputes that that's necessary anyway. That it might not
be *sufficient* to just call ERR_get_error() is the new information in
the bug report. Heikki said:

"""
<pedantic>
The OpenSSL manual doesn't directly require you to call
ERR_clear_error() before every SSL_* call. It just requires that you
ensure that the error queue is empty. Libpq ensures that by always
clearing the queue *after* an error happens, in SSLerrmessage().
</pedantic>

"""

The problem with this, as Heikki goes on to say, it that we might not
get to that point in SSLerrmessage(); we may not be able to pop the
queue/call ERR_get_error(), more or less by accident (e.g. I noticed
an OOM could do that). That's why I proposed to fix that by calling
ERR_get_error() early and unambiguously. If we must rely on that
happening, it should not be from such a long distance (i.e. from
within SSLerrmessage(), which is kind of far removed from the original
I/O function calls).

Thanks
--
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Thu, Mar 10, 2016 at 8:16 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Mar 10, 2016 at 7:22 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> Arguably, if everyone followed "my" approach, this should be very easy
>> to fix everywhere.
>
> I don't think that there is any clear indication that the OpenSSL
> people would share that view. Or my view. Or anything that's sensible
> or practical or actionable.

Ideally, we'd be able to get this into the upcoming minor release.
This bug has caused Heroku some serious problems.

-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Mar 10, 2016 at 8:16 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Thu, Mar 10, 2016 at 7:22 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>>> Arguably, if everyone followed "my" approach, this should be very easy
>>> to fix everywhere.

>> I don't think that there is any clear indication that the OpenSSL
>> people would share that view. Or my view. Or anything that's sensible
>> or practical or actionable.

> Ideally, we'd be able to get this into the upcoming minor release.

Agreed, we need to deal with this one way or the other.  My proposal
is:

1. In HEAD, do it as Peter E. suggests, ie clear error queue before calls.

2. In back branches, clear error queue before *and* after calls.  This
will waste a few nanoseconds but will avoid any risk of breaking
existing third-party code.

I think it's reasonable to expect extensions to update to the newer
behavior in a major release, but we're taking risks if we expect
that to happen in minor releases.

In any case, we need a patch that touches the backend-side code as well.
        regards, tom lane



Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Mon, Mar 14, 2016 at 3:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Agreed, we need to deal with this one way or the other.  My proposal
> is:
>
> 1. In HEAD, do it as Peter E. suggests, ie clear error queue before calls.
>
> 2. In back branches, clear error queue before *and* after calls.  This
> will waste a few nanoseconds but will avoid any risk of breaking
> existing third-party code.
>
> I think it's reasonable to expect extensions to update to the newer
> behavior in a major release, but we're taking risks if we expect
> that to happen in minor releases.

I am concerned that users will never be able to get this right, since
I think it requires every Ruby or PHP app using some thin OpenSSL
wrapper to clear the per-queue thread. It's a big mess, but it's our
mess to some degree.

I wonder if it would be just as good if we ensured that
ERR_get_error() was definitely called in any circumstance where it
looked like we might have an error: ERR_get_error() would be
*reliably* called, as in my patch, but unlike my patch only when
SSL_get_error() indicated a problem (not all the time).

Heikki believed that clearing the error queue by calling
ERR_clear_error() before calling an I/O function like SSL_read() was
necessary, as we all do; no controversy there. But Heikki also
believed, even prior to hearing about this bug, that it was important
and necessary for ERR_get_error() to be reached when there might be an
error added to the queue following a Postgres/libpq call to an I/O
function like SSL_read() followed by SSL_get_error() indicating
trouble. He thought, as I do, that it would be a good idea to not rely
on that happening from a distance (i.e. not relying on reaching
SSLerrmessage()). Peter E. seems to believe that there is absolutely
no reason to rely on ERR_get_error() getting called at all, and that
the existing SSLerrmessage() only exists for the purposes of producing
a human-readable error message.

Anyway, thinking about it some more, perhaps the best solution is to
do the ERR_get_error() call iff SSL_get_error() seems unhappy, perhaps
even placing the two into a utility function. That's probably almost
the same as the existing behavior, as far as clearing up the queue
after we may have added to it goes. I don't know if that's any less
safe then my patch's pessimistic approach. It seems like it might be
just as safe. Under this compromise, I think we'd probably clear the
error queue after SSL_get_error() returned a value that is not
SSL_ERROR_NONE, though (including SSL_ERROR_WANT_READ, etc). What do
you think about that?

> In any case, we need a patch that touches the backend-side code as well.

I agree that the backend-side code should be covered. I quickly
changed my mind about that, and am happy to produce a revision along
those lines.

-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Mon, Mar 14, 2016 at 3:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Agreed, we need to deal with this one way or the other.  My proposal
>> is:
>> 
>> 1. In HEAD, do it as Peter E. suggests, ie clear error queue before calls.
>> 
>> 2. In back branches, clear error queue before *and* after calls.  This
>> will waste a few nanoseconds but will avoid any risk of breaking
>> existing third-party code.

> I am concerned that users will never be able to get this right, since
> I think it requires every Ruby or PHP app using some thin OpenSSL
> wrapper to clear the per-queue thread. It's a big mess, but it's our
> mess to some degree.

So your proposal is basically to do #2 in all branches?  I won't fight it,
if it doesn't bloat the code much.  The overhead should surely be trivial
compared to network communication costs, and I'm afraid you might be right
about the risk of latent bugs.
        regards, tom lane



Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Mon, Mar 14, 2016 at 4:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So your proposal is basically to do #2 in all branches?  I won't fight it,
> if it doesn't bloat the code much.  The overhead should surely be trivial
> compared to network communication costs, and I'm afraid you might be right
> about the risk of latent bugs.

Yes, with one small difference: I wouldn't be calling ERR_get_error()
in the common case where SSL_get_error() returns SSL_ERROR_NONE, on
the theory that skipping that case represents no risk. I'm making a
concession to Peter E's view that that will calling ERR_get_error()
more will add useless cycles.

-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Mon, Mar 14, 2016 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Yes, with one small difference: I wouldn't be calling ERR_get_error()
> in the common case where SSL_get_error() returns SSL_ERROR_NONE, on
> the theory that skipping that case represents no risk. I'm making a
> concession to Peter E's view that that will calling ERR_get_error()
> more will add useless cycles.

The attached patch is what I have in mind.

I can produce a back-patchable variant of this if you and Peter E.
think this approach is okay.

--
Peter Geoghegan

Attachment

Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Mon, Mar 14, 2016 at 6:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I can produce a back-patchable variant of this if you and Peter E.
> think this approach is okay.

Where are we on this?

-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Mon, Mar 14, 2016 at 6:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I can produce a back-patchable variant of this if you and Peter E.
>> think this approach is okay.

> Where are we on this?

I'm generally okay with the approach used in
http://www.postgresql.org/message-id/CAM3SWZS0iV1HQ2fqNxvmTZm4+hPLAfH=7LeC4znmFX8az=ARMg@mail.gmail.com
though I have not reviewed that patch in detail.

One minor thought is that maybe it'd be better to do this:
    errno = 0;
+    ERR_clear_error();

in the other order, so as not to assume that ERR_clear_error doesn't
set errno.  On the other hand, if it does, things are probably hopelessly
broken anyway; so I'm not sure there is any case where this helps.
        regards, tom lane



Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Wed, Mar 23, 2016 at 9:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> in the other order, so as not to assume that ERR_clear_error doesn't
> set errno.  On the other hand, if it does, things are probably hopelessly
> broken anyway; so I'm not sure there is any case where this helps.

I'm fine with that.

Will this make it into the next point release? I was rather hoping it would.

-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> Will this make it into the next point release? I was rather hoping it would.

I dunno.  I certainly haven't reviewed it carefully enough to commit it.
Perhaps Peter has, but time grows short ...
        regards, tom lane



Re: Fix for OpenSSL error queue bug

From
Michael Paquier
Date:
On Sun, Mar 27, 2016 at 9:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> Will this make it into the next point release? I was rather hoping it would.
>
> I dunno.  I certainly haven't reviewed it carefully enough to commit it.
> Perhaps Peter has, but time grows short ...

I have looked at this patch. Do we need to worry as well about
SSL_shutdown in disconnection code path? I believe that we don't care
much if an error happens at this point but we surely should consume
any error generated because the SSL context is kept after
destroy_ssl_system and another connection attempt may be done using
the same SSL context, no?
-- 
Michael



Re: Fix for OpenSSL error queue bug

From
Peter Eisentraut
Date:
On 03/14/2016 09:44 PM, Peter Geoghegan wrote:
> On Mon, Mar 14, 2016 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Yes, with one small difference: I wouldn't be calling ERR_get_error()
>> in the common case where SSL_get_error() returns SSL_ERROR_NONE, on
>> the theory that skipping that case represents no risk. I'm making a
>> concession to Peter E's view that that will calling ERR_get_error()
>> more will add useless cycles.
>
> The attached patch is what I have in mind.
>
> I can produce a back-patchable variant of this if you and Peter E.
> think this approach is okay.

I think this patch is OK under the premises that we have established.

I wish we could avoid the huge, repeated comment blocks.  Perhaps we 
could put them at the top of the files once?

Also, why do you write 0UL instead of just 0?



Re: Fix for OpenSSL error queue bug

From
Peter Eisentraut
Date:
On 04/07/2016 03:47 AM, Michael Paquier wrote:
> I have looked at this patch. Do we need to worry as well about
> SSL_shutdown in disconnection code path? I believe that we don't care
> much if an error happens at this point but we surely should consume
> any error generated because the SSL context is kept after
> destroy_ssl_system and another connection attempt may be done using
> the same SSL context, no?

But we are the only user of our SSL context, and we clear the error 
before every call we make (with this patch).  The clean up afterwards is 
only if someone else is also using SSL in the same process, and they 
won't use our SSL context.



Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Thu, Apr 7, 2016 at 6:08 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> I wish we could avoid the huge, repeated comment blocks.  Perhaps we could
> put them at the top of the files once?

I'm fine with that. Do you want to take care of that, or should I?

> Also, why do you write 0UL instead of just 0?

Because the variable is declared as a raw unsigned long, IIRC. It
hardly matters.

-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Michael Paquier
Date:
On Fri, Apr 8, 2016 at 10:23 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 04/07/2016 03:47 AM, Michael Paquier wrote:
>>
>> I have looked at this patch. Do we need to worry as well about
>> SSL_shutdown in disconnection code path? I believe that we don't care
>> much if an error happens at this point but we surely should consume
>> any error generated because the SSL context is kept after
>> destroy_ssl_system and another connection attempt may be done using
>> the same SSL context, no?
>
>
> But we are the only user of our SSL context, and we clear the error before
> every call we make (with this patch).  The clean up afterwards is only if
> someone else is also using SSL in the same process, and they won't use our
> SSL context.

Argh, yes. Because SSL_free() is directly used. I should read those
docs more carefully.
-- 
Michael



Re: Fix for OpenSSL error queue bug

From
Peter Eisentraut
Date:
On 04/07/2016 09:36 PM, Peter Geoghegan wrote:
> On Thu, Apr 7, 2016 at 6:08 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> I wish we could avoid the huge, repeated comment blocks.  Perhaps we could
>> put them at the top of the files once?
>
> I'm fine with that. Do you want to take care of that, or should I?

I have committed this so that the comments are only in the first 
instance in each file.  I think that should give enough information to 
someone who is curious about the details of the error handling.

Also, I have adjusted this so that we check for n<0 after SSL_read and 
SSL_write, as you had in the frontend code but not in the backend code.




Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Fri, Apr 8, 2016 at 11:12 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> I have committed this so that the comments are only in the first instance in
> each file.  I think that should give enough information to someone who is
> curious about the details of the error handling.
>
> Also, I have adjusted this so that we check for n<0 after SSL_read and
> SSL_write, as you had in the frontend code but not in the backend code.

That seems reasonable. I'm glad we finally got this done. Thanks.


-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Fri, Apr 8, 2016 at 11:20 AM, Peter Geoghegan <pg@heroku.com> wrote:
> That seems reasonable. I'm glad we finally got this done. Thanks.

Are we going to backpatch this?


-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Fri, Apr 8, 2016 at 11:20 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> That seems reasonable. I'm glad we finally got this done. Thanks.

> Are we going to backpatch this?

Seems like a reasonable thing to do, but somebody would have to do the
legwork to produce back-branch patches.
        regards, tom lane



Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Fri, Apr 8, 2016 at 11:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Seems like a reasonable thing to do, but somebody would have to do the
> legwork to produce back-branch patches.

I'll do so soon. I was waiting on Peter E to take me up on the offer.

-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Fri, Apr 8, 2016 at 11:43 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I'll do so soon. I was waiting on Peter E to take me up on the offer.

Attached is a series of patches for each supported release branch. As
discussed, I would like to target every released version of Postgres
with this bugfix. This is causing users real pain at the moment.

Only the 9.5 backpatch was a simple, conflict-free "git cherry-pick".
Most of the effort here involved producing a clean 9.4 patch. This was
largely mechanical, if a little tricky. In release branches for
releases that preceded 9.4, there were a few further merge conflicts
as I worked backwards through the branches, but those were trivial.

I'm not sure if project policy around backpatching (that commit
messages and so on should match exactly) has anything to say about the
case where backpatching follows several weeks after commit to the
master branch. In the absence of any clear direction on that, I've
created commits that look like what Peter E might have pushed in early
April, had he decided to do that backpatch leg-work up front.

--
Peter Geoghegan

Attachment

Re: Fix for OpenSSL error queue bug

From
Michael Paquier
Date:
On Tue, Apr 26, 2016 at 9:37 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Only the 9.5 backpatch was a simple, conflict-free "git cherry-pick".
> Most of the effort here involved producing a clean 9.4 patch. This was
> largely mechanical, if a little tricky. In release branches for
> releases that preceded 9.4, there were a few further merge conflicts
> as I worked backwards through the branches, but those were trivial.

Looking again at this thread, the general agreement was to clear the
error stack before calling any SSL routine. Those patches are doing
so, and they look in good shape to me. Note: there is
SSL_do_handshake() on back-branches for the SSL renegotiation but we
don't need to bother about clearing the error queue as any error
occurring in those cases just stops the session, and we've never
bothered calling ERR_get_error there to get more details about the
errors.

> I'm not sure if project policy around backpatching (that commit
> messages and so on should match exactly) has anything to say about the
> case where backpatching follows several weeks after commit to the
> master branch. In the absence of any clear direction on that, I've
> created commits that look like what Peter E might have pushed in early
> April, had he decided to do that backpatch leg-work up front.

It seems to me that we definitely want to get this stuff backpatched
at the end. So +1 for this move.
-- 
Michael



Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Mon, Apr 25, 2016 at 6:44 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>> I'm not sure if project policy around backpatching (that commit
>> messages and so on should match exactly) has anything to say about the
>> case where backpatching follows several weeks after commit to the
>> master branch. In the absence of any clear direction on that, I've
>> created commits that look like what Peter E might have pushed in early
>> April, had he decided to do that backpatch leg-work up front.
>
> It seems to me that we definitely want to get this stuff backpatched
> at the end. So +1 for this move.

Right. This issue has a long history of causing users significant
(though often intermittent) problems. As an example, consider this
problem report from a co-worker of mine that dates back to 2012:

https://bitbucket.org/ged/ruby-pg/issues/142/async_exec-over-ssl-connection-can-fail-on

There are numerous problem reports like this that are easily found
using Google. I think that there are probably a variety of unpleasant
interactions and symptoms. Crashes are one rarer symptom, seen in
certain scenarios only (crashes are not described in the link above).

-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Alvaro Herrera
Date:
Peter Geoghegan wrote:

> I'm not sure if project policy around backpatching (that commit
> messages and so on should match exactly) has anything to say about the
> case where backpatching follows several weeks after commit to the
> master branch.

There is no value in providing exact message matches when the backpatch
occurs even after one other commit in the master branch.  The principal
reason for the requirement is so that src/tools/git_changelog is able to
merge the commits as a single entry, and that's not going to happen when
you have one or more other commits in the master branch anyway.  I find
it more productive to mention, in the backpatched commit message, what
is the commit ID in the master branch.  That way it can be more easily
be searched for, later on.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fix for OpenSSL error queue bug

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Peter Geoghegan wrote:
>> I'm not sure if project policy around backpatching (that commit
>> messages and so on should match exactly) has anything to say about the
>> case where backpatching follows several weeks after commit to the
>> master branch.

> There is no value in providing exact message matches when the backpatch
> occurs even after one other commit in the master branch.

Actually, git_changelog can merge identically-messaged commits despite
intervening commits.  It's set up to not merge commits more than 24 hours
apart, though.  We could loosen that requirement but I'm afraid it would
make things confusing to merge far-apart commits.
        regards, tom lane



Re: Fix for OpenSSL error queue bug

From
Peter Eisentraut
Date:
On 04/27/2016 11:04 PM, Tom Lane wrote:
>> There is no value in providing exact message matches when the backpatch
>> occurs even after one other commit in the master branch.
> 
> Actually, git_changelog can merge identically-messaged commits despite
> intervening commits.  It's set up to not merge commits more than 24 hours
> apart, though.  We could loosen that requirement but I'm afraid it would
> make things confusing to merge far-apart commits.

ISTM that git_changelog should be looking at the AuthorDate instead of
the CommitDate.  Then it would work correctly for backpatches done using
cherry-pick.




Re: Fix for OpenSSL error queue bug

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On 04/27/2016 11:04 PM, Tom Lane wrote:
>> Actually, git_changelog can merge identically-messaged commits despite
>> intervening commits.  It's set up to not merge commits more than 24 hours
>> apart, though.  We could loosen that requirement but I'm afraid it would
>> make things confusing to merge far-apart commits.

> ISTM that git_changelog should be looking at the AuthorDate instead of
> the CommitDate.  Then it would work correctly for backpatches done using
> cherry-pick.

Meh.  That would also make it noticeably *less* accurate for some other
scenarios.  It actually did look at AuthorDate to start with, and we
changed it because we didn't like the results; cf 7299778a9.

Also, IME, backpatching with cherry-pick fails often enough that designing
one's process around it is just asking for pain.  Certainly, many fewer
than half of my own back-patches manage to use cherry-pick.
        regards, tom lane



Re: Fix for OpenSSL error queue bug

From
Peter Eisentraut
Date:
On 4/25/16 8:37 PM, Peter Geoghegan wrote:
> Attached is a series of patches for each supported release branch. As
> discussed, I would like to target every released version of Postgres
> with this bugfix. This is causing users real pain at the moment.

All committed.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fix for OpenSSL error queue bug

From
Peter Geoghegan
Date:
On Fri, May 6, 2016 at 9:12 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> All committed.

Thanks!

This should no longer be referenced in the 9.6 release notes. It
should just appear in the next batch of point releases. Tom has an
sgml comment in the draft 9.6 release notes to that effect.

-- 
Peter Geoghegan



Re: Fix for OpenSSL error queue bug

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Fri, May 6, 2016 at 9:12 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> All committed.

> Thanks!

> This should no longer be referenced in the 9.6 release notes. It
> should just appear in the next batch of point releases. Tom has an
> sgml comment in the draft 9.6 release notes to that effect.

Yeah, I was just on that ...
        regards, tom lane