Thread: Failing SSL connection due to weird interaction with openssl
While investigating a ruby-pg issue [1], we noticed that a libpq SSL connection can fail, if the running application uses OpenSSL for other work, too. Root cause is the thread local error queue of OpenSSL, that is used to transmit textual error messages to the application after a failed crypto operation. In case that the application leaves errors on the queue, the communication to the PostgreSQL server can fail with a message left from the previous failed OpenSSL operation, in particular when using non-blocking operations on the socket. This issue with openssl is quite old now - see [3]. For [1] it turned out that the issue is subdivided into these three parts: 1. the ruby-openssl binding does not clear the thread local error queue of OpenSSL after a certificate verify 2. OpenSSL makes use of a shared error queue for different crypto contexts. 3. libpq does not ensure a cleared error queue when doing SSL_* calls To 1: Remaining messages on the error queue can generally lead to failing operations, later on. I'd talk to the ruby-openssl developers, to discuss how we can avoid any remaining messages on the queue. To 2: SSL_get_error() inspects the shared error queue under some conditions. It's maybe poor API design, but it's documented behaviour [2]. So we certainly have to get along with it. To 3: To make libpq independent to a previous error state, the error queue might be cleared with a call to ERR_clear_error() prior SSL_connect/read/write as in the attached trivial patch. This would make libpq robust against other uses of openssl within the application. What do you think about clearing the OpenSSL error queue in libpq in that way? [1] https://bitbucket.org/ged/ruby-pg/issue/142/async_exec-over-ssl-connection-can-fail-on [2] http://www.openssl.org/docs/ssl/SSL_get_error.html [3] http://www.educatedguesswork.org/movabletype/archives/2005/03/curse_you_opens.html
Attachment
On Tue, Oct 23, 2012 at 4:09 AM, Lars Kanis <lars@greiz-reinsdorf.de> wrote: > While investigating a ruby-pg issue [1], we noticed that a libpq SSL > connection can fail, if the running application uses OpenSSL for other work, > too. Root cause is the thread local error queue of OpenSSL, that is used to > transmit textual error messages to the application after a failed crypto > operation. In case that the application leaves errors on the queue, the > communication to the PostgreSQL server can fail with a message left from the > previous failed OpenSSL operation, in particular when using non-blocking > operations on the socket. This issue with openssl is quite old now - see > [3]. > > For [1] it turned out that the issue is subdivided into these three parts: > 1. the ruby-openssl binding does not clear the thread local error queue of > OpenSSL after a certificate verify > 2. OpenSSL makes use of a shared error queue for different crypto contexts. > 3. libpq does not ensure a cleared error queue when doing SSL_* calls > > To 1: Remaining messages on the error queue can generally lead to failing > operations, later on. I'd talk to the ruby-openssl developers, to discuss > how we can avoid any remaining messages on the queue. > > To 2: SSL_get_error() inspects the shared error queue under some conditions. > It's maybe poor API design, but it's documented behaviour [2]. So we > certainly have to get along with it. > > To 3: To make libpq independent to a previous error state, the error queue > might be cleared with a call to ERR_clear_error() prior > SSL_connect/read/write as in the attached trivial patch. This would make > libpq robust against other uses of openssl within the application. > > What do you think about clearing the OpenSSL error queue in libpq in that > way? Shouldn't it be the job of whatever code is consuming the error to clear the error queue afterwards? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Am 06.11.2012 21:40, schrieb Robert Haas: > On Tue, Oct 23, 2012 at 4:09 AM, Lars Kanis <lars@greiz-reinsdorf.de> wrote: >> While investigating a ruby-pg issue [1], we noticed that a libpq SSL >> connection can fail, if the running application uses OpenSSL for other work, >> too. Root cause is the thread local error queue of OpenSSL, that is used to >> transmit textual error messages to the application after a failed crypto >> operation. In case that the application leaves errors on the queue, the >> communication to the PostgreSQL server can fail with a message left from the >> previous failed OpenSSL operation, in particular when using non-blocking >> operations on the socket. This issue with openssl is quite old now - see >> [3]. >> >> For [1] it turned out that the issue is subdivided into these three parts: >> 1. the ruby-openssl binding does not clear the thread local error queue of >> OpenSSL after a certificate verify >> 2. OpenSSL makes use of a shared error queue for different crypto contexts. >> 3. libpq does not ensure a cleared error queue when doing SSL_* calls >> >> To 1: Remaining messages on the error queue can generally lead to failing >> operations, later on. I'd talk to the ruby-openssl developers, to discuss >> how we can avoid any remaining messages on the queue. >> >> To 2: SSL_get_error() inspects the shared error queue under some conditions. >> It's maybe poor API design, but it's documented behaviour [2]. So we >> certainly have to get along with it. >> >> To 3: To make libpq independent to a previous error state, the error queue >> might be cleared with a call to ERR_clear_error() prior >> SSL_connect/read/write as in the attached trivial patch. This would make >> libpq robust against other uses of openssl within the application. >> >> What do you think about clearing the OpenSSL error queue in libpq in that >> way? > Shouldn't it be the job of whatever code is consuming the error to > clear the error queue afterwards? > Yes, of course. I already filed a bug for ruby-openssl, some weeks ago [1]. But IMHO libpq should be changed too, for the following reasons: 1. The behavior of libpq isn't consistent, since blocking calls are already agnostic to remaining errors in the openssl queue, but non-blocking are not. This is a openssl quirk, that is exposed to the libpq-API, this way. 2. libpq throws wrong errors. The error of libpq isn't "Remaining errors in openssl error queue. libpq requires a clear error queue in order to work correctly.", but instead it throws arbitrary foreign errors that could relate to or may not relate to the communication of libpq. The documentation for SSL_get_error(3) is pretty unambiguous about the need to clear the error queue first. 3. The sensitivity of libpq to the error queue can lead to bugs that are hard to track down, like this one [2]. This is because a libpq error leads the developer to look for a bug related to the database connection, although the issue is in a very different part of the code. Regards, Lars [1] http://bugs.ruby-lang.org/issues/7215 [2] https://bitbucket.org/ged/ruby-pg/issue/142/async_exec-over-ssl-connection-can-fail-on
Lars Kanis wrote: > While investigating a ruby-pg issue [1], we noticed that a libpq SSL > connection can fail, if the running application uses OpenSSL for > other work, too. Root cause is the thread local error queue of > OpenSSL, that is used to transmit textual error messages to the > application after a failed crypto operation. In case that the > application leaves errors on the queue, the communication to the > PostgreSQL server can fail with a message left from the previous > failed OpenSSL operation, in particular when using non-blocking > operations on the socket. This issue with openssl is quite old now - > see [3]. I gather that this is supposed to be back-patched to all supported branches. > [3] http://www.educatedguesswork.org/movabletype/archives/2005/03/curse_you_opens.html This link is dead. Here's one that works: http://www.educatedguesswork.org/2005/03/curse_you_opens.html -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > I gather that this is supposed to be back-patched to all supported > branches. FWIW, I don't like that patch any better than Robert does. It seems as likely to do harm as good. If there are places where libpq itself is leaving entries on the error stack, we should fix them -- retail. But it's not our business to clobber global state because there might be bugs in some other part of an application. regards, tom lane
On 2012-11-26 21:45:32 -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > I gather that this is supposed to be back-patched to all supported > > branches. > > FWIW, I don't like that patch any better than Robert does. It seems > as likely to do harm as good. If there are places where libpq itself > is leaving entries on the error stack, we should fix them -- retail. > But it's not our business to clobber global state because there might > be bugs in some other part of an application. As there hasn't been any new input since this comment I am marking the patch as "Rejected" in the CF application. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Dec 8, 2012 at 11:07 AM, Andres Freund <andres@2ndquadrant.com> wrote: > As there hasn't been any new input since this comment I am marking the > patch as "Rejected" in the CF application. Sounds good. FWIW, even if we were going to accept this, I can't imagine back-patching it. Users will come after us with pitchforks if we change something like this in a minor release, and for good reason.This could utterly break working applications in afashion that requires code changes and a recompile to fix. That is not a nice kind of thing for a shared library to do as part of a security/bug fix update. If you ask me, the problem here is that OpenSSL's error-reporting mechanism is just plain badly designed. I remember programming in BASIC back in the 80s and thinking to myself: what kind of a stupid error-handling interface is ON ERROR GOTO? And can I pummel the person who came up with it? This is basically a throwback to that sort of design, where your error-handlers get to guess where exactly the program was when the exception happened. You can make it work if you try hard enough, but you sure have to try hard. Frankly, I don't have a lot of hope of making things a whole lot better here no matter what we do. FWICS, this kind of problem is endemic in OpenSSL, which also doesn't seem to believe in comprehensive documentation or code comments. It would be nice if we had an API to some other, less crappy encryption library; or maybe even some generic API that lets you easily wire it into any library you happen to wish to use. Not that I'm volunteering to write the patch... :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > FWICS, this kind of problem is endemic in OpenSSL, which > also doesn't seem to believe in comprehensive documentation or code > comments. It would be nice if we had an API to some other, less > crappy encryption library; or maybe even some generic API that lets > you easily wire it into any library you happen to wish to use. Awhile back Red Hat was trying to get people to switch to NSS or GnuTLS, which apparently are better designed. > Not that I'm volunteering to write the patch... :-( Me either ... and in fact the lack of interest among upstreams in rewriting their TLS code is what made the aforesaid effort crash and burn. But FWIW, there are better alternatives out there. regards, tom lane