Thread: Client deadlocks when connecting via ssl

Client deadlocks when connecting via ssl

From
Amjith Ramanujam
Date:
Hi,

I'm accessing Postgresql from Python using Psycopg2. I'm making multiple calls from threads and once every few times it causes the script to lock up and become unresponsive. This only happens if I'm also making HTTPS calls in addition to Postgres TCP calls with SSL.

I've captured stack traces from the locked process using gdb (see attachment). They all seem to point to libpq 

#0  0x00007fff95d6f746 in __psynch_mutexwait () from /usr/lib/system/libsystem_kernel.dylib
#1  0x00007fff963d6779 in _pthread_mutex_lock () from /usr/lib/system/libsystem_pthread.dylib
#2  0x000000010fad5edf in pq_lockingcallback () from /usr/local/lib/libpq.5.dylib
#3  0x000000010f90903d in CRYPTO_add_lock () from /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib
#4  0x000000010f9ba528 in X509_verify_cert () from /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib
#5  0x000000010f8dc882 in ssl_verify_cert_chain () from /usr/local/opt/openssl/lib/libssl.1.0.0.dylib
#6  0x000000010f8bd611 in ssl3_get_server_certificate () from /usr/local/opt/openssl/lib/libssl.1.0.0.dylib
#7  0x000000010f8bc427 in ssl3_connect () from /usr/local/opt/openssl/lib/libssl.1.0.0.dylib
#8  0x000000010fad522d in pqsecure_open_client () from /usr/local/lib/libpq.5.dylib


System Details:

OSX Maverics
psql (PostgreSQL) 9.3.4
psycopg2 version 2.5.4

I'm also attaching the script (pg_hammer.py) that I used to trigger this issue. The script spins up 20 threads and makes Postgres calls. Once every 10 iterations it also makes a call to https://google.com. It creates a log file 'db_stats1.log' which is written to every time a call to Postgres is successful. 

The dead lock doesn't happen every time, but if you relaunch the script a few times you'll see the issue pop up right away where the log file stops spewing lines. 

You might have to install the requests library (using `pip install requests`) in order to run the script. 

Please let me know if you need more details. 

Cheers!
Amjith
Attachment

Re: Client deadlocks when connecting via ssl

From
Heikki Linnakangas
Date:
On 10/08/2014 11:47 PM, Amjith Ramanujam wrote:
> Hi,
>
> I'm accessing Postgresql from Python using Psycopg2. I'm making multiple
> calls from threads and once every few times it causes the script to lock up
> and become unresponsive. This only happens if I'm also making HTTPS calls
> in addition to Postgres TCP calls with SSL.
>
> I've captured stack traces from the locked process using gdb (see
> attachment). They all seem to point to libpq
>
> #0  0x00007fff95d6f746 in __psynch_mutexwait () from
> /usr/lib/system/libsystem_kernel.dylib
> #1  0x00007fff963d6779 in _pthread_mutex_lock () from
> /usr/lib/system/libsystem_pthread.dylib
> #2  0x000000010fad5edf in pq_lockingcallback () from
> /usr/local/lib/libpq.5.dylib
> #3  0x000000010f90903d in CRYPTO_add_lock () from
> /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib
> #4  0x000000010f9ba528 in X509_verify_cert () from
> /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib
> #5  0x000000010f8dc882 in ssl_verify_cert_chain () from
> /usr/local/opt/openssl/lib/libssl.1.0.0.dylib
> #6  0x000000010f8bd611 in ssl3_get_server_certificate () from
> /usr/local/opt/openssl/lib/libssl.1.0.0.dylib
> #7  0x000000010f8bc427 in ssl3_connect () from
> /usr/local/opt/openssl/lib/libssl.1.0.0.dylib
> #8  0x000000010fad522d in pqsecure_open_client () from
> /usr/local/lib/libpq.5.dylib
>
>
> System Details:
>
> OSX Maverics
> psql (PostgreSQL) 9.3.4
> psycopg2 version 2.5.4
>
> I'm also attaching the script (pg_hammer.py) that I used to trigger this
> issue. The script spins up 20 threads and makes Postgres calls. Once every
> 10 iterations it also makes a call to https://google.com. It creates a log
> file 'db_stats1.log' which is written to every time a call to Postgres is
> successful.
>
> The dead lock doesn't happen every time, but if you relaunch the script a
> few times you'll see the issue pop up right away where the log file stops
> spewing lines.

Thanks, I was able to easily reproduce this with the script.

I added some debugging information to libpq's pq_lockingcallback
function, to print a line whenever a lock is acquired or released. This
is how this happens:

Thread 1: Opens a connection with libpq. libpq calls
CRYPTO_set_id_callback and CRYPTO_set_locking_callback, as
multi-threaded applications that use OpenSSL are required to do.

Thread 2: Begins to open a non-libpq SSL connection. During this, it
acquires a lock, by calling the libpq callback.

Thread 1: Closes the libpq connection, and unregisters the callback.

Thread 2: Finishes the operation, and releases the lock. But the libpq
callback is no longer registered, so it turns into a no-op. The pthread
mutex that libpq created is still held.

Thread 1: Opens another libpq connection, registering the callbacks again.

Thread 2: Begins another SSL operation. It tries to acquire the same
lock as before, and calls the libpq callback. It blocks forever, because
the thread is still holding the lock.


I'm not sure what to do about this. We could not unregister the
callbacks, when the last libpq connection is closed, but there's a
comment in destroy_ssl_system explaining why that would be bad (if libpq
is later unloaded, the callback pointers would be left dangling).
Perhaps we could leave the callbacks registered when the last connection
is closed, and add a _fini() function that unregisters them when the
library is unloaded. I'm not sure how portable that is, though.

Another idea is to check if any of the locks are held, when the last
libpq connection is closed, and leave the callbacks in place if there are.

Hmm. I wonder how other libraries that use libpq handle this. libpq
can't be the only one having this problem. It's pretty annoying that
OpenSSL requires these callbacks...

- Heikki

Re: Client deadlocks when connecting via ssl

From
Amjith Ramanujam
Date:
On Thu, Oct 9, 2014 at 7:18 AM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

> Thanks, I was able to easily reproduce this with the script.
>

I'm glad you were able to reproduce it.


> I'm not sure what to do about this. We could not unregister the callbacks,
> when the last libpq connection is closed, but there's a comment in
> destroy_ssl_system explaining why that would be bad (if libpq is later
> unloaded, the callback pointers would be left dangling). Perhaps we could
> leave the callbacks registered when the last connection is closed, and add
> a _fini() function that unregisters them when the library is unloaded. I'm
> not sure how portable that is, though.
>
> Another idea is to check if any of the locks are held, when the last libpq
> connection is closed, and leave the callbacks in place if there are.
>

I'm not informed enough to suggest a solution in this case. But I'll be
glad to stress test it again if a patch were made available.

Let me know if I could be of more help.

Cheers!
Amjith

Re: Client deadlocks when connecting via ssl

From
"Peter J. Farrell"
Date:
Amjith Ramanujam wrote on 10/08/2014 03:47 PM:
> I'm accessing Postgresql from Python using Psycopg2. I'm making
> multiple calls from threads and once every few times it causes the
> script to lock up and become unresponsive. This only happens if I'm
> also making HTTPS calls in addition to Postgres TCP calls with SSL.

Just wanted to add that we're seeing this libpq pq_lockingcallback bug
in our production web apps on Heroku using Gunicorn, New Relic and
Postgres on Heroku.  We've relieved some of the symptoms by implementing
database connection pooling with high timeouts and switching from
Gunicorn to UWsgi so we are Harakiri rogue workers that are deadlocked.
Hoping this is fixed eventually...

Initial Background:
http://blog.maestropublishing.com/2015/01/12/gunicorn-dyno-death-spiral-on-heroku/

Discovery of Bug:
http://blog.maestropublishing.com/2015/01/16/gunicorn-dyno-death-spiral-on-heroku-part-ii/

--
Peter J. Farrell
Principal Technologist - GreatBizTools, LLC
651-747-1250 x101

* Learn about VSRE. I prioritize emails with VSRE in the subject!  http://vsre.info/
* Please do not send me Microsoft Office/Apple iWork documents. Send OpenDocument instead!
http://fsf.org/campaigns/opendocument/

Re: Client deadlocks when connecting via ssl

From
Daniele Varrazzo
Date:
On Sat, Apr 25, 2015 at 12:08 AM, Peter J. Farrell
<pfarrell@greatbiztools.com> wrote:
> Amjith Ramanujam wrote on 10/08/2014 03:47 PM:
>> I'm accessing Postgresql from Python using Psycopg2. I'm making
>> multiple calls from threads and once every few times it causes the
>> script to lock up and become unresponsive. This only happens if I'm
>> also making HTTPS calls in addition to Postgres TCP calls with SSL.
>
> Just wanted to add that we're seeing this libpq pq_lockingcallback bug
> in our production web apps on Heroku using Gunicorn, New Relic and
> Postgres on Heroku.  We've relieved some of the symptoms by implementing
> database connection pooling with high timeouts and switching from
> Gunicorn to UWsgi so we are Harakiri rogue workers that are deadlocked.
> Hoping this is fixed eventually...
>
> Initial Background:
> http://blog.maestropublishing.com/2015/01/12/gunicorn-dyno-death-spiral-on-heroku/
>
> Discovery of Bug:
> http://blog.maestropublishing.com/2015/01/16/gunicorn-dyno-death-spiral-on-heroku-part-ii/

Hello,

there is a pull request open on psycopg that could be what needed to
fix this problem: https://github.com/psycopg/psycopg2/pull/290

Could you please test if it fixes the issue? I can merge the patch and
release 2.6.1 in a few days.

Thank you,

-- Daniele