Re: Client deadlocks when connecting via ssl - Mailing list pgsql-bugs

From Heikki Linnakangas
Subject Re: Client deadlocks when connecting via ssl
Date
Msg-id 5436991B.5020708@vmware.com
Whole thread Raw
In response to Client deadlocks when connecting via ssl  (Amjith Ramanujam <amjith.r@gmail.com>)
Responses Re: Client deadlocks when connecting via ssl
List pgsql-bugs
On 10/08/2014 11:47 PM, Amjith Ramanujam wrote:
> Hi,
>
> I'm accessing Postgresql from Python using Psycopg2. I'm making multiple
> calls from threads and once every few times it causes the script to lock up
> and become unresponsive. This only happens if I'm also making HTTPS calls
> in addition to Postgres TCP calls with SSL.
>
> I've captured stack traces from the locked process using gdb (see
> attachment). They all seem to point to libpq
>
> #0  0x00007fff95d6f746 in __psynch_mutexwait () from
> /usr/lib/system/libsystem_kernel.dylib
> #1  0x00007fff963d6779 in _pthread_mutex_lock () from
> /usr/lib/system/libsystem_pthread.dylib
> #2  0x000000010fad5edf in pq_lockingcallback () from
> /usr/local/lib/libpq.5.dylib
> #3  0x000000010f90903d in CRYPTO_add_lock () from
> /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib
> #4  0x000000010f9ba528 in X509_verify_cert () from
> /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib
> #5  0x000000010f8dc882 in ssl_verify_cert_chain () from
> /usr/local/opt/openssl/lib/libssl.1.0.0.dylib
> #6  0x000000010f8bd611 in ssl3_get_server_certificate () from
> /usr/local/opt/openssl/lib/libssl.1.0.0.dylib
> #7  0x000000010f8bc427 in ssl3_connect () from
> /usr/local/opt/openssl/lib/libssl.1.0.0.dylib
> #8  0x000000010fad522d in pqsecure_open_client () from
> /usr/local/lib/libpq.5.dylib
>
>
> System Details:
>
> OSX Maverics
> psql (PostgreSQL) 9.3.4
> psycopg2 version 2.5.4
>
> I'm also attaching the script (pg_hammer.py) that I used to trigger this
> issue. The script spins up 20 threads and makes Postgres calls. Once every
> 10 iterations it also makes a call to https://google.com. It creates a log
> file 'db_stats1.log' which is written to every time a call to Postgres is
> successful.
>
> The dead lock doesn't happen every time, but if you relaunch the script a
> few times you'll see the issue pop up right away where the log file stops
> spewing lines.

Thanks, I was able to easily reproduce this with the script.

I added some debugging information to libpq's pq_lockingcallback
function, to print a line whenever a lock is acquired or released. This
is how this happens:

Thread 1: Opens a connection with libpq. libpq calls
CRYPTO_set_id_callback and CRYPTO_set_locking_callback, as
multi-threaded applications that use OpenSSL are required to do.

Thread 2: Begins to open a non-libpq SSL connection. During this, it
acquires a lock, by calling the libpq callback.

Thread 1: Closes the libpq connection, and unregisters the callback.

Thread 2: Finishes the operation, and releases the lock. But the libpq
callback is no longer registered, so it turns into a no-op. The pthread
mutex that libpq created is still held.

Thread 1: Opens another libpq connection, registering the callbacks again.

Thread 2: Begins another SSL operation. It tries to acquire the same
lock as before, and calls the libpq callback. It blocks forever, because
the thread is still holding the lock.


I'm not sure what to do about this. We could not unregister the
callbacks, when the last libpq connection is closed, but there's a
comment in destroy_ssl_system explaining why that would be bad (if libpq
is later unloaded, the callback pointers would be left dangling).
Perhaps we could leave the callbacks registered when the last connection
is closed, and add a _fini() function that unregisters them when the
library is unloaded. I'm not sure how portable that is, though.

Another idea is to check if any of the locks are held, when the last
libpq connection is closed, and leave the callbacks in place if there are.

Hmm. I wonder how other libraries that use libpq handle this. libpq
can't be the only one having this problem. It's pretty annoying that
OpenSSL requires these callbacks...

- Heikki

pgsql-bugs by date:

Previous
From: Andres Freund
Date:
Subject: Re: BUG #10675: alter database set tablespace and unlogged table
Next
From: Heikki Linnakangas
Date:
Subject: Re: BUG #11603: replication, pg_basebackup and high load