Re: Strange hanging bug in a simple milter - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Strange hanging bug in a simple milter
Date
Msg-id 522DF30E.90605@vmware.com
Whole thread Raw
In response to Re: Strange hanging bug in a simple milter  (Stephen Frost <sfrost@snowman.net>)
Responses Re: Strange hanging bug in a simple milter
List pgsql-hackers
On 09.09.2013 18:20, Stephen Frost wrote:
> Vesa-Matti, Heikki,
>
> * Heikki Linnakangas (hlinnakangas@vmware.com) wrote:
>> On 09.09.2013 15:36, Vesa-Matti J Kari wrote:
>>> If I interpret this correctly, threads #2 and #3 are waiting for the same
>>> lock but they make no progress.
>>
>> A-ha, the deadlock happens while doing SSL stuff. I didn't have SSL
>> enabled in my test server. As soon as I turned it on, it hung.
>>
>> Attached is a small stand-alone test program to reproduce it. You
>> can pass a libpq connection string as argument to it.
>
> Interesting...  Which version of libpq were you working against?  I see
> that Vesa-Matti had the problem happen w/ 9.1.9, which should have been
> before the changes that I made to add locking around our usage of
> SSL_context, as otherwise we would end up in situations where we'd dump
> core, but he also had it with 9.3rc1, which should have included it.  I
> had tested the patch w/ a pretty good amount of concurrent threads fired
> off from a little python script and didn't run into any deadlocks
> there..

> Vesa-Matti, was this working previously, and are you sure you were
> testing with 9.3rc1's libpq?  Heikki, which are you testing against and
> perhaps you might try before and after?  I'll be able to look into it
> more in a few hours also.

Thanks! I tested with git master.

I added printf()s into the pq_lockingcallback function, and got a trace
where both threads got stuck waiting for lock 10 in the pq_lockarray. It
looks like someone is failing to release it. The backtrace for both
threads look like this:

#0  __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fad49826f3c in _L_lock_974 () from
/lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fad49826d8b in __GI___pthread_mutex_lock
(mutex=0x7fad3800a260) at pthread_mutex_lock.c:64
#3  0x00007fad49a53f08 in pq_lockingcallback (line=175,
file=0x7fad48fa8bb3 "x_pubkey.c", n=10, mode=<optimized out>) at
fe-secure.c:872
#4  pq_lockingcallback (mode=<optimized out>, n=10, file=0x7fad48fa8bb3
"x_pubkey.c", line=175) at fe-secure.c:868
#5  0x00007fad48f396ab in X509_PUBKEY_get () from
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#6  0x00007fad48f56292 in X509_get_pubkey_parameters () from
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#7  0x00007fad48f5649c in X509_verify_cert () from
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#8  0x00007fad4924f14a in ?? () from
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#9  0x00007fad4922ce1c in ?? () from
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#10 0x00007fad492310d2 in ?? () from
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#11 0x00007fad49a54c67 in open_client_SSL (conn=0x7fad380397d0) at
fe-secure.c:1463
#12 pqsecure_open_client (conn=conn@entry=0x7fad380397d0) at fe-secure.c:306
#13 0x00007fad49a44fb6 in PQconnectPoll (conn=conn@entry=0x7fad380397d0)
at fe-connect.c:2123
#14 0x00007fad49a4618e in connectDBComplete
(conn=conn@entry=0x7fad380397d0) at fe-connect.c:1521
#15 0x00007fad49a46b47 in PQconnectdb (conninfo=<optimized out>) at
fe-connect.c:516
#16 0x00000000004007b6 in test_connect (threadid=2) at threaded-connect.c:25
#17 0x000000000040086e in run_thread (arg=0x600e04 <two>) at
threaded-connect.c:55
#18 0x00007fad49824e0e in start_thread (arg=0x7fad48203700) at
pthread_create.c:311
#19 0x00007fad4955993d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Sometimes the lockup happens differently, with one thread hung up in
SSL_init() and another waiting for the first one on ssl_config_mutex,
for example.

A good next step might be to create a standalone program that doesn't
use libpq at all, but just calls X509_verify_cert() concurrently in two
threads. Or open plain SSL connections. If the deadlock can be
reproduced with that, then we could just report the bug to the OpenSSL
and hope that they can figure it out.

- Heikki

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: improve Chinese locale performance
Next
From: Robert Haas
Date:
Subject: Re: dynamic shared memory