Re: Strange hanging bug in a simple milter - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Strange hanging bug in a simple milter |
Date | |
Msg-id | 522DF30E.90605@vmware.com |
In response to | Re: Strange hanging bug in a simple milter (Stephen Frost <sfrost@snowman.net>) |
Responses |
Re: Strange hanging bug in a simple milter
|
List | pgsql-hackers |
On 09.09.2013 18:20, Stephen Frost wrote:
> Vesa-Matti, Heikki,
>
> * Heikki Linnakangas (hlinnakangas@vmware.com) wrote:
>> On 09.09.2013 15:36, Vesa-Matti J Kari wrote:
>>> If I interpret this correctly, threads #2 and #3 are waiting for the same
>>> lock but they make no progress.
>>
>> A-ha, the deadlock happens while doing SSL stuff. I didn't have SSL
>> enabled in my test server. As soon as I turned it on, it hung.
>>
>> Attached is a small stand-alone test program to reproduce it. You
>> can pass a libpq connection string as argument to it.
>
> Interesting... Which version of libpq were you working against? I see
> that Vesa-Matti had the problem happen w/ 9.1.9, which should have been
> before the changes that I made to add locking around our usage of
> SSL_context, as otherwise we would end up in situations where we'd dump
> core, but he also had it with 9.3rc1, which should have included it. I
> had tested the patch w/ a pretty good amount of concurrent threads fired
> off from a little python script and didn't run into any deadlocks
> there..
>
> Vesa-Matti, was this working previously, and are you sure you were
> testing with 9.3rc1's libpq? Heikki, which are you testing against and
> perhaps you might try before and after? I'll be able to look into it
> more in a few hours also.

Thanks! I tested with git master.

I added printf()s into the pq_lockingcallback function, and got a trace
where both threads got stuck waiting for lock 10 in the pq_lockarray. It
looks like someone is failing to release it. The backtrace for both
threads looks like this:

#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fad49826f3c in _L_lock_974 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fad49826d8b in __GI___pthread_mutex_lock (mutex=0x7fad3800a260) at pthread_mutex_lock.c:64
#3  0x00007fad49a53f08 in pq_lockingcallback (line=175, file=0x7fad48fa8bb3 "x_pubkey.c", n=10, mode=<optimized out>) at fe-secure.c:872
#4  pq_lockingcallback (mode=<optimized out>, n=10, file=0x7fad48fa8bb3 "x_pubkey.c", line=175) at fe-secure.c:868
#5  0x00007fad48f396ab in X509_PUBKEY_get () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#6  0x00007fad48f56292 in X509_get_pubkey_parameters () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#7  0x00007fad48f5649c in X509_verify_cert () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#8  0x00007fad4924f14a in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#9  0x00007fad4922ce1c in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#10 0x00007fad492310d2 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#11 0x00007fad49a54c67 in open_client_SSL (conn=0x7fad380397d0) at fe-secure.c:1463
#12 pqsecure_open_client (conn=conn@entry=0x7fad380397d0) at fe-secure.c:306
#13 0x00007fad49a44fb6 in PQconnectPoll (conn=conn@entry=0x7fad380397d0) at fe-connect.c:2123
#14 0x00007fad49a4618e in connectDBComplete (conn=conn@entry=0x7fad380397d0) at fe-connect.c:1521
#15 0x00007fad49a46b47 in PQconnectdb (conninfo=<optimized out>) at fe-connect.c:516
#16 0x00000000004007b6 in test_connect (threadid=2) at threaded-connect.c:25
#17 0x000000000040086e in run_thread (arg=0x600e04 <two>) at threaded-connect.c:55
#18 0x00007fad49824e0e in start_thread (arg=0x7fad48203700) at pthread_create.c:311
#19 0x00007fad4955993d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Sometimes the lockup happens differently, with one thread hung up in
SSL_init() and another waiting for the first one on ssl_config_mutex,
for example.

A good next step might be to create a standalone program that doesn't
use libpq at all, but just calls X509_verify_cert() concurrently in two
threads. Or open plain SSL connections. If the deadlock can be
reproduced with that, then we could just report the bug to the OpenSSL
project and hope that they can figure it out.

- Heikki
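
For illustration, the printf() tracing described above can be sketched in isolation. This is only a minimal sketch and not libpq's actual pq_lockingcallback. It assumes a callback with the same (mode, n, file, line) shape seen in the backtrace, and every name prefixed with demo_ or DEMO_ is hypothetical, including the lock-mode constants and the size of the lock array. Besides printing each request, it counts acquisitions and releases per lock, so a lock that someone fails to release shows up as an imbalance.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-ins: not OpenSSL's constants, not libpq's code. */
#define DEMO_LOCK    1
#define DEMO_UNLOCK  2
#define DEMO_NLOCKS  16

static pthread_mutex_t demo_lockarray[DEMO_NLOCKS];
static long demo_acquired[DEMO_NLOCKS];  /* updated while holding lock n */
static long demo_released[DEMO_NLOCKS];  /* updated while holding lock n */
static int  demo_trace = 0;              /* set to 1 to print each request */

static void demo_init_locks(void)
{
    for (int i = 0; i < DEMO_NLOCKS; i++) {
        pthread_mutex_init(&demo_lockarray[i], NULL);
        demo_acquired[i] = 0;
        demo_released[i] = 0;
    }
}

/* Same argument shape as the callback in the backtrace. */
static void demo_lockingcallback(int mode, int n, const char *file, int line)
{
    assert(n >= 0 && n < DEMO_NLOCKS);
    if (demo_trace)
        fprintf(stderr, "%s lock %d at %s:%d\n",
                (mode & DEMO_LOCK) ? "want" : "release", n, file, line);
    if (mode & DEMO_LOCK) {
        pthread_mutex_lock(&demo_lockarray[n]);
        demo_acquired[n]++;              /* counted after the lock is held */
    } else {
        demo_released[n]++;              /* counted before the lock is dropped */
        pthread_mutex_unlock(&demo_lockarray[n]);
    }
}

/* Index of the first lock acquired more often than released, or -1. */
static int demo_first_unbalanced(void)
{
    for (int i = 0; i < DEMO_NLOCKS; i++)
        if (demo_acquired[i] != demo_released[i])
            return i;
    return -1;
}

/* Each worker takes and releases lock 10 repeatedly, in balanced pairs. */
static void *demo_worker(void *arg)
{
    (void) arg;
    for (int i = 0; i < 1000; i++) {
        demo_lockingcallback(DEMO_LOCK, 10, __FILE__, __LINE__);
        demo_lockingcallback(DEMO_UNLOCK, 10, __FILE__, __LINE__);
    }
    return NULL;
}

/* Runs two workers; returns demo_first_unbalanced(), or -2 on thread error. */
static int demo_run_two_threads(void)
{
    pthread_t t1, t2;

    demo_init_locks();
    if (pthread_create(&t1, NULL, demo_worker, NULL) != 0)
        return -2;
    if (pthread_create(&t2, NULL, demo_worker, NULL) != 0) {
        pthread_join(t1, NULL);
        return -2;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return demo_first_unbalanced();
}
```

Calling demo_run_two_threads() from a main() built with -pthread exercises the balanced case. If a caller takes a lock without the matching release, demo_first_unbalanced() returns that lock's index, which is the kind of evidence that would point at a missing unlock on lock 10. The counters are only safe to read once the worker threads have been joined.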