Re: Deadlock in libpq - Mailing list pgsql-general

From Erik Hesselink
Subject Re: Deadlock in libpq
Msg-id AANLkTinvm=4qK61JHzV9dG9N2rtxHMzQuMqzNQ=VgB0K@mail.gmail.com
In response to Re: Deadlock in libpq  (Merlin Moncure <mmoncure@gmail.com>)
List pgsql-general
On Thu, Mar 24, 2011 at 15:21, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Mar 24, 2011 at 9:07 AM, Erik Hesselink <hesselink@gmail.com> wrote:
>> On Thu, Mar 24, 2011 at 14:23, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> On Thu, Mar 24, 2011 at 4:17 AM, Erik Hesselink <hesselink@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> We're getting a deadlock in our application (a web application with a
>>>> PostgreSQL backend), which I've traced to libpq. I started our
>>>> application in gdb and, when it hung, inspected the backtraces.
>>>> I found a couple of threads I can account for (listening for new
>>>> connections, background processes) and 77 threads waiting for a mutex
>>>> lock:
>>>>
>>>> #0  0x00007ffff523d464 in __lll_lock_wait () from /lib/libpthread.so.0
>>>> #1  0x00007ffff52385d9 in _L_lock_953 () from /lib/libpthread.so.0
>>>> #2  0x00007ffff52383fb in pthread_mutex_lock () from /lib/libpthread.so.0
>>>> #3  0x00007ffff6160650 in ?? () from /usr/lib/libpq.so.5
>>>>      ==> pq_lockingcallback
>>>> #4  0x00007ffff440b791 in ?? () from /lib/libcrypto.so.0.9.8
>>>> #5  0x00007ffff440bcc9 in ?? () from /lib/libcrypto.so.0.9.8
>>>> #6  0x00007ffff47652fb in SSL_new () from /lib/libssl.so.0.9.8
>>>> #7  0x00007ffff61604dc in ?? () from /usr/lib/libpq.so.5
>>>>      ==> pqsecure_open_client
>>>> #8  0x00007ffff61525ce in PQconnectPoll () from /usr/lib/libpq.so.5
>>>> #9  0x00007ffff6152f5e in ?? () from /usr/lib/libpq.so.5
>>>>      ==> connectDBComplete
>>>> #10 0x00007ffff6153c5f in PQconnectdb () from /usr/lib/libpq.so.5
>>>> #11 0x0000000000f9b518 in sccR_info ()
>>>> #12 0x0000000000000000 in ?? ()
>>>>
>>>> So it seems everything is waiting for a lock on a mutex from
>>>> pq_lockarray (in fe-secure.c@846). Does anybody have any idea how this
>>>> can happen? Is this something we're doing wrong (I hope so) or a bug
>>>> in libpq?
>>>>
>>>> Some background: this happens only after a couple of thousand requests
>>>> (each doing about 15 database calls), with occasional other requests
>>>> coming in at the same time. Our server uses a Haskell binding to libpq
>>>> (HDBC [1] and HDBC-postgresql [2]). Both client and server run on the
>>>> same machine, running 64bit Ubuntu 10.04. The database version is
>>>> "PostgreSQL 8.4.7 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real
>>>> (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit". I'm not sure how to determine
>>>> the libpq version, but it is the most recent one that ships with this
>>>> Ubuntu release; the Ubuntu changelogs suggest 8.4.7 as well. Connections
>>>> are via TCP/IP to 127.0.0.1 with SSL turned on. The machine is under
>>>> some CPU load when this happens. There is plenty of free memory.
>>>>
>>>> When I turned off SSL or connected via Unix domain sockets, we got
>>>> different errors that may be related: occasionally, the connection
>>>> between the client (our app) and the server (database) is lost. On the
>>>> client, we get:
>>>>
>>>>    connectPostgreSQL: server closed the connection unexpectedly
>>>>    This probably means the server terminated abnormally
>>>>    before or while processing the request.
>>>>
>>>> and on the server:
>>>>
>>>>    could not send data to client: Broken pipe
>>>>
>>>> There is no further context around these messages.
>>>>
>>>> Any help would be greatly appreciated.
>>>
>>> How did you initialize SSL? You are waiting inside a lock that is
>>> getting set up inside the crypto library. Unless you are having some
>>> type of library initialization issue, I doubt the problem is
>>> really inside libpq. Is your application multithreaded, and if so, are
>>> you properly synchronizing access to the connection object, etc.?
>>
>> What exactly do you mean by "How did you initialize SSL"? I found
>> [1], which I did not know about. This seems to be a very non-local
>> problem: if one of our dependencies initializes SSL and I use libpq
>> as well, things will go wrong. I've taken a quick look through all our
>> dependencies, and none seem to use libcrypto or libssl.
>
> *Something* must be initializing SSL, or you can't make secure
> connections from libpq. You need to find out which pq SSL init
> function is being called, when it is being called, and with what
> arguments. One of the main things PQinitSSL does is set up a lock
> vector which it passes to the crypto library. The fact that you are
> having blocking issues around those locks suggests that SSL was not set
> up properly, that something happened after setup so that the locks are
> no longer good, that you have an application threading issue (although
> that sounds unlikely), or (least likely, worst case) that there is a bug
> in crypto.
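The lock vector Merlin describes can be sketched in C with pthreads. This is not libpq's code (libpq's own array is `pq_lockarray` in fe-secure.c); it is a minimal illustration under stated assumptions: `SKETCH_LOCK` is a hypothetical stand-in for OpenSSL's `CRYPTO_LOCK` mode bit, and `lock_array_init()`, `locking_callback()` and `lock_is_held()` are made-up names. In OpenSSL 0.9.8 a real application would size the array with `CRYPTO_num_locks()` and register the callback with `CRYPTO_set_locking_callback()`. What it shows: if entry n is ever taken and not released, every later call that locks n blocks in `pthread_mutex_lock()`, which would produce a pile-up of threads like the one in the backtrace earlier in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for OpenSSL's CRYPTO_LOCK mode bit. */
#define SKETCH_LOCK 1

static pthread_mutex_t *lock_array = NULL; /* one mutex per lock id */
static int lock_count = 0;

/* Allocate and initialise n mutexes. Returns 0 on success, -1 on failure. */
int lock_array_init(int n)
{
    assert(n > 0);
    lock_array = malloc((size_t) n * sizeof *lock_array);
    if (lock_array == NULL)
        return -1;
    for (int i = 0; i < n; i++)
    {
        if (pthread_mutex_init(&lock_array[i], NULL) != 0)
        {
            /* Undo the mutexes that were already initialised. */
            while (--i >= 0)
                pthread_mutex_destroy(&lock_array[i]);
            free(lock_array);
            lock_array = NULL;
            return -1;
        }
    }
    lock_count = n;
    return 0;
}

/* Same shape as an OpenSSL 0.9.8 locking callback: lock entry n when
 * the lock bit is set in mode, otherwise unlock it. */
void locking_callback(int mode, int n, const char *file, int line)
{
    (void) file; /* passed by the library for debugging only */
    (void) line;
    assert(n >= 0 && n < lock_count);
    if (mode & SKETCH_LOCK)
        pthread_mutex_lock(&lock_array[n]);
    else
        pthread_mutex_unlock(&lock_array[n]);
}

/* Returns 1 if entry n is currently held, i.e. a pthread_mutex_lock() on
 * it would block; uses trylock so the check itself never blocks. */
int lock_is_held(int n)
{
    int rc;

    assert(n >= 0 && n < lock_count);
    rc = pthread_mutex_trylock(&lock_array[n]);
    if (rc == 0)
    {
        pthread_mutex_unlock(&lock_array[n]);
        return 0;
    }
    return rc == EBUSY;
}
```

`lock_is_held()` relies on `pthread_mutex_trylock()`, so a diagnostic check never blocks on the lock it is inspecting.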

From the PostgreSQL documentation I linked to in my last post, it
seems that if I neither call PQinitOpenSSL nor initialize the
libraries myself, libpq will do it for me. Is that correct? If so,
then that is what is happening in my case.
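If an application ever takes over OpenSSL initialization itself, that work has to run exactly once, before any thread opens a connection; libpq's documented hook for telling it so is `PQinitOpenSSL(0, 0)`, called before the first connection. The sketch below illustrates only the run-exactly-once part, using `pthread_once()`. It is a generic pattern, not how libpq initializes internally. `init_crypto_once()`, `ensure_crypto_initialized()`, `run_workers()` and `crypto_init_calls` are hypothetical names, and the init body merely counts calls where a real program would initialize OpenSSL and call `PQinitOpenSSL(0, 0)`.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WORKERS 64

static pthread_once_t crypto_once = PTHREAD_ONCE_INIT;
static int crypto_init_calls = 0; /* how many times the init body ran */

/* Hypothetical init body. A real program would initialise OpenSSL here
 * (including its locking callbacks) and then call PQinitOpenSSL(0, 0),
 * all before the first connection is opened. */
static void init_crypto_once(void)
{
    crypto_init_calls++;
}

/* Safe to call from any number of threads; the body runs exactly once. */
void ensure_crypto_initialized(void)
{
    pthread_once(&crypto_once, init_crypto_once);
}

/* Thread entry point: each worker initialises before it would connect. */
static void *worker(void *arg)
{
    (void) arg;
    ensure_crypto_initialized();
    return NULL;
}

/* Start n threads that race to initialise, then wait for all of them.
 * Returns 0 on success, -1 if a thread could not be created. */
int run_workers(int n)
{
    pthread_t threads[MAX_WORKERS];
    int started = 0;
    int rc = 0;

    assert(n > 0 && n <= MAX_WORKERS);
    for (; started < n; started++)
    {
        if (pthread_create(&threads[started], NULL, worker, NULL) != 0)
        {
            rc = -1;
            break;
        }
    }
    /* Join every thread that did start, even after a creation failure. */
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return rc;
}
```

Calling `run_workers(16)` starts sixteen threads that race into `ensure_crypto_initialized()`; afterwards `crypto_init_calls` is 1, and later calls do not run the body again.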

Regards,

--
Erik Hesselink
http://silkapp.com
