Thread: Deadlock in libpq

Deadlock in libpq

From
Erik Hesselink
Date:
Hi,

We're getting a deadlock in our application (a web application with a
PostgreSQL backend) which I've traced to libpq. I've started our
application in gdb, and when it hangs, I've inspected the backtraces.
I've found a couple of threads I can account for (listening for new
connections, background processes) and 77 threads waiting for a mutex
lock:

#0  0x00007ffff523d464 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x00007ffff52385d9 in _L_lock_953 () from /lib/libpthread.so.0
#2  0x00007ffff52383fb in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x00007ffff6160650 in ?? () from /usr/lib/libpq.so.5
      ==> pg_lockingcallback
#4  0x00007ffff440b791 in ?? () from /lib/libcrypto.so.0.9.8
#5  0x00007ffff440bcc9 in ?? () from /lib/libcrypto.so.0.9.8
#6  0x00007ffff47652fb in SSL_new () from /lib/libssl.so.0.9.8
#7  0x00007ffff61604dc in ?? () from /usr/lib/libpq.so.5
      ==> pqsecure_open_client
#8  0x00007ffff61525ce in PQconnectPoll () from /usr/lib/libpq.so.5
#9  0x00007ffff6152f5e in ?? () from /usr/lib/libpq.so.5
      ==> connectDBComplete
#10 0x00007ffff6153c5f in PQconnectdb () from /usr/lib/libpq.so.5
#11 0x0000000000f9b518 in sccR_info ()
#12 0x0000000000000000 in ?? ()

So it seems everything is waiting for a lock on a mutex from
pq_lockarray (in fe-secure.c@846). Does anybody have any idea how this
can happen? Is this something we're doing wrong (I hope so) or a bug
in libpq?

Some background: this happens only after a couple of thousand requests
(each doing about 15 database calls), with occasional other requests
coming in at the same time. Our server uses a Haskell binding to libpq
(HDBC [1] and HDBC-postgresql [2]). Both client and server run on the
same machine, running 64bit Ubuntu 10.04. The database version is
"PostgreSQL 8.4.7 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real
(Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit". I'm not sure how to determine
the libpq version, but it is the most recent that comes with this
ubuntu. The changelogs for Ubuntu suggest 8.4.7 as well. Connections
are via TCP/IP to 127.0.0.1 with SSL turned on. The machine is under
some CPU load when this happens. There is plenty of free memory.

When I turned off SSL or connect via domain sockets, we got different
errors that are possibly related: occasionally, the connection between
client (our app) and server (database) is lost. On the client, we get:

    connectPostgreSQL: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

and on the server:

    could not send data to client: Broken pipe

There is no further context around these messages.

Any help would be greatly appreciated.

Sincerely,

--
Erik Hesselink
http://silkapp.com

[1] http://hackage.haskell.org/package/HDBC
[2] http://hackage.haskell.org/package/HDBC-postgresql

Re: Deadlock in libpq

From
Merlin Moncure
Date:
On Thu, Mar 24, 2011 at 4:17 AM, Erik Hesselink <hesselink@gmail.com> wrote:
> Hi,
>
> We're getting a deadlock in our application (a web application with a
> PostgreSQL backend) which I've traced to libpq. I've started our
> application in gdb, and when it hangs, I've inspected the backtraces.
> I've found a couple of threads I can account for (listening for new
> connections, background processes) and 77 threads waiting for a mutex
> lock:
>
> #0  0x00007ffff523d464 in __lll_lock_wait () from /lib/libpthread.so.0
> #1  0x00007ffff52385d9 in _L_lock_953 () from /lib/libpthread.so.0
> #2  0x00007ffff52383fb in pthread_mutex_lock () from /lib/libpthread.so.0
> #3  0x00007ffff6160650 in ?? () from /usr/lib/libpq.so.5
>      ==> pg_lockingcallback
> #4  0x00007ffff440b791 in ?? () from /lib/libcrypto.so.0.9.8
> #5  0x00007ffff440bcc9 in ?? () from /lib/libcrypto.so.0.9.8
> #6  0x00007ffff47652fb in SSL_new () from /lib/libssl.so.0.9.8
> #7  0x00007ffff61604dc in ?? () from /usr/lib/libpq.so.5
>      ==> pqsecure_open_client
> #8  0x00007ffff61525ce in PQconnectPoll () from /usr/lib/libpq.so.5
> #9  0x00007ffff6152f5e in ?? () from /usr/lib/libpq.so.5
>      ==> connectDBComplete
> #10 0x00007ffff6153c5f in PQconnectdb () from /usr/lib/libpq.so.5
> #11 0x0000000000f9b518 in sccR_info ()
> #12 0x0000000000000000 in ?? ()
>
> So it seems everything is waiting for a lock on a mutex from
> pq_lockarray (in fe-secure.c@846). Does anybody have any idea how this
> can happen? Is this something we're doing wrong (I hope so) or a bug
> in libpq?
>
> Some background: this happens only after a couple of thousand requests
> (each doing about 15 database calls), with occasional other requests
> coming in at the same time. Our server uses a Haskell binding to libpq
> (HDBC [1] and HDBC-postgresql [2]). Both client and server run on the
> same machine, running 64bit Ubuntu 10.04. The database version is
> "PostgreSQL 8.4.7 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real
> (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit". I'm not sure how to determine
> the libpq version, but it is the most recent that comes with this
> ubuntu. The changelogs for Ubuntu suggest 8.4.7 as well. Connections
> are via TCP/IP to 127.0.0.1 with SSL turned on. The machine is under
> some CPU load when this happens. There is plenty of free memory.
>
> When I turned off SSL or connect via domain sockets, we got different
> errors that are possibly related: occasionally, the connection between
> client (our app) and server (database) is lost. On the client, we get:
>
>    connectPostgreSQL: server closed the connection unexpectedly
>    This probably means the server terminated abnormally
>    before or while processing the request.
>
> and on the server:
>
>    could not send data to client: Broken pipe
>
> There is no further context around these messages.
>
> Any help would be greatly appreciated.

How did you initialize ssl?   You are waiting inside a lock that is
getting set up inside the crypto library.  Unless you are having some
type of library initialization issue, I'm suspicious the problem is
really inside libpq.  Is your application multithreaded, and if so are
you properly synchronizing access to the connection object, etc?

merlin

Re: Deadlock in libpq

From
Erik Hesselink
Date:
On Thu, Mar 24, 2011 at 14:23, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Mar 24, 2011 at 4:17 AM, Erik Hesselink <hesselink@gmail.com> wrote:
>> Hi,
>>
>> We're getting a deadlock in our application (a web application with a
>> PostgreSQL backend) which I've traced to libpq. I've started our
>> application in gdb, and when it hangs, I've inspected the backtraces.
>> I've found a couple of threads I can account for (listening for new
>> connections, background processes) and 77 threads waiting for a mutex
>> lock:
>>
>> #0  0x00007ffff523d464 in __lll_lock_wait () from /lib/libpthread.so.0
>> #1  0x00007ffff52385d9 in _L_lock_953 () from /lib/libpthread.so.0
>> #2  0x00007ffff52383fb in pthread_mutex_lock () from /lib/libpthread.so.0
>> #3  0x00007ffff6160650 in ?? () from /usr/lib/libpq.so.5
>>      ==> pg_lockingcallback
>> #4  0x00007ffff440b791 in ?? () from /lib/libcrypto.so.0.9.8
>> #5  0x00007ffff440bcc9 in ?? () from /lib/libcrypto.so.0.9.8
>> #6  0x00007ffff47652fb in SSL_new () from /lib/libssl.so.0.9.8
>> #7  0x00007ffff61604dc in ?? () from /usr/lib/libpq.so.5
>>      ==> pqsecure_open_client
>> #8  0x00007ffff61525ce in PQconnectPoll () from /usr/lib/libpq.so.5
>> #9  0x00007ffff6152f5e in ?? () from /usr/lib/libpq.so.5
>>      ==> connectDBComplete
>> #10 0x00007ffff6153c5f in PQconnectdb () from /usr/lib/libpq.so.5
>> #11 0x0000000000f9b518 in sccR_info ()
>> #12 0x0000000000000000 in ?? ()
>>
>> So it seems everything is waiting for a lock on a mutex from
>> pq_lockarray (in fe-secure.c@846). Does anybody have any idea how this
>> can happen? Is this something we're doing wrong (I hope so) or a bug
>> in libpq?
>>
>> Some background: this happens only after a couple of thousand requests
>> (each doing about 15 database calls), with occasional other requests
>> coming in at the same time. Our server uses a Haskell binding to libpq
>> (HDBC [1] and HDBC-postgresql [2]). Both client and server run on the
>> same machine, running 64bit Ubuntu 10.04. The database version is
>> "PostgreSQL 8.4.7 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real
>> (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit". I'm not sure how to determine
>> the libpq version, but it is the most recent that comes with this
>> ubuntu. The changelogs for Ubuntu suggest 8.4.7 as well. Connections
>> are via TCP/IP to 127.0.0.1 with SSL turned on. The machine is under
>> some CPU load when this happens. There is plenty of free memory.
>>
>> When I turned off SSL or connect via domain sockets, we got different
>> errors that are possibly related: occasionally, the connection between
>> client (our app) and server (database) is lost. On the client, we get:
>>
>>    connectPostgreSQL: server closed the connection unexpectedly
>>    This probably means the server terminated abnormally
>>    before or while processing the request.
>>
>> and on the server:
>>
>>    could not send data to client: Broken pipe
>>
>> There is no further context around these messages.
>>
>> Any help would be greatly appreciated.
>
> How did you initialize ssl?   You are waiting inside a lock that is
> getting set up inside the crypto library.  Unless you are having some
> type of library initialization issue, I'm suspicious the problem is
> really inside libpq.  Is your application multithreaded, and if so are
> you properly synchronizing access to the connection object, etc?

What do you mean exactly with "How did you initialize ssl"? I found
[1], which I did not know about. This seems to be a very non-local
problem: if one of our dependencies initializes ssl, and I use libpq
as well, this will go wrong. I've done a quick look through all our
dependencies, and none seem to use libcrypto or libssl.

Our application is definitely multithreaded, as it is a web
application. But every database transaction creates a new connection
object. They are never shared between threads.

The problem is very hard to reproduce. I've taken all queries that
were performed when I last reproduced it, and have only those queries
(and inserts/updates) running in two concurrent loops, but so far,
that hasn't reproduced the problem yet. A couple of hours of running
our application with a script performing requests against it can
reproduce it, though.

If this seems to be a problem inside libpq, should I create a bug
report? I'm hesitant, as I don't have any steps to reproduce.

--
Erik Hesselink
http://silkapp.com

[1] http://www.postgresql.org/docs/8.4/static/libpq-ssl.html#LIBPQ-SSL-INITIALIZE

Re: Deadlock in libpq

From
Merlin Moncure
Date:
On Thu, Mar 24, 2011 at 9:07 AM, Erik Hesselink <hesselink@gmail.com> wrote:
> On Thu, Mar 24, 2011 at 14:23, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Thu, Mar 24, 2011 at 4:17 AM, Erik Hesselink <hesselink@gmail.com> wrote:
>>> Hi,
>>>
>>> We're getting a deadlock in our application (a web application with a
>>> PostgreSQL backend) which I've traced to libpq. I've started our
>>> application in gdb, and when it hangs, I've inspected the backtraces.
>>> I've found a couple of threads I can account for (listening for new
>>> connections, background processes) and 77 threads waiting for a mutex
>>> lock:
>>>
>>> #0  0x00007ffff523d464 in __lll_lock_wait () from /lib/libpthread.so.0
>>> #1  0x00007ffff52385d9 in _L_lock_953 () from /lib/libpthread.so.0
>>> #2  0x00007ffff52383fb in pthread_mutex_lock () from /lib/libpthread.so.0
>>> #3  0x00007ffff6160650 in ?? () from /usr/lib/libpq.so.5
>>>      ==> pg_lockingcallback
>>> #4  0x00007ffff440b791 in ?? () from /lib/libcrypto.so.0.9.8
>>> #5  0x00007ffff440bcc9 in ?? () from /lib/libcrypto.so.0.9.8
>>> #6  0x00007ffff47652fb in SSL_new () from /lib/libssl.so.0.9.8
>>> #7  0x00007ffff61604dc in ?? () from /usr/lib/libpq.so.5
>>>      ==> pqsecure_open_client
>>> #8  0x00007ffff61525ce in PQconnectPoll () from /usr/lib/libpq.so.5
>>> #9  0x00007ffff6152f5e in ?? () from /usr/lib/libpq.so.5
>>>      ==> connectDBComplete
>>> #10 0x00007ffff6153c5f in PQconnectdb () from /usr/lib/libpq.so.5
>>> #11 0x0000000000f9b518 in sccR_info ()
>>> #12 0x0000000000000000 in ?? ()
>>>
>>> So it seems everything is waiting for a lock on a mutex from
>>> pq_lockarray (in fe-secure.c@846). Does anybody have any idea how this
>>> can happen? Is this something we're doing wrong (I hope so) or a bug
>>> in libpq?
>>>
>>> Some background: this happens only after a couple of thousand requests
>>> (each doing about 15 database calls), with occasional other requests
>>> coming in at the same time. Our server uses a Haskell binding to libpq
>>> (HDBC [1] and HDBC-postgresql [2]). Both client and server run on the
>>> same machine, running 64bit Ubuntu 10.04. The database version is
>>> "PostgreSQL 8.4.7 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real
>>> (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit". I'm not sure how to determine
>>> the libpq version, but it is the most recent that comes with this
>>> ubuntu. The changelogs for Ubuntu suggest 8.4.7 as well. Connections
>>> are via TCP/IP to 127.0.0.1 with SSL turned on. The machine is under
>>> some CPU load when this happens. There is plenty of free memory.
>>>
>>> When I turned off SSL or connect via domain sockets, we got different
>>> errors that are possibly related: occasionally, the connection between
>>> client (our app) and server (database) is lost. On the client, we get:
>>>
>>>    connectPostgreSQL: server closed the connection unexpectedly
>>>    This probably means the server terminated abnormally
>>>    before or while processing the request.
>>>
>>> and on the server:
>>>
>>>    could not send data to client: Broken pipe
>>>
>>> There is no further context around these messages.
>>>
>>> Any help would be greatly appreciated.
>>
>> How did you initialize ssl?   You are waiting inside a lock that is
>> getting set up inside the crypto library.  Unless you are having some
>> type of library initialization issue, I'm suspicious the problem is
>> really inside libpq.  Is your application multithreaded, and if so are
>> you properly synchronizing access to the connection object, etc?
>
> What do you mean exactly with "How did you initialize ssl"? I found
> [1], which I did not know about. This seems to be a very non-local
> problem: if one of our dependencies initializes ssl, and I use libpq
> as well, this will go wrong. I've done a quick look through all our
> dependencies, and none seem to use libcrypto or libssl.

*something* must be initializing ssl, or you can't make secure
connections from libpq.  you need to find out which pq ssl init
function is begin called, when it is being called, and with what
arguments. One of the main things PQInitSSL does is set up a lock
vector which it passes to the crypto library.  The fact you are having
blocking issues around those locks is suggesting SSL was not set up
properly, something happened after being set up so that the locks are
no longer good, you have application thread issue (although that
sounds unlikely), or (least likely worst case) there is a bug in
crypto.

merlin

Re: Deadlock in libpq

From
Erik Hesselink
Date:
On Thu, Mar 24, 2011 at 15:21, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Mar 24, 2011 at 9:07 AM, Erik Hesselink <hesselink@gmail.com> wrote:
>> On Thu, Mar 24, 2011 at 14:23, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> On Thu, Mar 24, 2011 at 4:17 AM, Erik Hesselink <hesselink@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> We're getting a deadlock in our application (a web application with a
>>>> PostgreSQL backend) which I've traced to libpq. I've started our
>>>> application in gdb, and when it hangs, I've inspected the backtraces.
>>>> I've found a couple of threads I can account for (listening for new
>>>> connections, background processes) and 77 threads waiting for a mutex
>>>> lock:
>>>>
>>>> #0  0x00007ffff523d464 in __lll_lock_wait () from /lib/libpthread.so.0
>>>> #1  0x00007ffff52385d9 in _L_lock_953 () from /lib/libpthread.so.0
>>>> #2  0x00007ffff52383fb in pthread_mutex_lock () from /lib/libpthread.so.0
>>>> #3  0x00007ffff6160650 in ?? () from /usr/lib/libpq.so.5
>>>>      ==> pg_lockingcallback
>>>> #4  0x00007ffff440b791 in ?? () from /lib/libcrypto.so.0.9.8
>>>> #5  0x00007ffff440bcc9 in ?? () from /lib/libcrypto.so.0.9.8
>>>> #6  0x00007ffff47652fb in SSL_new () from /lib/libssl.so.0.9.8
>>>> #7  0x00007ffff61604dc in ?? () from /usr/lib/libpq.so.5
>>>>      ==> pqsecure_open_client
>>>> #8  0x00007ffff61525ce in PQconnectPoll () from /usr/lib/libpq.so.5
>>>> #9  0x00007ffff6152f5e in ?? () from /usr/lib/libpq.so.5
>>>>      ==> connectDBComplete
>>>> #10 0x00007ffff6153c5f in PQconnectdb () from /usr/lib/libpq.so.5
>>>> #11 0x0000000000f9b518 in sccR_info ()
>>>> #12 0x0000000000000000 in ?? ()
>>>>
>>>> So it seems everything is waiting for a lock on a mutex from
>>>> pq_lockarray (in fe-secure.c@846). Does anybody have any idea how this
>>>> can happen? Is this something we're doing wrong (I hope so) or a bug
>>>> in libpq?
>>>>
>>>> Some background: this happens only after a couple of thousand requests
>>>> (each doing about 15 database calls), with occasional other requests
>>>> coming in at the same time. Our server uses a Haskell binding to libpq
>>>> (HDBC [1] and HDBC-postgresql [2]). Both client and server run on the
>>>> same machine, running 64bit Ubuntu 10.04. The database version is
>>>> "PostgreSQL 8.4.7 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real
>>>> (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit". I'm not sure how to determine
>>>> the libpq version, but it is the most recent that comes with this
>>>> ubuntu. The changelogs for Ubuntu suggest 8.4.7 as well. Connections
>>>> are via TCP/IP to 127.0.0.1 with SSL turned on. The machine is under
>>>> some CPU load when this happens. There is plenty of free memory.
>>>>
>>>> When I turned off SSL or connect via domain sockets, we got different
>>>> errors that are possibly related: occasionally, the connection between
>>>> client (our app) and server (database) is lost. On the client, we get:
>>>>
>>>>    connectPostgreSQL: server closed the connection unexpectedly
>>>>    This probably means the server terminated abnormally
>>>>    before or while processing the request.
>>>>
>>>> and on the server:
>>>>
>>>>    could not send data to client: Broken pipe
>>>>
>>>> There is no further context around these messages.
>>>>
>>>> Any help would be greatly appreciated.
>>>
>>> How did you initialize ssl?   You are waiting inside a lock that is
>>> getting set up inside the crypto library.  Unless you are having some
>>> type of library initialization issue, I'm suspicious the problem is
>>> really inside libpq.  Is your application multithreaded, and if so are
>>> you properly synchronizing access to the connection object, etc?
>>
>> What do you mean exactly with "How did you initialize ssl"? I found
>> [1], which I did not know about. This seems to be a very non-local
>> problem: if one of our dependencies initializes ssl, and I use libpq
>> as well, this will go wrong. I've done a quick look through all our
>> dependencies, and none seem to use libcrypto or libssl.
>
> *something* must be initializing ssl, or you can't make secure
> connections from libpq.  you need to find out which pq ssl init
> function is begin called, when it is being called, and with what
> arguments. One of the main things PQInitSSL does is set up a lock
> vector which it passes to the crypto library.  The fact you are having
> blocking issues around those locks is suggesting SSL was not set up
> properly, something happened after being set up so that the locks are
> no longer good, you have application thread issue (although that
> sounds unlikely), or (least likely worst case) there is a bug in
> crypto.

From the postgresql documentation I linked to in my last post, it
seems that if I do not call PQinitOpenSSL and I do not initialize the
libraries myself, libpq will do it for me. Is that correct? If so,
then that is what is happening in my case.

Regards,

--
Erik Hesselink
http://silkapp.com

Re: Deadlock in libpq

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> *something* must be initializing ssl, or you can't make secure
> connections from libpq.  you need to find out which pq ssl init
> function is begin called, when it is being called, and with what
> arguments. One of the main things PQInitSSL does is set up a lock
> vector which it passes to the crypto library.  The fact you are having
> blocking issues around those locks is suggesting SSL was not set up
> properly, something happened after being set up so that the locks are
> no longer good, you have application thread issue (although that
> sounds unlikely), or (least likely worst case) there is a bug in
> crypto.

Given that Erik still sees trouble when not using SSL, my money would
be on the idea that somewhere he's got two threads trying to use the
same connection object concurrently.  The SSL confusion is just one
symptom of that.

            regards, tom lane

Re: Deadlock in libpq

From
Erik Hesselink
Date:
On Thu, Mar 24, 2011 at 16:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> *something* must be initializing ssl, or you can't make secure
>> connections from libpq.  you need to find out which pq ssl init
>> function is begin called, when it is being called, and with what
>> arguments. One of the main things PQInitSSL does is set up a lock
>> vector which it passes to the crypto library.  The fact you are having
>> blocking issues around those locks is suggesting SSL was not set up
>> properly, something happened after being set up so that the locks are
>> no longer good, you have application thread issue (although that
>> sounds unlikely), or (least likely worst case) there is a bug in
>> crypto.
>
> Given that Erik still sees trouble when not using SSL, my money would
> be on the idea that somewhere he's got two threads trying to use the
> same connection object concurrently.  The SSL confusion is just one
> symptom of that.

I've just looked at all the spots in the code where I fork a new
thread, and nowhere do I share a connection between threads. I've also
checked the source for HDBC-postgresql, and it also creates a new
connection each time I ask it to (i.e. no caching/pooling etc) by
calling PQconnectdb. So I'm pretty sure I'm not sharing a connection
between two threads. This is a web application, so there is pretty
good isolation between different threads (almost no shared state).

--
Erik Hesselink
http://silkapp.com

Re: Deadlock in libpq

From
Merlin Moncure
Date:
On Thu, Mar 24, 2011 at 10:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> *something* must be initializing ssl, or you can't make secure
>> connections from libpq.  you need to find out which pq ssl init
>> function is begin called, when it is being called, and with what
>> arguments. One of the main things PQInitSSL does is set up a lock
>> vector which it passes to the crypto library.  The fact you are having
>> blocking issues around those locks is suggesting SSL was not set up
>> properly, something happened after being set up so that the locks are
>> no longer good, you have application thread issue (although that
>> sounds unlikely), or (least likely worst case) there is a bug in
>> crypto.
>
> Given that Erik still sees trouble when not using SSL, my money would
> be on the idea that somewhere he's got two threads trying to use the
> same connection object concurrently.  The SSL confusion is just one
> symptom of that.

maybe -- but aside from the fact the OP is disclaiming that case, I'm
not convinced you would see strictly 'broken pipe' error resulting
from connection synchronization problems (surely, you'd see other
problems eventually). Maybe one of his threads is dying and
aggravating some other client side problem.  Also, he's getting
blocked in connection establishment.

He needs to rule out the most obvious problem first -- PQInitSSL being
called improperly or at the wrong time.  OP: It's a library wide
setting and must be called before the first connection is established
and only once.  Perhaps the HDBC driver has some bad logic there.

merlin

Re: Deadlock in libpq

From
Erik Hesselink
Date:
On Thu, Mar 24, 2011 at 16:43, Merlin Moncure <mmoncure@gmail.com> wrote:
> He needs to rule out the most obvious problem first -- PQInitSSL being
> called improperly or at the wrong time.  OP: It's a library wide
> setting and must be called before the first connection is established
> and only once.  Perhaps the HDBC driver has some bad logic there.

I just grepped the source, and PQInitSSL doesn't appear once. So I
think it is never called at all. I'll try adding it to our application
startup code, and see if that fixes the problem.

This doesn't explain the broken connections without SSL though. I
think if my threads were dying, I'd see a log message: there are top
level exception handlers that log to email, file and the console. The
connection code also has a handler that closes the connection in case
of an exception.

--
Erik Hesselink
http://silkapp.com

Re: Deadlock in libpq

From
Merlin Moncure
Date:
On Thu, Mar 24, 2011 at 10:54 AM, Erik Hesselink <hesselink@gmail.com> wrote:
> On Thu, Mar 24, 2011 at 16:43, Merlin Moncure <mmoncure@gmail.com> wrote:
>> He needs to rule out the most obvious problem first -- PQInitSSL being
>> called improperly or at the wrong time.  OP: It's a library wide
>> setting and must be called before the first connection is established
>> and only once.  Perhaps the HDBC driver has some bad logic there.
>
> I just grepped the source, and PQInitSSL doesn't appear once. So I
> think it is never called at all. I'll try adding it to our application
> startup code, and see if that fixes the problem.
>
> This doesn't explain the broken connections without SSL though. I
> think if my threads were dying, I'd see a log message: there are top
> level exception handlers that log to email, file and the console. The
> connection code also has a handler that closes the connection in case
> of an exception.

Also check PQinitOpenSSL.  For connections without SSL, all we know
for sure is that the tcp connection is unexpectedly getting dropped.
This could happen for all kinds of reasons -- like a miscreant router
deciding the connection is idle too long (this could in fact be the
issue aggravating the SSL lock issue, but that's worse and needs to be
caught and extirpated).

merlin

Re: Deadlock in libpq

From
Erik Hesselink
Date:
On Thu, Mar 24, 2011 at 17:18, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Mar 24, 2011 at 10:54 AM, Erik Hesselink <hesselink@gmail.com> wrote:
>> On Thu, Mar 24, 2011 at 16:43, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> He needs to rule out the most obvious problem first -- PQInitSSL being
>>> called improperly or at the wrong time.  OP: It's a library wide
>>> setting and must be called before the first connection is established
>>> and only once.  Perhaps the HDBC driver has some bad logic there.
>>
>> I just grepped the source, and PQInitSSL doesn't appear once. So I
>> think it is never called at all. I'll try adding it to our application
>> startup code, and see if that fixes the problem.
>>
>> This doesn't explain the broken connections without SSL though. I
>> think if my threads were dying, I'd see a log message: there are top
>> level exception handlers that log to email, file and the console. The
>> connection code also has a handler that closes the connection in case
>> of an exception.
>
> Also check PQinitOpenSSL.  For connections without SSL, all we know
> for sure is that the tcp connection is unexpectedly getting dropped.
> This could happen for all kinds of reasons -- like a miscreant router
> deciding the connection is idle too long (this could in fact be the
> issue aggravating the SSL lock issue, but that's worse and needs to be
> caught and extirpated).

PQinitOpenSSL also doesn't appear. I'll ask the HDBC maintainer if
he's considered this issue.

As far as connections getting dropped: yes, this sounds reasonable,
but given that both the client and the server are running on the same
machine, will connections (to 127.0.0.1) really be dropped once every
100.000 or so?

--
Erik Hesselink
http://silkapp.com

Re: Deadlock in libpq

From
Merlin Moncure
Date:
On Thu, Mar 24, 2011 at 11:27 AM, Erik Hesselink <hesselink@gmail.com> wrote:
> On Thu, Mar 24, 2011 at 17:18, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Thu, Mar 24, 2011 at 10:54 AM, Erik Hesselink <hesselink@gmail.com> wrote:
>>> On Thu, Mar 24, 2011 at 16:43, Merlin Moncure <mmoncure@gmail.com> wrote:
>>>> He needs to rule out the most obvious problem first -- PQInitSSL being
>>>> called improperly or at the wrong time.  OP: It's a library wide
>>>> setting and must be called before the first connection is established
>>>> and only once.  Perhaps the HDBC driver has some bad logic there.
>>>
>>> I just grepped the source, and PQInitSSL doesn't appear once. So I
>>> think it is never called at all. I'll try adding it to our application
>>> startup code, and see if that fixes the problem.
>>>
>>> This doesn't explain the broken connections without SSL though. I
>>> think if my threads were dying, I'd see a log message: there are top
>>> level exception handlers that log to email, file and the console. The
>>> connection code also has a handler that closes the connection in case
>>> of an exception.
>>
>> Also check PQinitOpenSSL.  For connections without SSL, all we know
>> for sure is that the tcp connection is unexpectedly getting dropped.
>> This could happen for all kinds of reasons -- like a miscreant router
>> deciding the connection is idle too long (this could in fact be the
>> issue aggravating the SSL lock issue, but that's worse and needs to be
>> caught and extirpated).
>
> PQinitOpenSSL also doesn't appear. I'll ask the HDBC maintainer if
> he's considered this issue.
>
> As far as connections getting dropped: yes, this sounds reasonable,
> but given that both the client and the server are running on the same
> machine, will connections (to 127.0.0.1) really be dropped once every
> 100.000 or so?

No, don't bother, I forgot the default behavior was to do both, which
is probably correct in your case. InitSSL just signals if you want
them to be done.

libpq refcounts connections and does SSL initialization when
connection count goes from 0->1 and destruction when it goes from
1->0.  This operation is protected with mutex (you *are* using thread
safe libpq, right?).

merlin

Re: Deadlock in libpq

From
Merlin Moncure
Date:
On Thu, Mar 24, 2011 at 11:52 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> As far as connections getting dropped: yes, this sounds reasonable,
>> but given that both the client and the server are running on the same
>> machine, will connections (to 127.0.0.1) really be dropped once every
>> 100.000 or so?
>
> No, don't bother, I forgot the default behavior was to do both, which
> is probably correct in your case. InitSSL just signals if you want
> them to be done.
>
> libpq refcounts connections and does SSL initialization when
> connection count goes from 0->1 and destruction when it goes from
> 1->0.  This operation is protected with mutex (you *are* using thread
> safe libpq, right?).

meh, you have to be -- the locking stuff only gets set up w/thread
safe libpq.  It's basically impossible for that refcount to get thrown
off aiui.  hm.  I'm going back to thinking tom was right and this is
threading issue in the app...maybe there is something you haven't
considered?

merlin

merlin

Re: Deadlock in libpq

From
Merlin Moncure
Date:
On Thu, Mar 24, 2011 at 11:57 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Mar 24, 2011 at 11:52 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> As far as connections getting dropped: yes, this sounds reasonable,
>>> but given that both the client and the server are running on the same
>>> machine, will connections (to 127.0.0.1) really be dropped once every
>>> 100.000 or so?
>>
>> No, don't bother, I forgot the default behavior was to do both, which
>> is probably correct in your case. InitSSL just signals if you want
>> them to be done.
>>
>> libpq refcounts connections and does SSL initialization when
>> connection count goes from 0->1 and destruction when it goes from
>> 1->0.  This operation is protected with mutex (you *are* using thread
>> safe libpq, right?).
>
> meh, you have to be -- the locking stuff only gets set up w/thread
> safe libpq.  It's basically impossible for that refcount to get thrown
> off aiui.  hm.  I'm going back to thinking tom was right and this is
> threading issue in the app...maybe there is something you haven't
> considered?

hm, ISTM (I don't know haskell) that the hdbc driver isn't doing any
type of synchronization at all unless it is using a non thread safe
libpq...and in that case it uses a global mutex.  That doesn't look
correct -- the hdbc driver should be locking around the PGconn always,
and globally if you're stuck with a non thread safe libpq.

merlin

Re: Deadlock in libpq

From
Erik Hesselink
Date:
On Thu, Mar 24, 2011 at 20:38, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Mar 24, 2011 at 11:57 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Thu, Mar 24, 2011 at 11:52 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>>> As far as connections getting dropped: yes, this sounds reasonable,
>>>> but given that both the client and the server are running on the same
>>>> machine, will connections (to 127.0.0.1) really be dropped once every
>>>> 100.000 or so?
>>>
>>> No, don't bother, I forgot the default behavior was to do both, which
>>> is probably correct in your case. InitSSL just signals if you want
>>> them to be done.
>>>
>>> libpq refcounts connections and does SSL initialization when
>>> connection count goes from 0->1 and destruction when it goes from
>>> 1->0.  This operation is protected with mutex (you *are* using thread
>>> safe libpq, right?).
>>
>> meh, you have to be -- the locking stuff only gets set up w/thread
>> safe libpq.  It's basically impossible for that refcount to get thrown
>> off aiui.  hm.  I'm going back to thinking tom was right and this is
>> threading issue in the app...maybe there is something you haven't
>> considered?
>
> hm, ISTM (I don't know haskell) that the hdbc driver isn't doing any
> type of synchronization at all unless it is using a non thread safe
> libpq...and in that case it uses a global mutex.  That doesn't look
> correct -- the hdbc driver should be locking around the PGconn always,
> and globally if you're stuck with a non thread safe libpq.

No, that is not the case. If libpq is not thread safe, the library
uses a global lock. If it is thread safe, it uses a single lock per
connection. This lock is created on connect, and locked before
executing a statement. So it seems the library is doing the correct
things.

(And yes, libpq is thread safe, I just checked).

Erik

Re: Deadlock in libpq

From
Merlin Moncure
Date:
On Fri, Mar 25, 2011 at 3:26 AM, Erik Hesselink <hesselink@gmail.com> wrote:
>> hm, ISTM (I don't know haskell) that the hdbc driver isn't doing any
>> type of synchronization at all unless it is using a non thread safe
>> libpq...and in that case it uses a global mutex.  That doesn't look
>> correct -- the hdbc driver should be locking around the PGconn always,
>> and globally if you're stuck with a non thread safe libpq.
>
> No, that is not the case. If libpq is not thread safe, the library
> uses a global lock. If it is thread safe, it uses a single lock per
> connection. This lock is created on connect, and locked before
> executing a statement. So it seems the library is doing the correct
> things.
>
> (And yes, libpq is thread safe, I just checked).

hm, I'm stumped.  Are you sure nothing else is using the crypto
library?  There is an unlikely but possible case that you initialized
crypto locks over somebody else.

*something* is happening here, but I have no idea what.   I'm very
skeptical it's a locking issue in libpq itself, because there is so
little going on beyond counting the connections.  Let me ask you this:
how much does your connection count range over time?  would it be
possible to reserve a connection that stays open all the time and see
if the issue re-occurs? (a wild stab in the dark, but I'm curious of
ssl re-initialization is causing your problem if you always have at
least one connection open, it wont re-initialize).

merlin

Re: Deadlock in libpq

From
Erik Hesselink
Date:
On Fri, Mar 25, 2011 at 21:21, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, Mar 25, 2011 at 3:26 AM, Erik Hesselink <hesselink@gmail.com> wrote:
>>> hm, ISTM (I don't know haskell) that the hdbc driver isn't doing any
>>> type of synchronization at all unless it is using a non thread safe
>>> libpq...and in that case it uses a global mutex.  That doesn't look
>>> correct -- the hdbc driver should be locking around the PGconn always,
>>> and globally if you're stuck with a non thread safe libpq.
>>
>> No, that is not the case. If libpq is not thread safe, the library
>> uses a global lock. If it is thread safe, it uses a single lock per
>> connection. This lock is created on connect, and locked before
>> executing a statement. So it seems the library is doing the correct
>> things.
>>
>> (And yes, libpq is thread safe, I just checked).
>
> hm, I'm stumped.  Are you sure nothing else is using the crypto
> library?  There is an unlikely but possible case that you initialized
> crypto locks over somebody else.

No, I'm not sure at all. We have quite a few dependencies, since
Haskell libraries are very small. I've looked through all available
Haskell libraries and none that we use seem to depend on libcrypto or
libssl. However, it is possible that we transitively depend on
libraries that do. It is also possible that we depend on libraries
that use a C library that uses libcrypto or libssl (like we do with
libpq). It will take some time to analyze the full dependency tree.

> *something* is happening here, but I have no idea what.   I'm very
> skeptical it's a locking issue in libpq itself, because there is so
> little going on beyond counting the connections.  Let me ask you this:
> how much does your connection count range over time?  would it be
> possible to reserve a connection that stays open all the time and see
> if the issue re-occurs? (a wild stab in the dark, but I'm curious of
> ssl re-initialization is causing your problem if you always have at
> least one connection open, it wont re-initialize).

My tests were with two clients. Each client performs database queries
sequentially, and a background thread is also opening database
connections, so that makes 0 to 3 concurrent connections.

I will test with one connection always open; that should be very
simple to set up. I've also set up a connection pooler (pgbouncer) on
our live server, and turned off SSL. This seems to prevent/hide the
problem there at least, but I will continue to investigate on a
different server.

Thanks for all the help so far,

--
Erik Hesselink
http://silkapp.com