Thread: Database connections seemingly hanging

Database connections seemingly hanging

From
Fredrik.HuitfeldtMadsen@schneider-electric.com
Date:
Hi All,

We have a setup where 2 JBoss (5.1) servers communicate with 1 instance of PgPool (3.04), which again communicates with 2 Postgresql (8.4) servers. The JBoss servers host some Java code for us and as part of that they run some quartz jobs.

These jobs are triggered right after startup, and as part of that something appears to get stuck. When we inspect pg_locks, we can see a virtual transaction that has all of its desired locks granted but nevertheless seems to be stuck. When we inspect pg_stat_activity, the corresponding process still appears to be waiting for its query (SELECT ... FOR UPDATE) to finish.
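For reference, we correlate the two views with a query along these lines (the exact query we use may differ slightly; column names as in the 8.4 catalogs):

    SELECT l.pid, l.locktype, l.mode, l.granted,
           a.waiting, a.current_query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.procpid = l.pid
    ORDER BY l.pid, l.granted;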

The locking transaction is described here: http://pastebin.com/3pEn6vPe

We know that the quartz thread is attempting to obtain a row share lock. We know that we have enough connections available in postgres and in pgpool. We also know that the issue occurs much more frequently when we enable postgres statement logging. We assume that this is because postgres becomes slower as a result of the additional logging. When we look at the server thread dump, we can see that all quartz threads are either sleeping or waiting for postgres.

A thread dump of the relevant quartz threads is described here: http://pastebin.com/iPhuFLrM

It is important to note that the issue does not only occur with quartz jobs, but it is where we see it most frequently. This is likely to be due to the fact that it is the place where we have the highest level of concurrency.

We suspect that a connection to the database acquires its locks but somehow does not return to the application. If this is true, it would either be a postgresql or a pgpool problem. We would appreciate any help in further debugging or resolving the situation.

Kind regards,
Fredrik

Re: Database connections seemingly hanging

From
Tom Lane
Date:
Fredrik.HuitfeldtMadsen@schneider-electric.com writes:
> We have a setup where 2 JBoss (5.1) servers communicate with 1 instance of
> PgPool (3.04), which again communicates with 2 Postgresql (8.4) servers.
> The JBoss servers host some Java code for us and as part of that they run
> some quartz jobs.

> These jobs are triggered right after startup, and as part of that
> something appears to get stuck. When we inspect pg_locks, we can see a
> virtual transaction that has all of its desired locks granted but
> nevertheless seems to be stuck. When we inspect pg_stat_activity, the
> corresponding process still appears to be waiting for its query
> (SELECT ... FOR UPDATE) to finish.

> The locking transaction is described here: http://pastebin.com/3pEn6vPe

What that shows is several sessions running SELECT FOR UPDATE, but none
of them seem to be waiting.  What else is going on?  In particular, are
there any idle-in-transaction sessions?  Also, would any of these
SELECTs return enough rows that the sessions might be blocked trying to
send data back to their clients?  That wouldn't show as waiting = true,
though I think you could detect it by strace'ing the backends to see if
they are stopped in a send() kernel call.
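On 8.4 such sessions show up in pg_stat_activity with current_query reading
'<IDLE> in transaction', so something along these lines should list them
(just one way of spotting them):

    SELECT procpid, usename, waiting, xact_start, current_query
    FROM pg_stat_activity
    WHERE current_query = '<IDLE> in transaction';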

> We suspect that a connection to the database acquires its locks but
> somehow does not return to the application. If this is true, it would
> either be a postgresql or a pgpool problem. We would appreciate any help
> in further debugging or resolving the situation.

It seems like a good guess would be that you have a deadlock situation
that cannot be detected by the database because part of the blockage is
on the client side --- that is, client thread A is waiting on its
database query, that query is waiting on some lock held by client thread
B's database session, and thread B is somehow waiting for A on the
client side.  It's not too hard to get into this type of situation when
B is sitting on an open idle-in-transaction session: B isn't waiting for
the database to do anything, but is doing something itself, and so it's
not obvious that there's any risk.  Thus my question about what idle
sessions there might be.  This does usually lead to a visibly waiting
database session for client A, though, so it's probably too simple as an
explanation for your issue.  We have seen some harder-to-debug cases
where the database sessions weren't visibly "waiting" because they were
blocked on client I/O, so maybe you've got something like that.

Another line of thought to pursue is possible misuse of pgpool.  If
pgpool doesn't realize you're inside a transaction and swaps the
connection to some other client thread, all kinds of confusion ensues.

Also, I hope you're running a reasonably recent 8.4.x minor release.
A quick look through the commit logs didn't show anything about deadlock
fixes in the 8.4 branch, but I might have missed something that was
fixed a long time ago.

            regards, tom lane


Re: Database connections seemingly hanging

From
Fredrik.HuitfeldtMadsen@schneider-electric.com
Date:
Hi All,

@ Tom
Thank you for your response. While working on your suggestions, we seem to have found the cause of our problems.

@ Yugo
Thank you for your response. We are running pgpool in replication mode with load balancing enabled. If you have further questions to aid in debugging the situation, please let me know.


It seems that the root cause was that pgpool acquired the locks in the wrong order. If the resource is called A, it seems that pgpool allows child X to acquire A on node1 while, at the same time, child Y acquires A on node2. This leaves X wanting A on node2 and Y wanting A on node1, so both children hang indefinitely. It also leaves both Postgres instances blissfully unaware of the deadlock, which is why it escapes Postgres's deadlock detection.
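As a purely hypothetical illustration (the table name and row are made up; this is not our actual application query), the interleaving we believe we are seeing is:

    -- row id = 1 of a hypothetical table "resource" stands in for lock A
    -- child X via node1: BEGIN; SELECT * FROM resource WHERE id = 1 FOR UPDATE;  -- granted
    -- child Y via node2: BEGIN; SELECT * FROM resource WHERE id = 1 FOR UPDATE;  -- granted
    -- child X via node2: SELECT * FROM resource WHERE id = 1 FOR UPDATE;         -- blocks behind Y
    -- child Y via node1: SELECT * FROM resource WHERE id = 1 FOR UPDATE;         -- blocks behind X
    -- X now waits for Y on node2 and Y waits for X on node1, but neither node
    -- sees a deadlock locally, so neither cancels anything.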

We have included a summary of the system state here:
http://pastebin.com/9f6gjxLA

We have used netstat to trace the connections between the pgpool children and the Postgres backends. pgpool child 7606 has acquired a lock on the .204 server but waits for the same lock on the .202 server. At the same time, pgpool child 7681 has the lock on the .202 server and waits for it on the .204 server. pgpool itself is running on the .204 server.

If anyone is interested, we have included the full outputs in the following pastebins:

pg_locks on 10.216.73.202: http://pastebin.com/uRQh5Env
pg_locks on 10.216.73.204: http://pastebin.com/BXpirVQ2
netstat -p on 10.216.73.202: http://pastebin.com/b9kV7Wz4
netstat -p on 10.216.73.204: http://pastebin.com/tPz8gwRG

Kind regards,
Fredrik & friends





Re: [pgpool-general: 1315] Re: Database connections seemingly hanging

From
Tatsuo Ishii
Date:
> It seems that the root cause was that pgpool acquired the locks in the
> wrong order. If the resource is called A it seems that pgpool allows child
> X to acquire A on node1 and at the same time, child Y acquires A on node2.
> This leaves X wanting A on node2 and Y wanting A on node1. This leaves
> both children hanging indefinitely. It also leaves both postgres'es
> blissfully unaware of the deadlock, whereby it escapes postgres'es
> deadlock detection.

That's hard for me to believe. For any query, pgpool sends it to the
master node (node1 in your case) first and waits until that node
returns a response, by using select(2) on the socket to the PostgreSQL
backend. Only after something comes back on that socket does pgpool
issue the query to node2. So pgpool never sends a query to the master
node (node1) and node2 concurrently. This is a classical technique to
avoid a cross-node deadlock situation.

If your explanation were correct, pgpool would easily get into a
deadlock situation even with a simple pgbench query.

Could you please show me a self-contained test case?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: [pgpool-general: 1315] Re: Database connections seemingly hanging

From
Fredrik.HuitfeldtMadsen@schneider-electric.com
Date:
Hello Mr. Ishii,

We have attempted to create a self-contained test case, but have been unsuccessful so far. We understand how pgpool acquires locks in theory, but what we are seeing seems to be different. We have summarized our findings here: http://pastebin.com/9f6gjxLA

It seems that pgpool child 7606 communicates with process 26453 on the .202 node and process 17789 on the .204 node. We can see from the output from postgres that process 26453 is waiting for the lock on the .202 server, while 17789 has the lock on the .204 server. This means that pgpool child 7606 will wait until process 26453 can obtain the lock on the .202 server.

At the same time, we can see that pgpool child 7681 communicates with process 23451 on the .202 server and process 12464 on the .204 server. We can see from the output from postgres that the process 23451 has its lock on the .202 server but process 12464 is waiting for it on the .204 server. This means that pgpool child 7681 will wait until process 12464 can obtain the lock on the .204 server.

Since pgpool child 7606 via process 17789 has the lock on the .204 server, it blocks pgpool child 7681 from completing. Since pgpool child 7681 via process 23451 has the lock on the .202 server, it blocks pgpool child 7606 from completing. This seems to be a classic deadlock.
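
To summarize the mapping described above (taken from the pg_locks and netstat outputs):

    pgpool child | backend on .202     | backend on .204     | result
    -------------+---------------------+---------------------+------------------------------
    7606         | 26453 (waiting)     | 17789 (holds lock)  | blocked until .202 grants it
    7681         | 23451 (holds lock)  | 12464 (waiting)     | blocked until .204 grants it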

We have spent a fair amount of time debugging this situation, and we would really appreciate feedback on our situation.

Is there any information that would aid you in providing us with this kind of feedback?

Kind regards,
Fredrik & Friends



