Thread: Socket read fails
Hi there, wondering if anyone could shed some light on selects intermittently not returning any data on our production server.
Symptoms are:
- occurs in several different parts of our codebase but apparently its the same jdbc codepath [below]
- these codepaths execute fine the vast majority of the time
- thread blocks on read forever
- database doesnt show any query running.
- system has a couple of high load processes that run every day and we only see the errors there, though the majority of the systems' queries are done then.
Any suggestions for what I could do to fix or help diagnose the problem greatly appreciated. We saw these problems with v7, v8 and now seeming more frequently (ie every few days) with 8.1 jdbc and 8.1 server.
We run ES Linux 3.
cheers
Colin.
"Thread-571" prio=1 tid=0x085745b0 nid=0x124e runnable [759fe000..759ff87c]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read (SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:183)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:222)
at java.io.BufferedInputStream.read (BufferedInputStream.java:277)
- locked <0x847ac890> (a java.io.BufferedInputStream)
at org.postgresql.core.PGStream.Receive(PGStream.java:445)
at org.postgresql.core.PGStream.Receive (PGStream.java:427)
at org.postgresql.core.PGStream.ReceiveTupleV3(PGStream.java:369)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1283)
at org.postgresql.core.v3.QueryExecutorImpl.fetch (QueryExecutorImpl.java:1447)
- locked <0x847aa0b0> (a org.postgresql.core.v3.QueryExecutorImpl)
at org.postgresql.jdbc2.AbstractJdbc2ResultSet.next(AbstractJdbc2ResultSet.java:1840)
at org.ofbiz.entity.util.EntityListIterator.next(EntityListIterator.java:269)
Symptoms are:
- occurs in several different parts of our codebase but apparently its the same jdbc codepath [below]
- these codepaths execute fine the vast majority of the time
- thread blocks on read forever
- database doesnt show any query running.
- system has a couple of high load processes that run every day and we only see the errors there, though the majority of the systems' queries are done then.
Any suggestions for what I could do to fix or help diagnose the problem greatly appreciated. We saw these problems with v7, v8 and now seeming more frequently (ie every few days) with 8.1 jdbc and 8.1 server.
We run ES Linux 3.
cheers
Colin.
"Thread-571" prio=1 tid=0x085745b0 nid=0x124e runnable [759fe000..759ff87c]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read (SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:183)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:222)
at java.io.BufferedInputStream.read (BufferedInputStream.java:277)
- locked <0x847ac890> (a java.io.BufferedInputStream)
at org.postgresql.core.PGStream.Receive(PGStream.java:445)
at org.postgresql.core.PGStream.Receive (PGStream.java:427)
at org.postgresql.core.PGStream.ReceiveTupleV3(PGStream.java:369)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1283)
at org.postgresql.core.v3.QueryExecutorImpl.fetch (QueryExecutorImpl.java:1447)
- locked <0x847aa0b0> (a org.postgresql.core.v3.QueryExecutorImpl)
at org.postgresql.jdbc2.AbstractJdbc2ResultSet.next(AbstractJdbc2ResultSet.java:1840)
at org.ofbiz.entity.util.EntityListIterator.next(EntityListIterator.java:269)
For the benefit of anyone else who has this problem, it seems to be alleviated, at least in our case by rebooting the box every night. Not ideal by any means but better than having the application hang. Which kind of suggests some resource issue, investing with linux sar command though doesnt reveal any process or file limits being reached.
On 1/16/06, Colin Taylor <colin.taylor@gmail.com> wrote:
Hi there, wondering if anyone could shed some light on selects intermittently not returning any data on our production server.
Symptoms are:
- occurs in several different parts of our codebase but apparently its the same jdbc codepath [below]
- these codepaths execute fine the vast majority of the time
- thread blocks on read forever
- database doesnt show any query running.
- system has a couple of high load processes that run every day and we only see the errors there, though the majority of the systems' queries are done then.
Any suggestions for what I could do to fix or help diagnose the problem greatly appreciated. We saw these problems with v7, v8 and now seeming more frequently (ie every few days) with 8.1 jdbc and 8.1 server.
We run ES Linux 3.
cheers
Colin.
"Thread-571" prio=1 tid=0x085745b0 nid=0x124e runnable [759fe000..759ff87c]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read (SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:183)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:222)
at java.io.BufferedInputStream.read (BufferedInputStream.java:277)
- locked <0x847ac890> (a java.io.BufferedInputStream)
at org.postgresql.core.PGStream.Receive(PGStream.java:445)
at org.postgresql.core.PGStream.Receive (PGStream.java:427)
at org.postgresql.core.PGStream.ReceiveTupleV3(PGStream.java:369)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1283)
at org.postgresql.core.v3.QueryExecutorImpl.fetch (QueryExecutorImpl.java:1447)
- locked <0x847aa0b0> (a org.postgresql.core.v3.QueryExecutorImpl)
at org.postgresql.jdbc2.AbstractJdbc2ResultSet.next(AbstractJdbc2ResultSet.java:1840)
at org.ofbiz.entity.util.EntityListIterator.next(EntityListIterator.java:269)
On Jan 25, 2006, at 1:03 AM, Colin Taylor wrote: > For the benefit of anyone else who has this problem, it seems to > be alleviated, at least in our case by rebooting the box every > night. Not ideal by any means but better than having the > application hang. Which kind of suggests some resource issue, > investing with linux sar command though doesnt reveal any process > or file limits being reached. > We have been fighting this very same problem with our open source project (OpenNMS) for a several months now but only on systems with fairly large databases (~=2.5 GB). We recently built 8.1.2 on one customer's beefy Sun system (Sparc and Solaris 10) hoping to resolve the issue only to find it happening now more frequently now as well. 8.1.2's performance, while running, is much better btw. We have been through our database code over and over and we sure that we are closing all connections, statements, etc. though that doesn't seem to be the issue since our hangs on socketRead0 happen on a call to DriverManger.getConnection() (creating new connection) which appears to be different for your hang. Here is a recent stack trace: "OpenNMS.Poller.DefaultPollContext" prio=10 tid=0x006df638 nid=0xa5 runnable [0x3cffe000..0x3cfffbf0] at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java: 218) at java.io.BufferedInputStream.read(BufferedInputStream.java: 235) - locked <0x60fa45e8> (a java.io.BufferedInputStream) at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:254) at org.postgresql.core.v2.QueryExecutorImpl.processResults (QueryExecutorImpl.java:408) at org.postgresql.core.v2.QueryExecutorImpl.execute (QueryExecutorImpl.java:364) at org.postgresql.core.v2.QueryExecutorImpl.execute (QueryExecutorImpl.java:258) - locked <0x60fa4608> (a org.postgresql.core.v2.QueryExecutorImpl) at org.postgresql.core.v2.ConnectionFactoryImpl.runSetupQuery (ConnectionFactoryImpl.java:405) at org.postgresql.core.v2.ConnectionFactoryImpl.runInitialQueries (ConnectionFactoryImpl.java:423) at org.postgresql.core.v2.ConnectionFactoryImpl.openConnectionImpl (ConnectionFactoryImpl.java:83) at org.postgresql.core.ConnectionFactory.openConnection (ConnectionFactory.java:65) at org.postgresql.jdbc2.AbstractJdbc2Connection.<init> (AbstractJdbc2Connection.java:116) at org.postgresql.jdbc3.AbstractJdbc3Connection.<init> (AbstractJdbc3Connection.java:30) at org.postgresql.jdbc3.Jdbc3Connection.<init> (Jdbc3Connection.java:24) at org.postgresql.Driver.makeConnection(Driver.java:369) at org.postgresql.Driver.connect(Driver.java:245) at java.sql.DriverManager.getConnection(DriverManager.java:525) - locked <0x48cd17e8> (a java.lang.Class) at java.sql.DriverManager.getConnection(DriverManager.java:171) - locked <0x48cd17e8> (a java.lang.Class) at org.opennms.netmgt.config.DatabaseConnectionFactory.getConnection (DatabaseConnectionFactory.java:1125) at org.opennms.netmgt.utils.JDBCTemplate.doExecute (JDBCTemplate.java:102) at org.opennms.netmgt.utils.JDBCTemplate.execute (JDBCTemplate.java:82) at org.opennms.netmgt.poller.DefaultQueryManager.resolveOutage (DefaultQueryManager.java:531) at org.opennms.netmgt.poller.DefaultPollContext$2.run (DefaultPollContext.java:139) at org.opennms.netmgt.poller.pollables.PendingPollEvent.processPending (PendingPollEvent.java:94) at org.opennms.netmgt.poller.DefaultPollContext.onEvent (DefaultPollContext.java:184) - locked <0x5a373ed0> (a java.util.LinkedList) at org.opennms.netmgt.eventd.EventIpcManagerDefaultImpl $ListenerThread.run(EventIpcManagerDefaultImpl.java:169) at java.lang.Thread.run(Thread.java:595) This past summer, we had someone try to debug this on the system level and here is what he reported: ----------- begin debugging ------------------ Thread Dump showing thread who has locked DatabaseConnectionFactory: "EventQueueProcessor" prio=5 tid=0x008262d0 nid=0x3b runnable [239ff000..239ffc28] at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java: 183) at java.io.BufferedInputStream.read(BufferedInputStream.java: 201) - locked <0x8183b8a8> (a java.io.BufferedInputStream) at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:166) at org.postgresql.core.QueryExecutor.executeV3 (QueryExecutor.java:127) - locked <0x8183b8c8> (a org.postgresql.core.PGStream) at org.postgresql.core.QueryExecutor.execute (QueryExecutor.java:100) at org.postgresql.core.QueryExecutor.execute (QueryExecutor.java:43) at org.postgresql.jdbc1.AbstractJdbc1Connection.execSQL (AbstractJdbc1Connection.java:887) at org.postgresql.jdbc1.AbstractJdbc1Connection.openConnectionV3 (AbstractJdbc1Connection.java:505) at org.postgresql.jdbc1.AbstractJdbc1Connection.openConnection (AbstractJdbc1Connection.java:214) at org.postgresql.Driver.connect(Driver.java:139) at java.sql.DriverManager.getConnection(DriverManager.java:512) - locked <0xf1b8f260> (a java.lang.Class) at java.sql.DriverManager.getConnection(DriverManager.java:171) - locked <0xf1b8f260> (a java.lang.Class) at org.opennms.netmgt.config.DatabaseConnectionFactory.getConnection (DatabaseConnectionFactory.java:1120) - locked <0x328b2758> (a java.util.LinkedList) at org.opennms.netmgt.xmlrpcd.XmlRpcNotifier.getNodeLabel (XmlRpcNotifier.java:355) pstack showing the FD of java->pgsql socket read call: ----------------- lwp# 59 / thread# 59 -------------------- ff33d2d8 read (36, 939a18, 2000) feccbc30 JVM_Read (36, 939a18, 2000, feccb5c8, ff364278, ff36c950) + 68 30edac88 Java_java_net_SocketInputStream_socketRead0 (826364, 239ff100, 239ff0fc, 239ff0f8, 0, 2000) + 1e8 pfiles showing port number of the java->psql socket read call: 54: S_IFSOCK mode:0666 dev:304,0 ino:16062 uid:0 gid:0 size:0 O_RDWR SOCK_STREAM SO_SNDBUF(49152),SO_RCVBUF(49152) sockname: AF_INET 127.0.0.1 port: 50912 peername: AF_INET 127.0.0.1 port: 5432 ...after greping through the pfiles output of 17 postmaster processes, found the one that has socket 50912 open: 8: S_IFSOCK mode:0666 dev:304,0 ino:13881 uid:0 gid:0 size:0 O_RDWR SOCK_STREAM SO_REUSEADDR,SO_KEEPALIVE,SO_SNDBUF(49152),SO_RCVBUF(49152) sockname: AF_INET 127.0.0.1 port: 5432 peername: AF_INET 127.0.0.1 port: 50912 pstack of this postmaster process: 3029: /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/data fefbca1c recv (8, 283df0, 2000, 0) 000f7420 pq_getbyte (285c00, 328598, 1, 6, 5cbd8, 5cc30) + 38 00156a38 SocketBackend (ffbff1d8, 49, 0, 49, 0, 2) + 4 00156d28 ReadCommand (ffbff1d8, 0, ffffffff, fffffff8, 0, 2ad800) + 20 00159c94 PostgresMain (4, 2c3850, 2c3830, 6, 0, 0) + a7c 0012e414 BackendRun (2d2038, 1084, 916e0, 0, ff1d2000, 1000) + 5b8 0012dc60 BackendStartup (2d2038, 2b3b0c, 0, 0, 2b3800, 5) + 150 0012bfdc ServerLoop (2, 2b3800, 2b3800, ffbffb98, ffbffc38, 286000) + 304 0012b7c8 PostmasterMain (0, 2bffe8, 2c7aa0, 2bf000, 2be400, 23f000) + b10 000f89cc main (3, 2bffe8, ffbffe2c, 2be6dc, ff1d0140, ff1d0180) + 220 00039d44 _start (0, 0, 0, 0, 0, 0) + 5c Conclusion java is doing a read on this socket, and postmaster is doing a recv. ----------end debugging--------------------- We are considering the use of DataSource instead of the deprecated DriverManager, since DriverManager is synchronized. Any thoughts on the system level debugging above or the use of the DataSource Interface? Note that we use very many threads, however, our connection pool keeps the number of connection down to around 12 with peaks for 20 or 25. -David David Hustace The OpenNMS Group, Inc. Main : +1 919 545 2553 Fax: +1 503-961-7746 Direct: +1 919 827 1201 Skype: dhustace Key Fingerprint: 8EC1 F319 2C1D 3197 AAB9 580F FACD 1D19 175E D903
David Hustace wrote: > Conclusion java is doing a read on this socket, and postmaster is doing > a recv. Well, that's pretty weird. Can you reproduce this with a recent driver and loglevel=2 (append it as a URL parameter to your JDBC URL) and send me the resulting debug output offlist? It sounds like it might be hard to trigger though :( > We are considering the use of DataSource instead of the deprecated > DriverManager, since DriverManager is synchronized. Any thoughts on > the system level debugging above or the use of the DataSource Interface? I doubt that using DataSource will make any difference. From the stack traces you showed, this does not look like a Java-level deadlock at all. (I noticed that your two stack traces seem to be talking to different server versions, though.. what's up with that?) -O
On Jan 25, 2006, at 5:12 PM, Oliver Jowett wrote: > David Hustace wrote: > >> Conclusion java is doing a read on this socket, and postmaster is >> doing a recv. > > Well, that's pretty weird. > > Can you reproduce this with a recent driver and loglevel=2 (append > it as a URL parameter to your JDBC URL) and send me the resulting > debug output offlist? It sounds like it might be hard to trigger > though :( Sure. I already have that enabled. Finding the output is tricky. We're currently using: url="jdbc:postgresql://localhost:5432/opennms? protocolVersion=2&loglevel=2" Version=2 is the latest attempt in debugging. > I doubt that using DataSource will make any difference. From the > stack traces you showed, this does not look like a Java-level > deadlock at all. You are right, we've been trying to eliminate every possible solution. Also, for future reference, can you tell me if the call to PGSimpleDataSource.getConnection() is thread safe? > (I noticed that your two stack traces seem to be talking to > different server versions, though.. what's up with that?) Not quite sure what you mean by "different server versions". If you are referring to PostgreSQL server versions, yes, we've been upgrading and trying different versions to try and fix the problem. -David David Hustace The OpenNMS Group, Inc. Main : +1 919 545 2553 Fax: +1 503-961-7746 Key Fingerprint: 8EC1 F319 2C1D 3197 AAB9 580F FACD 1D19 175E D903
David Hustace wrote: >> Can you reproduce this with a recent driver and loglevel=2 (append it >> as a URL parameter to your JDBC URL) and send me the resulting debug >> output offlist? It sounds like it might be hard to trigger though :( > > > Sure. I already have that enabled. Finding the output is tricky. The debug output should be going to the DriverManager log stream, or stderr if that's not set. > Also, for future reference, can you tell me if the call to > PGSimpleDataSource.getConnection() is thread safe? It does not change the state of the datasource and just delegates to DriverManager to do the real work, so it should be threadsafe. -O
On Jan 25, 2006, at 6:30 PM, Oliver Jowett wrote: > It does not change the state of the datasource and just delegates > to DriverManager to do the real work, so it should be threadsafe. Ah. I just saw that. in BasicDataSource.java. Thanks for your help. I guess I need improve my connection caching to try and remove stale connections or switch to another implementation. Anyone have production experience with SmartPool and a PostgreSQL driver? http:// smartpool.sourceforge.net/ It looks pretty good so far. I guess I should search the list before asking this question. (duck) -David David Hustace The OpenNMS Group, Inc. Main : +1 919 545 2553 Fax: +1 503-961-7746 Direct: +1 919 827 1201 Skype: dhustace Key Fingerprint: 8EC1 F319 2C1D 3197 AAB9 580F FACD 1D19 175E D903
David, A much more robust pooling mechanism is dbcp from apache http://jakarta.apache.org/commons/dbcp/ Dave On 25-Jan-06, at 8:18 PM, David Hustace wrote: > > On Jan 25, 2006, at 6:30 PM, Oliver Jowett wrote: > >> It does not change the state of the datasource and just delegates >> to DriverManager to do the real work, so it should be threadsafe. > > Ah. I just saw that. in BasicDataSource.java. Thanks for your > help. I guess I need improve my connection caching to try and > remove stale connections or switch to another implementation. > Anyone have production experience with SmartPool and a PostgreSQL > driver? http://smartpool.sourceforge.net/ It looks pretty good so > far. I guess I should search the list before asking this question. > (duck) > > -David > > > David Hustace > The OpenNMS Group, Inc. > Main : +1 919 545 2553 Fax: +1 503-961-7746 > Direct: +1 919 827 1201 Skype: dhustace > Key Fingerprint: 8EC1 F319 2C1D 3197 AAB9 580F FACD 1D19 175E D903 > > > > ---------------------------(end of > broadcast)--------------------------- > TIP 6: explain analyze is your friend >