Thread: Mixed up protocol packets in server response?
I have an instance of what looks like a weird mixup of data returned by the server. This is observed in a Java application using the 9.0-801.jdbc4 JDBC driver connecting to an 8.3.4 server (yes, it's old) running on Solaris. The application hung waiting for the result of one select * from a_table where a_field = parameter; Looking at it with a debugger I see that: 1. it waits in: SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available [native method] SocketInputStream.read(byte[], int, int) line: 129 VisibleBufferedInputStream.read(byte[], int, int) line: 219 PGStream.Receive(byte[], int, int) line: 460 PGStream.ReceiveTupleV3() line: 365 QueryExecutorImpl.processResults(ResultHandler, int) line: 1814 QueryExecutorImpl.execute(Query, ParameterList, ResultHandler, int, int, int) line: 257 Jdbc4PreparedStatement(AbstractJdbc2Statement).execute(Query, ParameterList, int) line: 500 Jdbc4PreparedStatement(AbstractJdbc2Statement).executeWithFlags(int) line: 388 Jdbc4PreparedStatement(AbstractJdbc2Statement).executeQuery() line: 273 because it expects a field of size = 825767394 (l_size in ReceiveTupleV3) which is much more than the whole result should be. 2. then looking at the answer array in ReceiveTupleV3: A few first fields have the expected values, then answer[3] has the expected size but somewhere in the middle it changes from field 3 of the first row of the result to the middle of field 3 of the last row of the result. Then of course there are 4 bytes of this field that as a 32-bit integer have the value of 825767394. Then answer[4] obviously has 825767394 elements. And it continues with the field 3 of the last row and then it contains the 4-byte length and then value of the field 4 (last) of the last row of the result - so the end of a DataRow message. Then a CommandComplete message with the tag "SELECT", then a ReadyForQuery message with the status indicator "T". And then there is the rest of the field 3 of the first row, then length and value for the field 4 of the first row and then the DataRow messages for the following rows of the result. The last one up until the place where it got mixed up before. So it looks like the server wanted to send D row 1, D row 2, D row 3, D row 4, C SELECT, Z T but the application sees D ro, ow 4, C SELECT, Z T, w 1, D row 2, D row 3, D r What may be the cause of this weird problem? Is it some known or unknown bug in 8.3.4 or is the application/Java side more suspected? -- Michał Politowski Talking has been known to lead to communication if practiced carelessly.
Michal Politowski <mpol+pg@meep.pl> writes: > 2. then looking at the answer array in ReceiveTupleV3: > ... > So it looks like the server wanted to send > D row 1, D row 2, D row 3, D row 4, C SELECT, Z T > but the application sees > D ro, ow 4, C SELECT, Z T, w 1, D row 2, D row 3, D r > What may be the cause of this weird problem? Is it some known or unknown bug in > 8.3.4 or is the application/Java side more suspected? It's just about impossible to believe that the server sent that, because (1) it generates and sends those messages one at a time, and (2) nothing of the sort has ever been reported before. My money is on some foulup in buffer-wrangling on the client side. regards, tom lane
On 1/06/2011 9:06 PM, Michal Politowski wrote: > What may be the cause of this weird problem? Is it some known or unknown bug in > 8.3.4 or is the application/Java side more suspected? It'd be really helpful if you could collect and examine a trace of the client/server communication using WireShark. That way you can confirm whether it is (as Tom suspects) the client side mangling its buffers, or whether the server really did send the nonsensical sequence. -- Craig Ringer Tech-related writing at http://soapyfrogs.blogspot.com/
On Thu, 2 Jun 2011 08:50:30 +0800, Craig Ringer wrote: > On 1/06/2011 9:06 PM, Michal Politowski wrote: > > >What may be the cause of this weird problem? Is it some known or unknown bug in > >8.3.4 or is the application/Java side more suspected? > > It'd be really helpful if you could collect and examine a trace of > the client/server communication using WireShark. That way you can > confirm whether it is (as Tom suspects) the client side mangling its > buffers, or whether the server really did send the nonsensical > sequence. Actually my own money is with Tom's. It's very hard to believe Postgres would do something like this, unless it were some obvious and long fixed bug in 8.3.4. Still, trying to trace the communication makes sense, if I can convince the owners of the system to let me do it. Unfortunately this is an "one in a million of successful queries" (actually two in much more than a million) problem. And the next run of the application seems not to have hit it, yet. Thinking aloud: If this is, as it is to be suspected, an application-side problem, there is at first sight not much space in the application where it could hide. The data is mixed up in a driver buffer, two method calls from the standard library socket code. There is the VisibleBufferedInputStream there. Could it do something like this? Maybe if the connection was erroneously used from two threads concurrently? The connections are pooled via commons-dbcp BasicDataSource and queries are executed via Spring JdbcTemplate within Spring-configured transaction. No passing connections by hand anywhere, everything should be nicely thread-bound. Still, if not here, where could it go wrong? -- Michał Politowski Talking has been known to lead to communication if practiced carelessly.
Michal Politowski <mpol+pg@meep.pl> writes: > Thinking aloud: If this is, as it is to be suspected, an application-side problem, > there is at first sight not much space in the application where it could hide. The data is > mixed up in a driver buffer, two method calls from the standard library > socket code. There is the VisibleBufferedInputStream there. Could it do > something like this? Maybe if the connection was erroneously used from two threads > concurrently? The connections are pooled via commons-dbcp BasicDataSource > and queries are executed via Spring JdbcTemplate within Spring-configured > transaction. No passing connections by hand anywhere, everything should be > nicely thread-bound. Still, if not here, where could it go wrong? You'd probably be better off asking these questions in pgsql-jdbc ... a lot of us here don't even speak Java. regards, tom lane
On Fri, 3 Jun 2011 09:53:59 -0400, Tom Lane wrote: [...] > You'd probably be better off asking these questions in pgsql-jdbc ... > a lot of us here don't even speak Java. Good point. Will try the other list. Thank you for your help. -- Michał Politowski Talking has been known to lead to communication if practiced carelessly.
> transaction. No passing connections by hand anywhere, everything should be > nicely thread-bound. Still, if not here, where could it go wrong? > I have seen two cases in my career where there was an "evil box" on the network that corrupted the traffic. The first was a very long time ago (in the late '80s) but the second was only a couple of years ago and presented with very similar symptoms to your report. This happened at a consulting client's site (actually between two sites). Weird "franken-packets" showed up once in a while, leading to a protocol decode failure. Luckily we had been involved in writing both the client and the server, and therefore had a high degree of confidence that they were correct. The network administrators denied strongly that they had any equipment deployed that touched the payload of any packet. They denied this several times. Eventually we were able to take packet traces on both client and server machines, correlate the traffic (not necessarily an easy task), and prove conclusively that what had been sent from one end did not show up intact at the other end. A few days later the network people revealed that they had some sort of firewall/traffic management box that was mangling the traffic. Having said that, bugs in buffer management code are also not uncommon, and can manifest intermittently since they may be triggered by specific boundary conditions, specific received data buffer size, and so on. I have also seen once case of data leaking between threads in an unpleasant and intermittent way in a Java application, in buffer management code that attempted to avoid GC overhead by re-using buffers across sessions. So that's definitely a non-zero possibility too.