Thread: BUG #1061: message type 0x49 arrived from server while idle
The following bug has been logged online: Bug reference: 1061 Logged by: Robert Creager Email address: Robert_Creager@StorageTek.com PostgreSQL version: 7.4 Operating system: Linux Mandrake 9.2 2.4.22 SMP Description: message type 0x49 arrived from server while idle Details: The problem is not occuring very often, 3 times in a week. A <ps -Af> is showing the connection "idle in transaction". pg_autovacuum is running. This is a new install on a new dual Xenon machine. The process which received the message is running on Solaris/Perl 5.6.1 using DBI 1.30 and DBD::Pg 1.22. The PGSQL version on Solaris is 7.3.3, but that shouldn't matter? Compiled from source: PostgreSQL 7.4.1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.1 (Mandrake Linux 9.2 3.3.1-2mdk)
"PostgreSQL Bugs List" <pgsql-bugs@postgresql.org> writes: > Description: message type 0x49 arrived from server while idle Hm. This would seem to have something to do with issuing empty queries (message type 0x49 = 'I' is EmptyQueryResponse). I couldn't duplicate the problem in some simple tests, however. Unless you can come up with more information, there's not a lot I can do ... > The PGSQL version on Solaris is 7.3.3, but that shouldn't matter? No, that's *extremely* relevant, because it means the wire protocol is going to be 2.0 not 3.0. I couldn't get anything funny to happen by issuing empty queries with a 7.3 libpq and a 7.4 server, though. regards, tom lane
When grilled further on (Thu, 22 Jan 2004 14:27:19 -0500), Tom Lane <tgl@sss.pgh.pa.us> confessed: > "PostgreSQL Bugs List" <pgsql-bugs@postgresql.org> writes: > > Description: message type 0x49 arrived from server while idle > > Hm. This would seem to have something to do with issuing empty queries > (message type 0x49 = 'I' is EmptyQueryResponse). I couldn't duplicate the > problem in some simple tests, however. Unless you can come up with more > information, there's not a lot I can do ... What more information would you be looking for? I'm thinking that maybe the query that was executed (or not executed)? I looked at the log and there was no info other than "unexpected EOF on client connection", which I get regularly from my multi-process program. I'll turn on logging of all queries to see if any happen to be empty. I don't believe I'm executing empty queries, and the problem is not reproducible by running the same program multiple times. It just shows up every now and then. I was not getting this error while running 7.4Beta4, but that was running on Solaris 2.8 and compiled with GCC 3.1. The same 7.3.3 PGSQL would of been used in this instance. > > > The PGSQL version on Solaris is 7.3.3, but that shouldn't matter? > > No, that's *extremely* relevant, because it means the wire protocol is > going to be 2.0 not 3.0. I couldn't get anything funny to happen by > issuing empty queries with a 7.3 libpq and a 7.4 server, though. I'll look into doing an install of the client side only (7.4.1) that I can access when running my scripts, and re-install DBD::Pg against that version. Probably won't happen 'till next week at the earliest though. Cheers, Rob -- 19:48:21 up 25 days, 9:35, 4 users, load average: 2.49, 2.43, 2.18
Robert Creager <Robert_Creager@LogicalChaos.org> writes: > What more information would you be looking for? Well, in the words of the old saw, if we knew what we were looking for it wouldn't be research ... I'm sure we could solve it quickly if we could get a reproducible test case, but it sounds like you are nowhere near being able to provide that. I would suggest looking for context information: what else is happening at the same time this happens? Another thing that would probably give enough information to solve it is a trace of the connection from a TCP packet sniffer, for at least a few packets leading up to the error message. Not sure if it's practical to try to get one, but if you could it'd be great. regards, tom lane
When grilled further on (Thu, 22 Jan 2004 22:19:44 -0500), Tom Lane <tgl@sss.pgh.pa.us> confessed: > Robert Creager <Robert_Creager@LogicalChaos.org> writes: > > What more information would you be looking for? > > Well, in the words of the old saw, if we knew what we were looking for > it wouldn't be research ... That's fine. The 'tone' of your e-mail led me to believe I wasn't providing something obvious, which I could believe. > > I'm sure we could solve it quickly if we could get a reproducible test > case, but it sounds like you are nowhere near being able to provide > that. I would suggest looking for context information: what else is > happening at the same time this happens? Multiple 'simultaneous' connections (through forking) is a possibility. I believe it was 4 child processes when this occurred the last time, but I'm not 100% sure. > > Another thing that would probably give enough information to solve it > is a trace of the connection from a TCP packet sniffer, for at least > a few packets leading up to the error message. Not sure if it's > practical to try to get one, but if you could it'd be great. Not a problem. Thankfully I have full control on this box, not the Solaris side though. And the box is dedicated to this database, so there shouldnb't be a ton of extra traffic either. Just the normal mis-configured network client broadcast jabber ;-) I should be able to figure out tcpdump enough to filter out all traffic except that between the two servers. Cheers, Rob -- 20:40:25 up 25 days, 10:27, 4 users, load average: 2.02, 2.02, 2.00
When grilled further on (Sat, 24 Jan 2004 15:54:07 -0500), Tom Lane <tgl@sss.pgh.pa.us> confessed: > > A custom Perl 5.6.1, DBI 1.30, DBD::Pg 1.22 script on Solaris 5.8. > > Using pgsql client version 7.3.3. I'm thinking it may be related to > > my forking? My program has a db connection, which is then overwritten > > in all child processes, first thing upon forking: > > Judging by that trace, there are multiple processes trying to use the > same database connection. Better take a closer look at what you are > doing. > > I suspect that "overwriting" the connection is not good enough, and that > you have to somehow actively close the DBI object inherited from the > parent. The fact that the excess operations all seem to be empty > queries is suspicious. I wonder whether those are coming from logic > internal to DBI or DBD::Pg, that might still have access to the > connection even though your Perl code doesn't anymore. I have more of a clue now. I'm calling "DBI->connect_cached" when I fork, even though I only use that particular db connection in the parent. connect_cached sends a ping to verify the connection is alive, which is probably what the empty queries are. The logs confirm that the pid associated with that connection [721] and [1050] don't actually execute any requests. The new server is much faster than the old one, so it is probably processing the packets faster, thus occasionally getting into this problem when the fork timing is just wrong and the dual packets are received before the first can be processed. I can remove that second connection (which was used to vacuum) now that I have pg_autovacuum running on the new server. I suspect this will clear up the empty queries and the message coming back because of the overlapping empty queries. Presuming this is the problem, I never would of found this without your help. Thanks for taking the time. I'll not forget to be more careful of using "connect_cached" next time around. > > Hmm ... AFAIR, libpq doesn't provide any clean way to close the socket > and discard the PGconn without actually sending a Terminate message to > the backend, which is what you'd want to happen in the child process. > This seems like a shortcoming in its API. > Cheers, Rob -- 14:08:58 up 27 days, 3:55, 4 users, load average: 2.05, 2.05, 2.05