Thread: BUG #1061: message type 0x49 arrived from server while idle

BUG #1061: message type 0x49 arrived from server while idle

From
"PostgreSQL Bugs List"
Date:
The following bug has been logged online:

Bug reference:      1061
Logged by:          Robert Creager

Email address:      Robert_Creager@StorageTek.com

PostgreSQL version: 7.4

Operating system:   Linux Mandrake 9.2 2.4.22 SMP

Description:        message type 0x49 arrived from server while idle

Details:


The problem is not occuring very often, 3 times in a week.  A <ps -Af> is
showing the connection "idle in transaction".  pg_autovacuum is running.

This is a new install on a new dual Xenon machine.  The process which
received the message is running on Solaris/Perl 5.6.1 using DBI 1.30 and
DBD::Pg 1.22.

The PGSQL version on Solaris is 7.3.3, but that shouldn't matter?

Compiled from source:
PostgreSQL 7.4.1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.1
(Mandrake Linux 9.2 3.3.1-2mdk)

Re: BUG #1061: message type 0x49 arrived from server while idle

From
Tom Lane
Date:
"PostgreSQL Bugs List" <pgsql-bugs@postgresql.org> writes:
> Description:        message type 0x49 arrived from server while idle

Hm.  This would seem to have something to do with issuing empty queries
(message type 0x49 = 'I' is EmptyQueryResponse).  I couldn't duplicate the
problem in some simple tests, however.  Unless you can come up with more
information, there's not a lot I can do ...

> The PGSQL version on Solaris is 7.3.3, but that shouldn't matter?

No, that's *extremely* relevant, because it means the wire protocol is
going to be 2.0 not 3.0.  I couldn't get anything funny to happen by
issuing empty queries with a 7.3 libpq and a 7.4 server, though.

            regards, tom lane

Re: BUG #1061: message type 0x49 arrived from server while

From
Robert Creager
Date:
When grilled further on (Thu, 22 Jan 2004 14:27:19 -0500),
Tom Lane <tgl@sss.pgh.pa.us> confessed:

> "PostgreSQL Bugs List" <pgsql-bugs@postgresql.org> writes:
> > Description:        message type 0x49 arrived from server while idle
>
> Hm.  This would seem to have something to do with issuing empty queries
> (message type 0x49 = 'I' is EmptyQueryResponse).  I couldn't duplicate the
> problem in some simple tests, however.  Unless you can come up with more
> information, there's not a lot I can do ...

What more information would you be looking for?  I'm thinking that maybe the
query that was executed (or not executed)?  I looked at the log and there was no
info other than "unexpected EOF on client connection", which I get regularly
from my multi-process program. I'll turn on logging of all queries to see if
any happen to be empty. I don't believe I'm executing empty queries, and the
problem is not reproducible by running the same program multiple times.  It just
shows up every now and then.

I was not getting this error while running 7.4Beta4, but that was running on
Solaris 2.8 and compiled with GCC 3.1.  The same 7.3.3 PGSQL would of been used
in this instance.

>
> > The PGSQL version on Solaris is 7.3.3, but that shouldn't matter?
>
> No, that's *extremely* relevant, because it means the wire protocol is
> going to be 2.0 not 3.0.  I couldn't get anything funny to happen by
> issuing empty queries with a 7.3 libpq and a 7.4 server, though.

I'll look into doing an install of the client side only (7.4.1) that I can
access when running my scripts, and re-install DBD::Pg against that version.
Probably won't happen 'till next week at the earliest though.

Cheers,
Rob

--
 19:48:21 up 25 days,  9:35,  4 users,  load average: 2.49, 2.43, 2.18

Re: BUG #1061: message type 0x49 arrived from server while idle

From
Tom Lane
Date:
Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> What more information would you be looking for?

Well, in the words of the old saw, if we knew what we were looking for
it wouldn't be research ...

I'm sure we could solve it quickly if we could get a reproducible test
case, but it sounds like you are nowhere near being able to provide
that.  I would suggest looking for context information: what else is
happening at the same time this happens?

Another thing that would probably give enough information to solve it
is a trace of the connection from a TCP packet sniffer, for at least
a few packets leading up to the error message.  Not sure if it's
practical to try to get one, but if you could it'd be great.

            regards, tom lane

Re: BUG #1061: message type 0x49 arrived from server while

From
Robert Creager
Date:
When grilled further on (Thu, 22 Jan 2004 22:19:44 -0500),
Tom Lane <tgl@sss.pgh.pa.us> confessed:

> Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> > What more information would you be looking for?
>
> Well, in the words of the old saw, if we knew what we were looking for
> it wouldn't be research ...

That's fine.  The 'tone' of your e-mail led me to believe I wasn't providing
something obvious, which I could believe.

>
> I'm sure we could solve it quickly if we could get a reproducible test
> case, but it sounds like you are nowhere near being able to provide
> that.  I would suggest looking for context information: what else is
> happening at the same time this happens?

Multiple 'simultaneous' connections (through forking) is a possibility.  I
believe it was 4 child processes when this occurred the last time, but I'm not
100% sure.

>
> Another thing that would probably give enough information to solve it
> is a trace of the connection from a TCP packet sniffer, for at least
> a few packets leading up to the error message.  Not sure if it's
> practical to try to get one, but if you could it'd be great.

Not a problem.  Thankfully I have full control on this box, not the Solaris side
though.  And the box is dedicated to this database, so there shouldnb't be a ton
of extra traffic either.  Just the normal mis-configured network client
broadcast jabber ;-)  I should be able to figure out tcpdump enough to filter
out all traffic except that between the two servers.

Cheers,
Rob

--
 20:40:25 up 25 days, 10:27,  4 users,  load average: 2.02, 2.02, 2.00

Re: BUG #1061: message type 0x49 arrived from server while

From
Robert Creager
Date:
When grilled further on (Sat, 24 Jan 2004 15:54:07 -0500),
Tom Lane <tgl@sss.pgh.pa.us> confessed:

> > A custom Perl 5.6.1, DBI 1.30, DBD::Pg 1.22 script on Solaris 5.8.
> > Using pgsql client version 7.3.3.  I'm thinking it may be related to
> > my forking?  My program has a db connection, which is then overwritten
> > in all child processes, first thing upon forking:
>
> Judging by that trace, there are multiple processes trying to use the
> same database connection.  Better take a closer look at what you are
> doing.
>
> I suspect that "overwriting" the connection is not good enough, and that
> you have to somehow actively close the DBI object inherited from the
> parent.  The fact that the excess operations all seem to be empty
> queries is suspicious.  I wonder whether those are coming from logic
> internal to DBI or DBD::Pg, that might still have access to the
> connection even though your Perl code doesn't anymore.

I have more of a clue now.  I'm calling "DBI->connect_cached" when I fork, even
though I only use that particular db connection in the parent.  connect_cached
sends a ping to verify the connection is alive, which is probably what the empty
queries are.  The logs confirm that the pid associated with that connection
[721] and [1050] don't actually execute any requests.  The new server is much
faster than the old one, so it is probably processing the packets faster, thus
occasionally getting into this problem when the fork timing is just wrong and
the dual packets are received before the first can be processed.

I can remove that second connection (which was used to vacuum) now that I have
pg_autovacuum running on the new server.  I suspect this will clear up the empty
queries and the message coming back because of the overlapping empty queries.

Presuming this is the problem, I never would of found this without your help.
Thanks for taking the time.  I'll not forget to be more careful of using
"connect_cached" next time around.

>
> Hmm ... AFAIR, libpq doesn't provide any clean way to close the socket
> and discard the PGconn without actually sending a Terminate message to
> the backend, which is what you'd want to happen in the child process.
> This seems like a shortcoming in its API.
>

Cheers,
Rob

--
 14:08:58 up 27 days,  3:55,  4 users,  load average: 2.05, 2.05, 2.05