Defunct postmasters - Mailing list pgsql-general

From Gavin Scott
Subject Defunct postmasters
Msg-id 1014670303.13536.95.camel@gavin.pokerpages.com
Responses Re: Defunct postmasters  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Hi,

We have lately begun having problems with our production database
running postgres 7.1 on linux kernel v 2.4.17.  The system had run
without incident for many months (there were occasional reboots).  After
we upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problems
until Feb 13, when the postmaster appeared to stop accepting new incoming
connections.  We restarted it, and the problem struck again Saturday
night (Feb 23).

In both instances, attempting to access the db via the psql command line
would just hang; no error messages were printed.  We also have two
perl scripts that connect to the database once every few minutes; one
runs on a remote server, the other locally.  Both write log files, and
both appeared to be stuck trying to make a connection.
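
One way to keep a client from hanging silently, as the psql sessions and
perl scripts did, is to run the connection attempt under a timeout and
record the outcome.  The following is only a minimal Python sketch, not the
scripts described above; the function name check_connect and the example
psql arguments (host "dbhost", database "mydb") are hypothetical.

```python
import subprocess


def check_connect(cmd, timeout_s):
    """Run a connection command and classify the result.

    Returns "ok" if the command exits with status 0, "failed" if it exits
    with a non-zero status or cannot be started, and "hung" if it is still
    running after timeout_s seconds (the child is killed in that case).
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except OSError:
        # The command could not be started at all (e.g. not installed).
        return "failed"
    return "ok" if proc.returncode == 0 else "failed"
```

For example, check_connect(["psql", "-h", "dbhost", "-c", "SELECT 1", "mydb"], 30)
would report "hung" rather than blocking forever if the server never
answers.  A full client is used here rather than a bare TCP connect because
a bare connect can be completed by the kernel even when the server process
is not accepting connections.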

In the 2nd incident /var/log/postgresql.log contained:

Sat Feb 23 23:41:00 CST 2002
PacketReceiveFragment: read() failed: Connection reset by peer
pq_recvbuf: recv() failed: Connection reset by peer
pq_recvbuf: recv() failed: Connection reset by peer
Sat Feb 23 23:51:00 CST 2002
pq_recvbuf: recv() failed: Connection reset by peer
pq_recvbuf: recv() failed: Connection reset by peer

The problem appears to have begun at about 23:40.  The date lines above
come from a cron job I added after the 1st incident.  Without them it was
difficult to tell what was happening when the problem began that first
time; the log did contain messages similar to the above, but I can't
guarantee they were produced at the time of the problem.
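
As an illustration of the timestamp-marker approach, here is a minimal
Python sketch that appends a date line to a log file; it is not the actual
cron job.  The function names are hypothetical, the format only
approximates the default output of date(1) (the day is zero-padded rather
than space-padded), and the ten-minute interval mentioned afterwards is
inferred from the two markers shown above.

```python
import time


def marker_line(now=None):
    """Build a date line such as 'Sat Feb 23 23:41:00 CST 2002'.

    now is seconds since the epoch; None means the current time.
    """
    stamp = time.strftime("%a %b %d %H:%M:%S %Z %Y", time.localtime(now))
    return stamp + "\n"


def append_marker(log_path, now=None):
    """Append one date line to the log file and return the line written."""
    line = marker_line(now)
    with open(log_path, "a") as log:
        log.write(line)
    return line
```

Run from cron every ten minutes against /var/log/postgresql.log, this
brackets each server message between two known times.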

dmesg both on the postgres machine and our remote server which accesses
it via the script mentioned above showed a couple of lines like:

sending pkt_too_big to self
sending pkt_too_big to self

Since there aren't any timestamps in dmesg I can't guarantee that those
were produced at the time of the incident.  Also I did not check dmesg
during the 1st incident.

In both incidents there were multiple zombies hanging around:

postgres 21264  0.0  0.0     0    0 ?        Z    Feb23   0:00 [postmaster <defunct>]
postgres 21266  0.0  0.0     0    0 ?        Z    Feb23   0:00 [postmaster <defunct>]
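
A check for zombies like these can be automated by scanning process
listings for a state beginning with "Z".  The sketch below is a
hypothetical helper, assuming the input uses the same column layout as the
listing above (USER PID %CPU %MEM VSZ RSS TTY STAT ...); it parses text
handed to it and does not run ps itself.

```python
def zombie_pids(ps_output):
    """Return the PIDs of zombie processes found in ps-style text.

    A line counts as a process entry when its second field is a numeric
    PID and it has at least eight fields; the eighth field is the state.
    Header lines and wrapped command lines are skipped.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 8 or not fields[1].isdigit():
            continue
        if fields[7].startswith("Z"):
            pids.append(int(fields[1]))
    return pids
```

A periodic job could log a timestamp whenever this list is non-empty, which
would show when the postmaster stopped reaping its children.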

The system was mostly idle at the time I began investigating both
incidents.

While searching the mailing list archives I did find 2 threads that
seemed to reference similar problems.

This one sounded like an exact match:

http://groups.google.com/groups?hl=en&frame=right&th=a52001dbca656ddc&seekm=Pine.GSO.4.10.10105111011390.27338-100000%40tigger.seis.sc.edu#s

There were similar elements mentioned here:

http://archives.postgresql.org/pgsql-hackers/2002-01/msg01142.php

I was especially intrigued by this quote from Tom Lane in the 2nd link:

"It sounds like the postmaster got into a state where it was not
responding to SIGCHLD signals.  We fixed one possible cause of that
between 7.1 and 7.2, but without a more concrete report I have no way to
know if you saw the same problem or a different one.  I'd have expected
connection attempts to unwedge the postmaster in any case."

Does anyone have any idea what might be causing our problem and whether
or not upgrading to 7.2 might solve it?

Also, does anyone know any reason to NOT upgrade to 7.2?  I've only
recently joined this list, so I may have overlooked outstanding known
problems with 7.2.


Thanks,
Gavin Scott
gavin@pokerpages.com

