Thread: Defunct postmasters

Defunct postmasters

From: Gavin Scott
Hi,

We have lately begun having problems with our production database
running postgres 7.1 on linux kernel v 2.4.17.  The system had run
without incident for many months (there were occasional reboots).  Since
we upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problem
until Feb 13, when postmaster appeared to stop taking new incoming
connections. We restarted and then the problem struck again Saturday
night (Feb 23).

In both instances, attempting to access the db via the psql command line
would just hang -- no error messages were printed.  We also have two
perl scripts running that connect to the database once every few
minutes; one runs on a remote server, the other locally.  Both create
log files, and both appeared to be stuck trying to make a connection.

In the 2nd incident /var/log/postgresql.log contained:

Sat Feb 23 23:41:00 CST 2002
PacketReceiveFragment: read() failed: Connection reset by peer
pq_recvbuf: recv() failed: Connection reset by peer
pq_recvbuf: recv() failed: Connection reset by peer
Sat Feb 23 23:51:00 CST 2002
pq_recvbuf: recv() failed: Connection reset by peer
pq_recvbuf: recv() failed: Connection reset by peer

23:40 appears to have been when the problem began.  I added a cron job
to put the date lines in the above.  In the 1st incident I didn't have
that, so it was difficult to tell when things started; the log did
contain messages similar to the above, but I can't guarantee they were
produced at the time of the problem.

dmesg on both the postgres machine and the remote server that accesses
it via the script mentioned above showed a couple of lines like:

sending pkt_too_big to self
sending pkt_too_big to self

Since there aren't any timestamps in dmesg I can't guarantee that those
were produced at the time of incident.  Also I did not check dmesg
during the 1st incident.

In both incidents there were multiple zombies hanging around:

postgres 21264  0.0  0.0     0    0 ?        Z    Feb23   0:00
[postmaster <defunct>]
postgres 21266  0.0  0.0     0    0 ?        Z    Feb23   0:00
[postmaster <defunct>]

The system was mostly idle at the time I began investigating both
incidents.

While searching the mailing list archives I did find 2 threads that
seemed to reference similar problems.

This one sounded like an exact match:

http://groups.google.com/groups?hl=en&frame=right&th=a52001dbca656ddc&seekm=Pine.GSO.4.10.10105111011390.27338-100000%40tigger.seis.sc.edu#s
There were similar elements mentioned here:
http://archives.postgresql.org/pgsql-hackers/2002-01/msg01142.php

I was especially intrigued by this quote from Tom Lane in the 2nd link:

"It sounds like the postmaster got into a state where it was not
responding to SIGCHLD signals.  We fixed one possible cause of that
between 7.1 and 7.2, but without a more concrete report I have no way to
know if you saw the same problem or a different one.  I'd have expected
connection attempts to unwedge the postmaster in any case."

Does anyone have any idea what might be causing our problem, and whether
or not upgrading to 7.2 might solve it?

Also, does anyone know any reason to NOT upgrade to 7.2?  I've only
recently joined this list, so I may have overlooked outstanding known
problems with 7.2.


Thanks,
Gavin Scott
gavin@pokerpages.com


Re: Defunct postmasters

From: Tom Lane
Gavin Scott <gavin@pokerpages.com> writes:
> We have lately begun having problems with our production database
> running postgres 7.1 on linux kernel v 2.4.17.  The system had run
> without incident for many months (there were occasional reboots).  Since
> we upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problem
> until Feb 13, when postmaster appeared to stop taking new incoming
> connections. We restarted and then the problem struck again Saturday
> night (Feb 23).

If it happens again, could you attach to the postmaster with gdb and get
a stack trace from it?

> This one sounded like an exact match:
>
> http://groups.google.com/groups?hl=en&frame=right&th=a52001dbca656ddc&seekm=Pine.GSO.4.10.10105111011390.27338-100000%40tigger.seis.sc.edu#s


After looking again at the thread with Philip Crotwell, I have developed
a theory that might explain the postmaster's failing to reap zombie
(defunct) children right away.  The basic loop in the postmaster is to
use select(2) to wait for a connection attempt, and when one occurs,
use accept(2) to establish the connection; then fork off a child process
to deal with the connection, and return to the select().  Zombie
children are supposed to be reaped by the SIGCHLD signal handler, which
we enable only while waiting for select().

The scenario that comes to mind is: suppose that an abortive connection
attempt triggers select() to return a connection-ready indication, but
by the time we reach the accept() call, the kernel has decided the
connection was bogus.  (This seems somewhat plausible in the case of
a portscan, much less so for real connection attempts.)  The accept()
would then block waiting for another connection attempt to come in.
Until one happened, no SIGCHLD interrupts could be serviced, so you
might see zombie children hanging around after awhile.

The flaw in this idea is that once a second connection attempt does come
in, everything should be perfectly back to normal: the postmaster will
accept it and then resume normal operations.  So it's not at all clear
how this could cause your complaint of being unable to accept new
connections.

Nonetheless, Philip did exhibit a stack trace showing the postmaster
waiting at accept().  If someone else can replicate that, I'd start to
think that we had enough material to justify filing a Linux kernel bug
report.  Perhaps it's the kernel, not the postmaster, that's wedged
somehow.

I am thinking that it'd be a good idea for the postmaster to run the
listening socket in nonblock mode, which should theoretically prevent
the accept() call from blocking when there's no new connection
available.  It's not clear whether that would be a workaround for a
kernel bug, if there is one --- but it might be worth trying.


> Also, does anyone know any reason to NOT upgrade to 7.2?

The only significant glitch I've heard of is that pg_dump and psql have
a little disagreement over the handling of mixed-case database names and
user names.  If you have any, you might have to hand-edit your pg_dump
script (put double quotes around such names in \connect lines) before
you can reload the database.
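For instance, with hypothetical mixed-case names, the hand-edit would
look like this:

```sql
-- pg_dump output line as emitted (hypothetical names "MyDB", "WebUser"):
\connect MyDB WebUser

-- hand-edited so psql preserves the case on reload:
\connect "MyDB" "WebUser"
```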

            regards, tom lane

Re: Defunct postmasters

From: Gavin Scott
On Mon, 2002-02-25 at 16:59, Tom Lane wrote:
> Gavin Scott <gavin@pokerpages.com> writes:
> > We have lately begun having problems with our production database
> > running postgres 7.1 on linux kernel v 2.4.17.  The system had run
> > without incident for many months (there were occasional reboots).  Since
> > we upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problem
> > until Feb 13, when postmaster appeared to stop taking new incoming
> > connections. We restarted and then the problem struck again Saturday
> > night (Feb 23).
>
> If it happens again, could you attach to the postmaster with gdb and get
> a stack trace from it?

Just happened again this morning.  It turns out this looks like an
openssl problem, not a postgresql one:

[root@me2 gavin]# gdb /usr/bin/postmaster 736
GNU gdb 5.0
Copyright 2000 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
(no debugging symbols found)...
/big/home/gavin/736: No such file or directory.
Attaching to program: /usr/bin/postmaster, Pid 736
Reading symbols from /usr/lib/libssl.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libssl.so.0
Reading symbols from /usr/lib/libcrypto.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libcrypto.so.0
Reading symbols from /usr/kerberos/lib/libkrb5.so.3...(no debugging symbols found)...done.
Loaded symbols for /usr/kerberos/lib/libkrb5.so.3
Reading symbols from /usr/kerberos/lib/libk5crypto.so.3...(no debugging symbols found)...done.
Loaded symbols for /usr/kerberos/lib/libk5crypto.so.3
Reading symbols from /usr/kerberos/lib/libcom_err.so.3...(no debugging symbols found)...done.
Loaded symbols for /usr/kerberos/lib/libcom_err.so.3
Reading symbols from /usr/lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /usr/lib/libreadline.so.4.1...done.
Loaded symbols for /usr/lib/libreadline.so.4.1
Reading symbols from /lib/libtermcap.so.2...done.
Loaded symbols for /lib/libtermcap.so.2
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
0x40310424 in __libc_read () from /lib/libc.so.6
(gdb) where
#0  0x40310424 in __libc_read () from /lib/libc.so.6
#1  0x400f8e54 in __DTOR_END__ () from /usr/lib/libcrypto.so.0
#2  0x40090f53 in BIO_read () from /usr/lib/libcrypto.so.0
#3  0x4002dada in ssl3_read_n () from /usr/lib/libssl.so.0
#4  0x4002dbca in ssl3_get_record () from /usr/lib/libssl.so.0
#5  0x4002e633 in ssl3_read_bytes () from /usr/lib/libssl.so.0
#6  0x4002f517 in ssl3_get_message () from /usr/lib/libssl.so.0
#7  0x40027011 in ssl3_check_client_hello () from /usr/lib/libssl.so.0
#8  0x40026c19 in ssl3_accept () from /usr/lib/libssl.so.0
#9  0x40033842 in SSL_accept () from /usr/lib/libssl.so.0
#10 0x4003056a in ssl23_get_client_hello () from /usr/lib/libssl.so.0
#11 0x4002fd57 in ssl23_accept () from /usr/lib/libssl.so.0
#12 0x40033842 in SSL_accept () from /usr/lib/libssl.so.0
#13 0x80f136d in PostmasterMain ()
#14 0x80cf372 in PacketReceiveFragment ()
#15 0x80f10b1 in PostmasterMain ()
#16 0x80f0ac4 in PostmasterMain ()
#17 0x80cf7e8 in main ()
#18 0x40258f31 in __libc_start_main (main=0x80cf680 <main>, argc=5,
    ubp_av=0xbffffab4, init=0x8065e04 <_init>, fini=0x8154d60 <_fini>,
    rtld_fini=0x4000e274 <_dl_fini>, stack_end=0xbffffaac)
    at ../sysdeps/generic/libc-start.c:129

It did indeed seem to be initiated by some odd networking problems in
the path to the machine that makes SSL postgres connections to the
postgres server.

We were running openssl 0.9.5 on that machine.  I've upgraded it to
0.9.6 for now and am going to look into whether this was a known bug
in 0.9.5.

Thanks for all your help!

Gavin Scott
gavin@pokerpages.com



Re: Defunct postmasters

From: Tom Lane
Gavin Scott <gavin@pokerpages.com> writes:
> Just happened again this morning.  It turns out this looks like an
> openssl problem, not a postgresql one:

Hmm.  It'd be worth your while to update to Postgres 7.2.  In 7.2 the
postmaster forks a child process before it starts to do anything with
the connection, so hangups in SSL can't lock the whole postmaster, only
the one connection.  Or that's the theory anyway.

            regards, tom lane

Re: Defunct postmasters

From: Holger Marzen
On 28 Feb 2002, Gavin Scott wrote:

> On Mon, 2002-02-25 at 16:59, Tom Lane wrote:
> > Gavin Scott <gavin@pokerpages.com> writes:
> > > We have lately begun having problems with our production database
> > > running postgres 7.1 on linux kernel v 2.4.17.  The system had run
> > > without incident for many months (there were occasional reboots).  Since
> > > we upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problem
> > > until Feb 13, when postmaster appeared to stop taking new incoming
> > > connections. We restarted and then the problem struck again Saturday
> > > night (Feb 23).
> >
> > If it happens again, could you attach to the postmaster with gdb and get
> > a stack trace from it?
>
> Just happened again this morning.  It turns out this looks like an
> openssl problem, not a postgresql one:

Just for the record: I once had the problem that I couldn't connect to
postgres after 7 days. A restart fixed this for 7 days more.

Solution: I had a "delete all files in /tmp that are older than 7 days"
cronjob that deleted the postgres fifo/socket.