Re: Defunct postmasters - Mailing list pgsql-general

From Tom Lane
Subject Re: Defunct postmasters
Date
Msg-id 25062.1014677942@sss.pgh.pa.us
In response to Defunct postmasters  (Gavin Scott <gavin@pokerpages.com>)
Responses Re: Defunct postmasters  (Gavin Scott <gavin@pokerpages.com>)
List pgsql-general
Gavin Scott <gavin@pokerpages.com> writes:
> We have lately begun having problems with our production database
> running postgres 7.1 on linux kernel v 2.4.17.  The system had run
> without incident for many months (there were occasional reboots).  Since
> we upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problem
> until Feb 13, when postmaster appeared to stop taking new incoming
> connections. We restarted and then the problem struck again Saturday
> night (Feb 23).

If it happens again, could you attach to the postmaster with gdb and get
a stack trace from it?

> This one sounded like an exact match:
> http://groups.google.com/groups?hl=en&frame=right&th=a52001dbca656ddc&seekm=Pine.GSO.4.10.10105111011390.27338-100000%40tigger.seis.sc.edu#s

After looking again at the thread with Philip Crotwell, I have developed
a theory that might explain the postmaster's failing to reap zombie
(defunct) children right away.  The basic loop in the postmaster is to
use select(2) to wait for a connection attempt, and when one occurs,
use accept(2) to establish the connection; then fork off a child process
to deal with the connection, and return to the select().  Zombie
children are supposed to be reaped by the SIGCHLD signal handler, which
we enable only while waiting for select().
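
For readers who want to see the shape of that loop, here is a minimal sketch in Python. It is not the postmaster's actual C code. The names serve and reap_children and the "hello" reply are hypothetical, chosen only for illustration. SIGCHLD is unblocked only around select(), as described above. Python runs signal handlers at the interpreter level, so the timing only approximates what a C program would see.

```python
import os
import select
import signal
import socket


def reap_children(signum=None, frame=None):
    """SIGCHLD handler: collect every exited child without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children exist at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def serve(listener, max_connections):
    """Accept max_connections clients, forking one child per client."""
    # Keep SIGCHLD blocked except while waiting in select().
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    handled = 0
    try:
        while handled < max_connections:
            signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGCHLD})
            select.select([listener], [], [])  # wait for a connection attempt
            signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})

            # With a blocking listener, this call waits if the connection
            # that woke select() is no longer pending.
            conn, _addr = listener.accept()

            pid = os.fork()
            if pid == 0:
                # Child: deal with this one connection, then exit.
                try:
                    listener.close()
                    conn.sendall(b"hello\n")
                    conn.close()
                finally:
                    os._exit(0)
            conn.close()  # parent keeps only the listening socket
            handled += 1
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return handled
```

To try it, install the handler with signal.signal(signal.SIGCHLD, reap_children) before calling serve() on a bound, listening socket.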

The scenario that comes to mind is: suppose that an abortive connection
attempt triggers select() to return a connection-ready indication, but
by the time we reach the accept() call, the kernel has decided the
connection was bogus.  (This seems somewhat plausible in the case of
a portscan, much less so for real connection attempts.)  The accept()
would then block waiting for another connection attempt to come in.
Until one happened, no SIGCHLD interrupts could be serviced, so you
might see zombie children hanging around after a while.

The flaw in this idea is that once a second connection attempt does come
in, everything should be perfectly back to normal: the postmaster will
accept it and then resume normal operations.  So it's not at all clear
how this could cause your complaint of being unable to accept new
connections.

Nonetheless, Philip did exhibit a stack trace showing the postmaster
waiting at accept().  If someone else can replicate that, I'd start to
think that we had enough material to justify filing a Linux kernel bug
report.  Perhaps it's the kernel, not the postmaster, that's wedged
somehow.

I am thinking that it'd be a good idea for the postmaster to run the
listening socket in nonblock mode, which should theoretically prevent
the accept() call from blocking when there's no new connection
available.  It's not clear whether that would be a workaround for a
kernel bug, if there is one --- but it might be worth trying.
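
As a rough illustration of the non-blocking idea (a Python sketch, not a patch for the postmaster), the helper names make_listener and try_accept below are hypothetical. With the listening socket in non-blocking mode, accept() reports "nothing pending" immediately instead of waiting, so the caller can go straight back to select().

```python
import select
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create a listening TCP socket in non-blocking mode."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(5)
    listener.setblocking(False)
    return listener


def try_accept(listener):
    """Return a connected socket, or None if no connection is pending."""
    try:
        conn, _addr = listener.accept()
    except BlockingIOError:
        # Nothing to accept right now: return to the select() loop
        # rather than waiting here.
        return None
    # Whether the new socket inherits non-blocking mode is OS-dependent,
    # so set it explicitly.
    conn.setblocking(True)
    return conn
```

In a server loop, a None result from try_accept() would simply mean "go around again", which keeps the process in select() where SIGCHLD can be serviced.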


> Also, does anyone know any reason to NOT upgrade to 7.2?

The only significant glitch I've heard of is that pg_dump and psql have
a little disagreement over the handling of mixed-case database names and
user names.  If you have any, you might have to hand-edit your pg_dump
script (put double quotes around such names in \connect lines) before
you can reload the database.
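
If there are many such lines, the hand-edit could be scripted. The following Python sketch assumes the \connect lines have the form "\connect <dbname-or-dash> [<username>]" with no spaces inside names. That layout is an assumption, so check it against your own dump before relying on it. The function names are hypothetical.

```python
import re

# Assumed layout: \connect <dbname or -> [<username>]
CONNECT_RE = re.compile(r'^\\connect\s+(\S+)(?:\s+(\S+))?\s*$')


def quote_if_mixed_case(name):
    """Double-quote a name that contains upper-case letters."""
    if name == "-" or name.startswith('"'):
        return name  # placeholder or already quoted: leave alone
    if name != name.lower():
        return '"' + name + '"'
    return name


def fix_connect_line(line):
    """Quote mixed-case names in one \\connect line; pass others through."""
    body = line.rstrip("\n")
    match = CONNECT_RE.match(body)
    if not match:
        return line
    parts = [quote_if_mixed_case(p) for p in match.groups() if p is not None]
    # Re-attach whatever line ending the original had.
    return "\\connect " + " ".join(parts) + line[len(body):]
```

Applying fix_connect_line to each line of the dump script and writing the results to a new file leaves every other statement untouched.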

            regards, tom lane
