Re: Defunct postmasters - Mailing list pgsql-general
From | Tom Lane |
---|---|
Subject | Re: Defunct postmasters |
Date | |
Msg-id | 25062.1014677942@sss.pgh.pa.us |
In response to | Defunct postmasters (Gavin Scott <gavin@pokerpages.com>) |
Responses | Re: Defunct postmasters (Gavin Scott <gavin@pokerpages.com>) |
List | pgsql-general |
Gavin Scott <gavin@pokerpages.com> writes:
> We have lately begun having problems with our production database
> running postgres 7.1 on linux kernel v 2.4.17. The system had run
> without incident for many months (there were occasional reboots). Since
> we upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problem
> until Feb 13, when postmaster appeared to stop taking new incoming
> connections. We restarted and then the problem struck again Saturday
> night (Feb 23).

If it happens again, could you attach to the postmaster with gdb and get a stack trace from it?

> This one sounded like an exact match:
> http://groups.google.com/groups?hl=en&frame=right&th=a52001dbca656ddc&seekm=Pine.GSO.4.10.10105111011390.27338-100000%40tigger.seis.sc.edu#s

After looking again at the thread with Philip Crotwell, I have developed a theory that might explain the postmaster's failing to reap zombie (defunct) children right away.

The basic loop in the postmaster is to use select(2) to wait for a connection attempt, and when one occurs, use accept(2) to establish the connection; then fork off a child process to deal with the connection, and return to the select(). Zombie children are supposed to be reaped by the SIGCHLD signal handler, which we enable only while waiting in select().

The scenario that comes to mind is: suppose that an abortive connection attempt triggers select() to return a connection-ready indication, but by the time we reach the accept() call, the kernel has decided the connection was bogus. (This seems somewhat plausible in the case of a portscan, much less so for real connection attempts.) The accept() would then block waiting for another connection attempt to come in. Until one happened, no SIGCHLD interrupts could be serviced, so you might see zombie children hanging around after a while.
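
As a rough illustration of the loop described above, here is a minimal Python sketch. It is not the postmaster's code (which is written in C): the names `serve_one` and `reap_children` are hypothetical, and for simplicity the sketch reaps children with an explicit non-blocking `waitpid()` call rather than from a SIGCHLD handler. The comment on `accept()` marks the window in which a blocking listener could stall.

```python
import os
import select
import socket


def reap_children():
    """Collect every exited child without blocking; return their pids."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def serve_one(listener, timeout=None):
    """One pass of the loop: select(), accept(), fork a child, return its pid."""
    ready, _, _ = select.select([listener], [], [], timeout)
    if not ready:
        return None  # timed out with no connection attempt
    # On a blocking socket this call waits if the pending connection has
    # vanished since select() returned. That is the window described above.
    conn, _addr = listener.accept()
    pid = os.fork()
    if pid == 0:
        # Child: a real backend would serve the session here.
        listener.close()
        conn.close()
        os._exit(0)
    conn.close()  # the parent keeps only the listening socket
    return pid
```

If the parent sits inside `accept()` and never gets back to the point where children are reaped, exited children remain zombies until the next connection arrives.
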

The flaw in this idea is that once a second connection attempt does come in, everything should be perfectly back to normal: the postmaster will accept it and then resume normal operations. So it's not at all clear how this could cause your complaint of being unable to accept new connections.

Nonetheless, Philip did exhibit a stack trace showing the postmaster waiting at accept(). If someone else can replicate that, I'd start to think that we had enough material to justify filing a Linux kernel bug report. Perhaps it's the kernel, not the postmaster, that's wedged somehow.

I am thinking that it'd be a good idea for the postmaster to run the listening socket in nonblocking mode, which should theoretically prevent the accept() call from blocking when there's no new connection available. It's not clear whether that would be a workaround for a kernel bug, if there is one --- but it might be worth trying.

> Also, does anyone know any reason to NOT upgrade to 7.2?

The only significant glitch I've heard of is that pg_dump and psql have a little disagreement over the handling of mixed-case database names and user names. If you have any, you might have to hand-edit your pg_dump script (put double quotes around such names in \connect lines) before you can reload the database.

regards, tom lane
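
The nonblocking-listener idea mentioned above can be sketched as follows. This is an illustration in Python under stated assumptions, not a patch to the postmaster: `accept_nowait` and `wait_and_accept` are hypothetical names, and the sketch assumes the caller has already put the listening socket into nonblocking mode with `setblocking(False)`. With that in place, an `accept()` that finds no pending connection fails immediately instead of waiting.

```python
import select
import socket


def accept_nowait(listener):
    """Accept a pending connection, or return None if there is none.

    Assumes the listener is in nonblocking mode, so accept() raises
    BlockingIOError instead of waiting for a connection to arrive.
    """
    try:
        conn, _addr = listener.accept()
    except BlockingIOError:
        # Nothing to accept: the connection that woke select() is gone
        # (or never existed). Return to the main loop instead of stalling.
        return None
    return conn


def wait_and_accept(listener, timeout=None):
    """Wait in select() for readiness, then accept without blocking."""
    ready, _, _ = select.select([listener], [], [], timeout)
    if not ready:
        return None
    return accept_nowait(listener)
```

A caller that gets `None` back simply goes around the loop again, so the process spends its idle time in `select()` rather than in `accept()`.
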