Thread: Defunct postmasters
Hi,

We have lately begun having problems with our production database running postgres 7.1 on linux kernel v 2.4.17. The system had run without incident for many months (there were occasional reboots). Since we upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problem until Feb 13, when the postmaster appeared to stop taking new incoming connections. We restarted, and then the problem struck again Saturday night (Feb 23).

In both instances, attempting to access the db via the psql command line would just hang -- no error messages were printed. We also have two perl scripts running that connect to the database once every few minutes; one runs on a remote server, the other locally. Both create log files and both appeared to be stuck trying to make a connection.

In the 2nd incident /var/log/postgresql.log contained:

    Sat Feb 23 23:41:00 CST 2002
    PacketReceiveFragment: read() failed: Connection reset by peer
    pq_recvbuf: recv() failed: Connection reset by peer
    pq_recvbuf: recv() failed: Connection reset by peer
    Sat Feb 23 23:51:00 CST 2002
    pq_recvbuf: recv() failed: Connection reset by peer
    pq_recvbuf: recv() failed: Connection reset by peer

23:40 appears to have been when the problem began. (I added a cron job to write the date lines in the above; in the 1st incident I didn't have that, so it was difficult to tell what was happening when the problem began. That log did contain messages similar to the above, but I can't guarantee they were produced at the time of the problem.)

dmesg, both on the postgres machine and on the remote server that accesses it via the script mentioned above, showed a couple of lines like:

    sending pkt_too_big to self
    sending pkt_too_big to self

Since there aren't any timestamps in dmesg I can't guarantee that those were produced at the time of the incident. Also, I did not check dmesg during the 1st incident.

In both incidents there were multiple zombies hanging around:

    postgres 21264  0.0  0.0  0  0  ?  Z  Feb23  0:00  [postmaster <defunct>]
    postgres 21266  0.0  0.0  0  0  ?  Z  Feb23  0:00  [postmaster <defunct>]

The system was mostly idle at the time I began investigating both incidents.

While searching the mailing list archives I did find 2 threads that seemed to reference similar problems. This one sounded like an exact match:

http://groups.google.com/groups?hl=en&frame=right&th=a52001dbca656ddc&seekm=Pine.GSO.4.10.10105111011390.27338-100000%40tigger.seis.sc.edu#s

There were similar elements mentioned here:

http://archives.postgresql.org/pgsql-hackers/2002-01/msg01142.php

I was especially intrigued by this quote from Tom Lane in the 2nd link:

"It sounds like the postmaster got into a state where it was not responding to SIGCHLD signals. We fixed one possible cause of that between 7.1 and 7.2, but without a more concrete report I have no way to know if you saw the same problem or a different one. I'd have expected connection attempts to unwedge the postmaster in any case."

Does anyone have any idea what might be causing our problem, and whether or not upgrading to 7.2 might solve it? Also, does anyone know any reason NOT to upgrade to 7.2? I've only recently joined this list, so I may have overlooked outstanding known problems with 7.2.

Thanks,
Gavin Scott
gavin@pokerpages.com
Gavin Scott <gavin@pokerpages.com> writes:
> We have lately begun having problems with our production database
> running postgres 7.1 on linux kernel v 2.4.17. The system had run
> without incident for many months (there were occasional reboots). Since
> we upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problem
> until Feb 13, when postmaster appeared to stop taking new incoming
> connections. We restarted and then the problem struck again Saturday
> night (Feb 23).

If it happens again, could you attach to the postmaster with gdb and get a stack trace from it?

> This one sounded like an exact match:
> http://groups.google.com/groups?hl=en&frame=right&th=a52001dbca656ddc&seekm=Pine.GSO.4.10.10105111011390.27338-100000%40tigger.seis.sc.edu#s

After looking again at the thread with Philip Crotwell, I have developed a theory that might explain the postmaster's failing to reap zombie (defunct) children right away.

The basic loop in the postmaster is to use select(2) to wait for a connection attempt, and when one occurs, use accept(2) to establish the connection; then fork off a child process to deal with the connection, and return to the select(). Zombie children are supposed to be reaped by the SIGCHLD signal handler, which we enable only while waiting in select().

The scenario that comes to mind is: suppose that an abortive connection attempt triggers select() to return a connection-ready indication, but by the time we reach the accept() call, the kernel has decided the connection was bogus. (This seems somewhat plausible in the case of a portscan, much less so for real connection attempts.) The accept() would then block waiting for another connection attempt to come in. Until one happened, no SIGCHLD interrupts could be serviced, so you might see zombie children hanging around after awhile.
The flaw in this idea is that once a second connection attempt does come in, everything should be perfectly back to normal: the postmaster will accept it and then resume normal operations. So it's not at all clear how this could cause your complaint of being unable to accept new connections.

Nonetheless, Philip did exhibit a stack trace showing the postmaster waiting at accept(). If someone else can replicate that, I'd start to think that we had enough material to justify filing a Linux kernel bug report. Perhaps it's the kernel, not the postmaster, that's wedged somehow.

I am thinking that it'd be a good idea for the postmaster to run the listening socket in nonblock mode, which should theoretically prevent the accept() call from blocking when there's no new connection available. It's not clear whether that would be a workaround for a kernel bug, if there is one --- but it might be worth trying.

> Also, does anyone know any reason to NOT upgrade to 7.2?

The only significant glitch I've heard of is that pg_dump and psql have a little disagreement over the handling of mixed-case database names and user names. If you have any, you might have to hand-edit your pg_dump script (put double quotes around such names in \connect lines) before you can reload the database.

			regards, tom lane
On Mon, 2002-02-25 at 16:59, Tom Lane wrote:
> Gavin Scott <gavin@pokerpages.com> writes:
> > We have lately begun having problems with our production database
> > running postgres 7.1 on linux kernel v 2.4.17. The system had run
> > without incident for many months (there were occasional reboots). Since
> > we upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problem
> > until Feb 13, when postmaster appeared to stop taking new incoming
> > connections. We restarted and then the problem struck again Saturday
> > night (Feb 23).
>
> If it happens again, could you attach to the postmaster with gdb and get
> a stack trace from it?

Just happened again this morning. It turns out this looks like an openssl problem, not a postgresql one:

    [root@me2 gavin]# gdb /usr/bin/postmaster 736
    GNU gdb 5.0
    Copyright 2000 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "i386-redhat-linux"...
    (no debugging symbols found)...
    /big/home/gavin/736: No such file or directory.
    Attaching to program: /usr/bin/postmaster, Pid 736
    Reading symbols from /usr/lib/libssl.so.0...(no debugging symbols found)...done.
    Loaded symbols for /usr/lib/libssl.so.0
    Reading symbols from /usr/lib/libcrypto.so.0...(no debugging symbols found)...done.
    Loaded symbols for /usr/lib/libcrypto.so.0
    Reading symbols from /usr/kerberos/lib/libkrb5.so.3...(no debugging symbols found)...done.
    Loaded symbols for /usr/kerberos/lib/libkrb5.so.3
    Reading symbols from /usr/kerberos/lib/libk5crypto.so.3...(no debugging symbols found)...done.
    Loaded symbols for /usr/kerberos/lib/libk5crypto.so.3
    Reading symbols from /usr/kerberos/lib/libcom_err.so.3...(no debugging symbols found)...done.
    Loaded symbols for /usr/kerberos/lib/libcom_err.so.3
    Reading symbols from /usr/lib/libz.so.1...(no debugging symbols found)...done.
    Loaded symbols for /usr/lib/libz.so.1
    Reading symbols from /lib/libcrypt.so.1...done.
    Loaded symbols for /lib/libcrypt.so.1
    Reading symbols from /lib/libresolv.so.2...done.
    Loaded symbols for /lib/libresolv.so.2
    Reading symbols from /lib/libnsl.so.1...done.
    Loaded symbols for /lib/libnsl.so.1
    Reading symbols from /lib/libdl.so.2...done.
    Loaded symbols for /lib/libdl.so.2
    Reading symbols from /lib/libm.so.6...done.
    Loaded symbols for /lib/libm.so.6
    Reading symbols from /usr/lib/libreadline.so.4.1...done.
    Loaded symbols for /usr/lib/libreadline.so.4.1
    Reading symbols from /lib/libtermcap.so.2...done.
    Loaded symbols for /lib/libtermcap.so.2
    Reading symbols from /lib/libc.so.6...done.
    Loaded symbols for /lib/libc.so.6
    Reading symbols from /lib/ld-linux.so.2...done.
    Loaded symbols for /lib/ld-linux.so.2
    Reading symbols from /lib/libnss_files.so.2...done.
    Loaded symbols for /lib/libnss_files.so.2
    0x40310424 in __libc_read () from /lib/libc.so.6
    (gdb) where
    #0  0x40310424 in __libc_read () from /lib/libc.so.6
    #1  0x400f8e54 in __DTOR_END__ () from /usr/lib/libcrypto.so.0
    #2  0x40090f53 in BIO_read () from /usr/lib/libcrypto.so.0
    #3  0x4002dada in ssl3_read_n () from /usr/lib/libssl.so.0
    #4  0x4002dbca in ssl3_get_record () from /usr/lib/libssl.so.0
    #5  0x4002e633 in ssl3_read_bytes () from /usr/lib/libssl.so.0
    #6  0x4002f517 in ssl3_get_message () from /usr/lib/libssl.so.0
    #7  0x40027011 in ssl3_check_client_hello () from /usr/lib/libssl.so.0
    #8  0x40026c19 in ssl3_accept () from /usr/lib/libssl.so.0
    #9  0x40033842 in SSL_accept () from /usr/lib/libssl.so.0
    #10 0x4003056a in ssl23_get_client_hello () from /usr/lib/libssl.so.0
    #11 0x4002fd57 in ssl23_accept () from /usr/lib/libssl.so.0
    #12 0x40033842 in SSL_accept () from /usr/lib/libssl.so.0
    #13 0x80f136d in PostmasterMain ()
    #14 0x80cf372 in PacketReceiveFragment ()
    #15 0x80f10b1 in PostmasterMain ()
    #16 0x80f0ac4 in PostmasterMain ()
    #17 0x80cf7e8 in main ()
    #18 0x40258f31 in __libc_start_main (main=0x80cf680 <main>, argc=5,
        ubp_av=0xbffffab4, init=0x8065e04 <_init>, fini=0x8154d60 <_fini>,
        rtld_fini=0x4000e274 <_dl_fini>, stack_end=0xbffffaac)
        at ../sysdeps/generic/libc-start.c:129

It did indeed seem to be initiated by some odd networking problems in the path to the machine that makes SSL postgres connections to the postgres server. We were running openssl 0.9.5 on that machine. I've upgraded it to 0.9.6 for now and am going to start looking to see whether this was a known bug in 0.9.5.

Thanks for all your help!

Gavin Scott
gavin@pokerpages.com
Gavin Scott <gavin@pokerpages.com> writes:
> Just happened again this morning. It turns out this looks like an
> openssl problem, not a postgresql one:

Hmm. It'd be worth your while to update to Postgres 7.2. In 7.2 the postmaster forks a child process before it starts to do anything with the connection, so hangups in SSL can't lock the whole postmaster, only the one connection. Or that's the theory, anyway.

			regards, tom lane
On 28 Feb 2002, Gavin Scott wrote:
> On Mon, 2002-02-25 at 16:59, Tom Lane wrote:
> > Gavin Scott <gavin@pokerpages.com> writes:
> > > We have lately begun having problems with our production database
> > > running postgres 7.1 on linux kernel v 2.4.17. The system had run
> > > without incident for many months (there were occasional reboots). Since
> > > we upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problem
> > > until Feb 13, when postmaster appeared to stop taking new incoming
> > > connections. We restarted and then the problem struck again Saturday
> > > night (Feb 23).
> >
> > If it happens again, could you attach to the postmaster with gdb and get
> > a stack trace from it?
>
> Just happened again this morning. It turns out this looks like an
> openssl problem, not a postgresql one:

Just for the record: I once had the problem that I couldn't connect to postgres after 7 days. A restart fixed this for 7 days more.

Solution: I had a "delete all files in /tmp that are older than 7 days" cron job that deleted the postgres fifo/socket.