BUG #3504: Some listening sessions never return from writing, problems ensue - Mailing list pgsql-bugs
From | Peter Koczan |
---|---|
Subject | BUG #3504: Some listening sessions never return from writing, problems ensue |
Date | |
Msg-id | 200708012119.l71LJeOg031089@wwwmaster.postgresql.org Whole thread Raw |
Responses |
Re: BUG #3504: Some listening sessions never return from writing,
problems ensue
|
List | pgsql-bugs |
The following bug has been logged online: Bug reference: 3504 Logged by: Peter Koczan Email address: pjkoczan@gmail.com PostgreSQL version: 8.2.4 Operating system: CentOS 4.5 Linux (RHEL 4), kernel 2.6.9-55.ELsmp Description: Some listening sessions never return from writing, problems ensue Details: There is a problem where connections listening for asynchronous notifies occasionally block for writing on ther server side and never finish, resulting in connections that always have the status "notify interrupt". Apparently, this causes no ill effects for the application that's listening (i.e. it still updates somehow), but no data is written to the connection when it should be. It also becomes a problem for restarting the server since postgres cannot kill the connections when they're in that state. I either have to explicitly kill the connections, kill the client apps, or reboot the server. If I try to restart postgres, it kills all but the "notify interrupt" connections, but it doesn't shut down so I can't restart it short of rebooting. Here are stack traces from listening sessions, the first is ok, the second is blocked. ----------------------------------------------------- OK: [root@vero ~]# ps axvw | grep 30728 30728 ? Ss 0:00 0 3265 42526 29240 0.7 postgres: koczan csdb ator.cs.wisc.edu(38753) idle [root@vero ~]# gdb -p 30728 ...Standard loading messages... (gdb) bt #0 0x003bc7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2 #1 0x001bac93 in __read_nocancel () from /lib/tls/libc.so.6 #2 0x00e81e2e in sock_read () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e #3 0x00000008 in ?? () #4 0x08c40068 in ?? () #5 0x00000005 in ?? () #6 0x00f160e8 in ?? () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e #7 0x08c3a310 in ?? () #8 0x00000000 in ?? () (This stack trace is of the last 3 updates, and is farily representative of what I've seen). [root@vero ~]# strace -p 30728 read(8, 0x8c40068, 5) = ? ERESTARTSYS (To be restarted) --- SIGUSR2 (User defined signal 2) @ 0 (0) --- setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0 semop(1081358, 0xbfe20b9c, 1) = 0 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 semop(1081358, 0xbfe20b6c, 1) = 0 _llseek(6, 0, [6078464], SEEK_END) = 0 time(NULL) = 1185998905 _llseek(5, 6569984, [6569984], SEEK_SET) = 0 write(5, "^\320\1\0\1\0\0\0!\0\0\0\0@d\225\34\0\0\0\10\0\0\0(\372"..., 8192) = 8192 fdatasync(5) = 0 semop(1048589, 0xbfe20adc, 1) = 0 semop(1048589, 0xbfe20b2c, 1) = 0 write(8, "\27\3\1\0 \221\363{!\235\355\317\223\260\253\tY\242\350"..., 90) = 90 sigreturn() = ? (mask now []) read(8, 0x8c40068, 5) = ? ERESTARTSYS (To be restarted) --- SIGUSR2 (User defined signal 2) @ 0 (0) --- setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0 semop(1081358, 0xbfe20b9c, 1) = 0 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 semop(1081358, 0xbfe20b6c, 1) = 0 _llseek(6, 0, [6078464], SEEK_END) = 0 time(NULL) = 1185998905 write(5, "^\320\1\0\1\0\0\0!\0\0\0\0`d\225H\0\0\0req\0 \272\301\10"..., 8192) = 8192 fdatasync(5) = 0 semop(1212434, 0xbfe20adc, 1) = 0 write(8, "\27\3\1\0 \'\6\235\323\2\33\321\216R\370\277i\304\217h"..., 90) = 90 sigreturn() = ? (mask now []) read(8, 0x8c40068, 5) = ? ERESTARTSYS (To be restarted) --- SIGUSR2 (User defined signal 2) @ 0 (0) --- select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0 semop(1081358, 0xbfe20b9c, 1) = 0 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 _llseek(6, 0, [6078464], SEEK_END) = 0 time(NULL) = 1186000276 _llseek(5, 9453568, [9453568], SEEK_SET) = 0 write(5, "^\320\1\0\1\0\0\0!\0\0\0\0@\220\225r\4\0\0\0\0\0\0\0\0"..., 8192) = 8192 fdatasync(5) = 0 semop(1081358, 0xbfe20adc, 1) = 0 write(8, "\27\3\1\0 \301\307_\375=\0032\243\212\215\362\3457\335"..., 90) = 90 sigreturn() = ? (mask now []) read(8, 0x8c40068, 5) = ? ERESTARTSYS (To be restarted) --- SIGUSR1 (User defined signal 1) @ 0 (0) --- semop(1081358, 0xbfe20d2c, 1) = 0 semop(1081358, 0xbfe20d2c, 1) = 0 semop(1081358, 0xbfe20d2c, 1) = 0 sigreturn() = ? (mask now []) read(8, ----------------------------------------------------- Blocked process: [root@vero ~]# ps axvw | grep 19583 19583 ? Ss 0:02 0 3265 42518 29368 0.7 postgres: craft csdb zim.cs.wisc.edu(32777) notify interrupt [root@vero ~]# gdb -p 19583 ...Standard loading messages... (gdb) bt #0 0x003bc7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2 #1 0x001bad13 in __write_nocancel () from /lib/tls/libc.so.6 #2 0x00e81ec4 in sock_write () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e #3 0x00000008 in ?? () #4 0x08c4484a in ?? () #5 0x00000038 in ?? () #6 0x00000030 in ?? () #7 0x00f160e8 in ?? () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e #8 0x08c3a7f0 in ?? () #9 0x00000000 in ?? () (Stack trace from the same amount of time, same 3 updates) [root@vero ~]# strace -p 19583 Process 19583 attached - interrupt to quit write(8, "\224\337\252\27\3\1\0000\247W\250\2-\307;\204w\276\242"..., 56) = ? ERESTARTSYS (To be restarted) --- SIGUSR1 (User defined signal 1) @ 0 (0) --- sigreturn() = ? (mask now [USR2]) write(8, "\224\337\252\27\3\1\0000\247W\250\2-\307;\204w\276\242"..., 56 From my best guess, looks like something's weird with semop, but I don't know. Note: I'm compiling from source with PAM, krb5 (1.5.3), and openssl (0.9.8e).
pgsql-bugs by date: