BUG #3504: Some listening sessions never return from writing, problems ensue - Mailing list pgsql-bugs

From Peter Koczan
Subject BUG #3504: Some listening sessions never return from writing, problems ensue
Date
Msg-id 200708012119.l71LJeOg031089@wwwmaster.postgresql.org
Whole thread Raw
Responses Re: BUG #3504: Some listening sessions never return from writing, problems ensue
List pgsql-bugs
The following bug has been logged online:

Bug reference:      3504
Logged by:          Peter Koczan
Email address:      pjkoczan@gmail.com
PostgreSQL version: 8.2.4
Operating system:   CentOS 4.5 Linux (RHEL 4), kernel 2.6.9-55.ELsmp
Description:        Some listening sessions never return from writing,
problems ensue
Details:

There is a problem where connections listening for asynchronous notifies
occasionally block for writing on ther server side and never finish,
resulting in connections that always have the status "notify interrupt".
Apparently, this causes no ill effects for the application that's listening
(i.e. it still updates somehow), but no data is written to the connection
when it should be. It also becomes a problem for restarting the server since
postgres cannot kill the connections when they're in that state. I either
have to explicitly kill the connections, kill the client apps, or reboot the
server. If I try to restart postgres, it kills all but the "notify
interrupt" connections, but it doesn't shut down so I can't restart it short
of rebooting.

Here are stack traces from listening sessions, the first is ok, the second
is blocked.

-----------------------------------------------------

OK:
[root@vero ~]# ps axvw | grep 30728
30728 ?        Ss     0:00      0  3265 42526 29240  0.7 postgres: koczan
csdb ator.cs.wisc.edu(38753) idle

[root@vero ~]# gdb -p 30728
...Standard loading messages...
(gdb) bt
#0  0x003bc7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x001bac93 in __read_nocancel () from /lib/tls/libc.so.6
#2  0x00e81e2e in sock_read () from
/s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e
#3  0x00000008 in ?? ()
#4  0x08c40068 in ?? ()
#5  0x00000005 in ?? ()
#6  0x00f160e8 in ?? () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e
#7  0x08c3a310 in ?? ()
#8  0x00000000 in ?? ()


(This stack trace is of the last 3 updates, and is farily representative of
what I've seen).
[root@vero ~]# strace -p 30728
read(8, 0x8c40068, 5)                   = ? ERESTARTSYS (To be restarted)
--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0
semop(1081358, 0xbfe20b9c, 1)           = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
semop(1081358, 0xbfe20b6c, 1)           = 0
_llseek(6, 0, [6078464], SEEK_END)      = 0
time(NULL)                              = 1185998905
_llseek(5, 6569984, [6569984], SEEK_SET) = 0
write(5, "^\320\1\0\1\0\0\0!\0\0\0\0@d\225\34\0\0\0\10\0\0\0(\372"..., 8192)
= 8192
fdatasync(5)                            = 0
semop(1048589, 0xbfe20adc, 1)           = 0
semop(1048589, 0xbfe20b2c, 1)           = 0
write(8, "\27\3\1\0 \221\363{!\235\355\317\223\260\253\tY\242\350"..., 90) =
90
sigreturn()                             = ? (mask now [])
read(8, 0x8c40068, 5)                   = ? ERESTARTSYS (To be restarted)
--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0
semop(1081358, 0xbfe20b9c, 1)           = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
semop(1081358, 0xbfe20b6c, 1)           = 0
_llseek(6, 0, [6078464], SEEK_END)      = 0
time(NULL)                              = 1185998905
write(5, "^\320\1\0\1\0\0\0!\0\0\0\0`d\225H\0\0\0req\0 \272\301\10"...,
8192) = 8192
fdatasync(5)                            = 0
semop(1212434, 0xbfe20adc, 1)           = 0
write(8, "\27\3\1\0 \'\6\235\323\2\33\321\216R\370\277i\304\217h"..., 90) =
90
sigreturn()                             = ? (mask now [])
read(8, 0x8c40068, 5)                   = ? ERESTARTSYS (To be restarted)
--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0
semop(1081358, 0xbfe20b9c, 1)           = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
_llseek(6, 0, [6078464], SEEK_END)      = 0
time(NULL)                              = 1186000276
_llseek(5, 9453568, [9453568], SEEK_SET) = 0
write(5, "^\320\1\0\1\0\0\0!\0\0\0\0@\220\225r\4\0\0\0\0\0\0\0\0"..., 8192)
= 8192
fdatasync(5)                            = 0
semop(1081358, 0xbfe20adc, 1)           = 0
write(8, "\27\3\1\0 \301\307_\375=\0032\243\212\215\362\3457\335"..., 90) =
90
sigreturn()                             = ? (mask now [])
read(8, 0x8c40068, 5)                   = ? ERESTARTSYS (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
semop(1081358, 0xbfe20d2c, 1)           = 0
semop(1081358, 0xbfe20d2c, 1)           = 0
semop(1081358, 0xbfe20d2c, 1)           = 0
sigreturn()                             = ? (mask now [])
read(8,

-----------------------------------------------------

Blocked process:
[root@vero ~]# ps axvw | grep 19583
19583 ?        Ss     0:02      0  3265 42518 29368  0.7 postgres: craft
csdb zim.cs.wisc.edu(32777) notify interrupt

[root@vero ~]# gdb -p 19583
...Standard loading messages...
(gdb) bt
#0  0x003bc7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x001bad13 in __write_nocancel () from /lib/tls/libc.so.6
#2  0x00e81ec4 in sock_write () from
/s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e
#3  0x00000008 in ?? ()
#4  0x08c4484a in ?? ()
#5  0x00000038 in ?? ()
#6  0x00000030 in ?? ()
#7  0x00f160e8 in ?? () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e
#8  0x08c3a7f0 in ?? ()
#9  0x00000000 in ?? ()

(Stack trace from the same amount of time, same 3 updates)
[root@vero ~]# strace -p 19583
Process 19583 attached - interrupt to quit
write(8, "\224\337\252\27\3\1\0000\247W\250\2-\307;\204w\276\242"..., 56) =
? ERESTARTSYS (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
sigreturn()                             = ? (mask now [USR2])
write(8, "\224\337\252\27\3\1\0000\247W\250\2-\307;\204w\276\242"..., 56

From my best guess, looks like something's weird with semop, but I don't
know.

Note: I'm compiling from source with PAM, krb5 (1.5.3), and openssl
(0.9.8e).

pgsql-bugs by date:

Previous
From: Gregory Stark
Date:
Subject: Re: Oddities with ANYARRAY
Next
From: Heikki Linnakangas
Date:
Subject: Re: BUG #3504: Some listening sessions never return from writing, problems ensue