Thread: BUG #3504: Some listening sessions never return from writing, problems ensue
BUG #3504: Some listening sessions never return from writing, problems ensue
From
"Peter Koczan"
Date:
The following bug has been logged online: Bug reference: 3504 Logged by: Peter Koczan Email address: pjkoczan@gmail.com PostgreSQL version: 8.2.4 Operating system: CentOS 4.5 Linux (RHEL 4), kernel 2.6.9-55.ELsmp Description: Some listening sessions never return from writing, problems ensue Details: There is a problem where connections listening for asynchronous notifies occasionally block for writing on ther server side and never finish, resulting in connections that always have the status "notify interrupt". Apparently, this causes no ill effects for the application that's listening (i.e. it still updates somehow), but no data is written to the connection when it should be. It also becomes a problem for restarting the server since postgres cannot kill the connections when they're in that state. I either have to explicitly kill the connections, kill the client apps, or reboot the server. If I try to restart postgres, it kills all but the "notify interrupt" connections, but it doesn't shut down so I can't restart it short of rebooting. Here are stack traces from listening sessions, the first is ok, the second is blocked. ----------------------------------------------------- OK: [root@vero ~]# ps axvw | grep 30728 30728 ? Ss 0:00 0 3265 42526 29240 0.7 postgres: koczan csdb ator.cs.wisc.edu(38753) idle [root@vero ~]# gdb -p 30728 ...Standard loading messages... (gdb) bt #0 0x003bc7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2 #1 0x001bac93 in __read_nocancel () from /lib/tls/libc.so.6 #2 0x00e81e2e in sock_read () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e #3 0x00000008 in ?? () #4 0x08c40068 in ?? () #5 0x00000005 in ?? () #6 0x00f160e8 in ?? () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e #7 0x08c3a310 in ?? () #8 0x00000000 in ?? () (This stack trace is of the last 3 updates, and is farily representative of what I've seen). [root@vero ~]# strace -p 30728 read(8, 0x8c40068, 5) = ? ERESTARTSYS (To be restarted) --- SIGUSR2 (User defined signal 2) @ 0 (0) --- setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0 semop(1081358, 0xbfe20b9c, 1) = 0 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 semop(1081358, 0xbfe20b6c, 1) = 0 _llseek(6, 0, [6078464], SEEK_END) = 0 time(NULL) = 1185998905 _llseek(5, 6569984, [6569984], SEEK_SET) = 0 write(5, "^\320\1\0\1\0\0\0!\0\0\0\0@d\225\34\0\0\0\10\0\0\0(\372"..., 8192) = 8192 fdatasync(5) = 0 semop(1048589, 0xbfe20adc, 1) = 0 semop(1048589, 0xbfe20b2c, 1) = 0 write(8, "\27\3\1\0 \221\363{!\235\355\317\223\260\253\tY\242\350"..., 90) = 90 sigreturn() = ? (mask now []) read(8, 0x8c40068, 5) = ? ERESTARTSYS (To be restarted) --- SIGUSR2 (User defined signal 2) @ 0 (0) --- setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0 semop(1081358, 0xbfe20b9c, 1) = 0 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 semop(1081358, 0xbfe20b6c, 1) = 0 _llseek(6, 0, [6078464], SEEK_END) = 0 time(NULL) = 1185998905 write(5, "^\320\1\0\1\0\0\0!\0\0\0\0`d\225H\0\0\0req\0 \272\301\10"..., 8192) = 8192 fdatasync(5) = 0 semop(1212434, 0xbfe20adc, 1) = 0 write(8, "\27\3\1\0 \'\6\235\323\2\33\321\216R\370\277i\304\217h"..., 90) = 90 sigreturn() = ? (mask now []) read(8, 0x8c40068, 5) = ? ERESTARTSYS (To be restarted) --- SIGUSR2 (User defined signal 2) @ 0 (0) --- select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0 semop(1081358, 0xbfe20b9c, 1) = 0 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 _llseek(6, 0, [6078464], SEEK_END) = 0 time(NULL) = 1186000276 _llseek(5, 9453568, [9453568], SEEK_SET) = 0 write(5, "^\320\1\0\1\0\0\0!\0\0\0\0@\220\225r\4\0\0\0\0\0\0\0\0"..., 8192) = 8192 fdatasync(5) = 0 semop(1081358, 0xbfe20adc, 1) = 0 write(8, "\27\3\1\0 \301\307_\375=\0032\243\212\215\362\3457\335"..., 90) = 90 sigreturn() = ? (mask now []) read(8, 0x8c40068, 5) = ? ERESTARTSYS (To be restarted) --- SIGUSR1 (User defined signal 1) @ 0 (0) --- semop(1081358, 0xbfe20d2c, 1) = 0 semop(1081358, 0xbfe20d2c, 1) = 0 semop(1081358, 0xbfe20d2c, 1) = 0 sigreturn() = ? (mask now []) read(8, ----------------------------------------------------- Blocked process: [root@vero ~]# ps axvw | grep 19583 19583 ? Ss 0:02 0 3265 42518 29368 0.7 postgres: craft csdb zim.cs.wisc.edu(32777) notify interrupt [root@vero ~]# gdb -p 19583 ...Standard loading messages... (gdb) bt #0 0x003bc7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2 #1 0x001bad13 in __write_nocancel () from /lib/tls/libc.so.6 #2 0x00e81ec4 in sock_write () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e #3 0x00000008 in ?? () #4 0x08c4484a in ?? () #5 0x00000038 in ?? () #6 0x00000030 in ?? () #7 0x00f160e8 in ?? () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e #8 0x08c3a7f0 in ?? () #9 0x00000000 in ?? () (Stack trace from the same amount of time, same 3 updates) [root@vero ~]# strace -p 19583 Process 19583 attached - interrupt to quit write(8, "\224\337\252\27\3\1\0000\247W\250\2-\307;\204w\276\242"..., 56) = ? ERESTARTSYS (To be restarted) --- SIGUSR1 (User defined signal 1) @ 0 (0) --- sigreturn() = ? (mask now [USR2]) write(8, "\224\337\252\27\3\1\0000\247W\250\2-\307;\204w\276\242"..., 56 From my best guess, looks like something's weird with semop, but I don't know. Note: I'm compiling from source with PAM, krb5 (1.5.3), and openssl (0.9.8e).
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
Heikki Linnakangas
Date:
Peter Koczan wrote: > There is a problem where connections listening for asynchronous notifies > occasionally block for writing on ther server side and never finish, > resulting in connections that always have the status "notify interrupt". > Apparently, this causes no ill effects for the application that's listening > (i.e. it still updates somehow), but no data is written to the connection > when it should be. It also becomes a problem for restarting the server since > postgres cannot kill the connections when they're in that state. I either > have to explicitly kill the connections, kill the client apps, or reboot the > server. If I try to restart postgres, it kills all but the "notify > interrupt" connections, but it doesn't shut down so I can't restart it short > of rebooting. Does the client read the async notifies? The write in the server will block if the client doesn't keep up with reading the data. > ... > > (This stack trace is of the last 3 updates, and is farily representative of > what I've seen). > [root@vero ~]# strace -p 30728 That's actually not a stack trace, but a system call trace. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
Tom Lane
Date:
1Heikki Linnakangas <heikki@enterprisedb.com> writes: > Peter Koczan wrote: >> There is a problem where connections listening for asynchronous notifies >> occasionally block for writing on ther server side and never finish, > Does the client read the async notifies? The write in the server will > block if the client doesn't keep up with reading the data. Loss of network connectivity to the client (eg, timeout of a NAT mapping) seems like another plausible explanation. regards, tom lane
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
Peter Koczan
Date:
Heikki Linnakangas wrote: > Peter Koczan wrote: > >> There is a problem where connections listening for asynchronous notifies >> occasionally block for writing on ther server side and never finish, >> resulting in connections that always have the status "notify interrupt". >> Apparently, this causes no ill effects for the application that's listening >> (i.e. it still updates somehow), but no data is written to the connection >> when it should be. It also becomes a problem for restarting the server since >> postgres cannot kill the connections when they're in that state. I either >> have to explicitly kill the connections, kill the client apps, or reboot the >> server. If I try to restart postgres, it kills all but the "notify >> interrupt" connections, but it doesn't shut down so I can't restart it short >> of rebooting. >> > > Does the client read the async notifies? The write in the server will > block if the client doesn't keep up with reading the data. > > Well, the client updates it's GUI properly no matter whether the listening session is blocked or not (it's an ongoing tail of requests). Overtly from the client side, it's completely transparent. Is there any way I can tell definitively from the client side whether or not it's actually reading the async notifies? >> ... >> >> (This stack trace is of the last 3 updates, and is farily representative of >> what I've seen). >> [root@vero ~]# strace -p 30728 >> > > That's actually not a stack trace, but a system call trace. > > D'oh. sorry, still a little new at using these tools for debugging.
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
Peter Koczan
Date:
Tom Lane wrote: > 1Heikki Linnakangas <heikki@enterprisedb.com> writes: > >> Peter Koczan wrote: >> >>> There is a problem where connections listening for asynchronous notifies >>> occasionally block for writing on ther server side and never finish, >>> > > >> Does the client read the async notifies? The write in the server will >> block if the client doesn't keep up with reading the data. >> > > Loss of network connectivity to the client (eg, timeout of a NAT > mapping) seems like another plausible explanation. > > regards, tom lane > > It's plausible, but unlikely for our network setup. We rarely use NAT and if the client lost its network, this particular application (it's Perl with GTK and DBD::Pg, by the way) tends to hang since it doesn't try to re-establish a connection to the DB short of killing and restarting it. Peter
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
Tom Lane
Date:
Peter Koczan <pjkoczan@gmail.com> writes: > Heikki Linnakangas wrote: >> Does the client read the async notifies? The write in the server will >> block if the client doesn't keep up with reading the data. >> > Well, the client updates it's GUI properly no matter whether the > listening session is blocked or not (it's an ongoing tail of requests). > Overtly from the client side, it's completely transparent. Is there any > way I can tell definitively from the client side whether or not it's > actually reading the async notifies? netstat's Recv-Q column would probably reflect it if the client is failing to consume messages. The send queue on the server side would be worth checking too. regards, tom lane
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
"Peter Koczan"
Date:
I found something pretty interesting when running netstat's: Before a series of 3 async notifies (second column is Recv-Q): OK connections: [koczan@ator] koczan $ netstat | grep vero tcp 180 0 ator.cs.wisc.edu:32785 vero.cs.wisc.edu:postgresqlESTABLISHED tcp 0 0 ator.cs.wisc.edu:32784 vero.cs.wisc.edu:postgresqlESTABLISHED [koczan@ator] koczan $ ssh newton netstat | grep vero tcp 64260 0 newton.cs.wisc.edu:32778 vero.cs.wisc.edu:postgresqlESTABLISHED tcp 0 0 newton.cs.wisc.edu:32777 vero.cs.wisc.edu:postgresqlESTABLISHED "notify interrupt" connections: [koczan@ator] koczan $ ssh brian netstat | grep vero tcp 0 0 brian.cs.wisc.edu:40365 vero.cs.wisc.edu:postgresqlESTABLISHED tcp 77800 0 brian.cs.wisc.edu:40366 vero.cs.wisc.edu:postgresqlESTABLISHED [koczan@ator] koczan $ ssh zim netstat | grep vero tcp 73402 0 zim.cs.wisc.edu:32777 vero.cs.wisc.edu:postgresqlESTABLISHED tcp 0 0 zim.cs.wisc.edu:32776 vero.cs.wisc.edu:postgresqlESTABLISHED and after said notifies: OK connections: [koczan@ator] koczan $ netstat | grep vero tcp 450 0 ator.cs.wisc.edu:32785 vero.cs.wisc.edu:postgresqlESTABLISHED tcp 0 0 ator.cs.wisc.edu:32784 vero.cs.wisc.edu:postgresqlESTABLISHED [koczan@ator] koczan $ ssh newton netstat | grep vero tcp 64260 0 newton.cs.wisc.edu:32778 vero.cs.wisc.edu:postgresqlESTABLISHED tcp 0 0 newton.cs.wisc.edu:32777 vero.cs.wisc.edu:postgresqlESTABLISHED "notify interrupt" connections: [koczan@ator] koczan $ ssh brian netstat | grep vero tcp 0 0 brian.cs.wisc.edu:40365 vero.cs.wisc.edu:postgresqlESTABLISHED tcp 77800 0 brian.cs.wisc.edu:40366 vero.cs.wisc.edu:postgresqlESTABLISHED [koczan@ator] koczan $ ssh zim netstat | grep vero tcp 73402 0 zim.cs.wisc.edu:32777 vero.cs.wisc.edu:postgresqlESTABLISHED tcp 0 0 zim.cs.wisc.edu:32776 vero.cs.wisc.edu:postgresqlESTABLISHED And on the server side things get a little more interesting (Send-Q is the 3rd column)... OK: [koczan@vero] ~ $ netstat | grep ator tcp 0 0 vero.cs.wisc.edu:postgresql ator.cs.wisc.edu:32785 ESTABLISHED tcp 0 0 vero.cs.wisc.edu:postgresql ator.cs.wisc.edu:32784 ESTABLISHED [koczan@vero] ~ $ netstat | grep newton tcp 0 0 vero.cs.wisc.edu:postgresql newton.cs.wisc.edu:32778 ESTABLISHED tcp 0 0 vero.cs.wisc.edu:postgresql newton.cs.wisc.edu:32777 ESTABLISHED "notify interrupt": [koczan@vero] ~ $ netstat | grep brian tcp 0 0 vero.cs.wisc.edu:postgresql brian.cs.wisc.edu:40365 ESTABLISHED tcp 0 13032 vero.cs.wisc.edu:postgresql brian.cs.wisc.edu:40366 ESTABLISHED [koczan@vero] ~ $ netstat | grep zim tcp 0 13032 vero.cs.wisc.edu:postgresql zim.cs.wisc.edu:32777 ESTABLISHED tcp 0 0 vero.cs.wisc.edu:postgresql zim.cs.wisc.edu:32776 ESTABLISHED A quick perusal of the other "notify interrupt" connections shows 13032 in the Send-Q column. They all got into this state at the same time. Peter P.S. Thanks for the help, I really appreciate it. On 8/2/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Peter Koczan <pjkoczan@gmail.com> writes: > > Heikki Linnakangas wrote: > >> Does the client read the async notifies? The write in the server will > >> block if the client doesn't keep up with reading the data. > >> > > Well, the client updates it's GUI properly no matter whether the > > listening session is blocked or not (it's an ongoing tail of requests). > > Overtly from the client side, it's completely transparent. Is there any > > way I can tell definitively from the client side whether or not it's > > actually reading the async notifies? > > netstat's Recv-Q column would probably reflect it if the client is > failing to consume messages. The send queue on the server side would be > worth checking too. > > regards, tom lane >
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
"Peter Koczan"
Date:
Whoops, a couple quick changes. This: [koczan@ator] koczan $ ssh newton netstat | grep vero tcp 64260 0 newton.cs.wisc.edu:32778 vero.cs.wisc.edu:postgresqlESTABLISHED tcp 0 0 newton.cs.wisc.edu:32777 vero.cs.wisc.edu:postgresqlESTABLISHED should be this: [koczan@ator] koczan $ ssh newton netstat | grep vero tcp 64530 0 newton.cs.wisc.edu:32778 vero.cs.wisc.edu:postgresqlESTABLISHED tcp 0 0 newton.cs.wisc.edu:32777 vero.cs.wisc.edu:postgresqlESTABLISHED Also, the Send-Q column on the server side doesn't change, even after more async notifies. They may not have happened at the same time, but it's a bit perplexing that they all have the exact same amount of data in the queue. Peter On 8/2/07, Peter Koczan <pjkoczan@gmail.com> wrote: > > I found something pretty interesting when running netstat's: > > Before a series of 3 async notifies (second column is Recv-Q): > > OK connections: > [koczan@ator] koczan $ netstat | grep vero > tcp 180 0 ator.cs.wisc.edu:32785 > vero.cs.wisc.edu:postgresql ESTABLISHED > tcp 0 0 ator.cs.wisc.edu:32784 > vero.cs.wisc.edu:postgresql ESTABLISHED > [koczan@ator] koczan $ ssh newton netstat | grep vero > tcp 64260 0 newton.cs.wisc.edu:32778 > vero.cs.wisc.edu:postgresql ESTABLISHED > tcp 0 0 newton.cs.wisc.edu:32777 > vero.cs.wisc.edu:postgresql ESTABLISHED > > "notify interrupt" connections: > [koczan@ator] koczan $ ssh brian netstat | grep vero > tcp 0 0 brian.cs.wisc.edu:40365 > vero.cs.wisc.edu:postgresql ESTABLISHED > tcp 77800 0 brian.cs.wisc.edu:40366 > vero.cs.wisc.edu:postgresql ESTABLISHED > [koczan@ator] koczan $ ssh zim netstat | grep vero > tcp 73402 0 zim.cs.wisc.edu:32777 > vero.cs.wisc.edu:postgresql ESTABLISHED > tcp 0 0 zim.cs.wisc.edu:32776 > vero.cs.wisc.edu:postgresql ESTABLISHED > > and after said notifies: > > OK connections: > [koczan@ator] koczan $ netstat | grep vero > tcp 450 0 ator.cs.wisc.edu:32785 > vero.cs.wisc.edu:postgresql ESTABLISHED > tcp 0 0 ator.cs.wisc.edu:32784 > vero.cs.wisc.edu:postgresql ESTABLISHED > [koczan@ator] koczan $ ssh newton netstat | grep vero > tcp 64260 0 newton.cs.wisc.edu:32778 > vero.cs.wisc.edu:postgresql ESTABLISHED > tcp 0 0 newton.cs.wisc.edu:32777 > vero.cs.wisc.edu:postgresql ESTABLISHED > > "notify interrupt" connections: > [koczan@ator] koczan $ ssh brian netstat | grep vero > tcp 0 0 brian.cs.wisc.edu:40365 > vero.cs.wisc.edu:postgresql ESTABLISHED > tcp 77800 0 brian.cs.wisc.edu:40366 > vero.cs.wisc.edu:postgresql ESTABLISHED > [koczan@ator] koczan $ ssh zim netstat | grep vero > tcp 73402 0 zim.cs.wisc.edu:32777 > vero.cs.wisc.edu:postgresql ESTABLISHED > tcp 0 0 zim.cs.wisc.edu:32776 > vero.cs.wisc.edu:postgresql ESTABLISHED > > And on the server side things get a little more interesting (Send-Q is the > 3rd column)... > > OK: > [koczan@vero] ~ $ netstat | grep ator > tcp 0 0 vero.cs.wisc.edu:postgresql ator.cs.wisc.edu:32785 > ESTABLISHED > tcp 0 0 vero.cs.wisc.edu:postgresql ator.cs.wisc.edu:32784 > ESTABLISHED > [koczan@vero] ~ $ netstat | grep newton > tcp 0 0 vero.cs.wisc.edu:postgresql newton.cs.wisc.edu:32778 > ESTABLISHED > tcp 0 0 vero.cs.wisc.edu:postgresql newton.cs.wisc.edu:32777 > ESTABLISHED > > "notify interrupt": > [koczan@vero] ~ $ netstat | grep brian > tcp 0 0 vero.cs.wisc.edu:postgresql brian.cs.wisc.edu:40365 > ESTABLISHED > tcp 0 13032 vero.cs.wisc.edu:postgresql brian.cs.wisc.edu:40366 > ESTABLISHED > [koczan@vero] ~ $ netstat | grep zim > tcp 0 13032 vero.cs.wisc.edu:postgresql zim.cs.wisc.edu:32777 > ESTABLISHED > tcp 0 0 vero.cs.wisc.edu:postgresql zim.cs.wisc.edu:32776 > ESTABLISHED > > A quick perusal of the other "notify interrupt" connections shows 13032 in > the Send-Q column. They all got into this state at the same time. > > Peter > > P.S. Thanks for the help, I really appreciate it. > > On 8/2/07, Tom Lane <tgl@sss.pgh.pa.us > wrote: > > > > Peter Koczan <pjkoczan@gmail.com > writes: > > > Heikki Linnakangas wrote: > > >> Does the client read the async notifies? The write in the server will > > >> block if the client doesn't keep up with reading the data. > > >> > > > Well, the client updates it's GUI properly no matter whether the > > > listening session is blocked or not (it's an ongoing tail of > > requests). > > > Overtly from the client side, it's completely transparent. Is there > > any > > > way I can tell definitively from the client side whether or not it's > > > actually reading the async notifies? > > > > netstat's Recv-Q column would probably reflect it if the client is > > failing to consume messages. The send queue on the server side would be > > worth checking too. > > > > regards, tom lane > > > >
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
Heikki Linnakangas
Date:
It looks to me that the server is trying to send stuff to the client, but the client is not reading it for some reason. There's nothing in the queues in the direction client->server. What exactly is the client doing? How does it connect to the server, with libpq? Is it possible that the client is just leaking the connection, keeping it open but not using it for anything? Peter Koczan wrote: > Whoops, a couple quick changes. This: > [koczan@ator] koczan $ ssh newton netstat | grep vero > tcp 64260 0 newton.cs.wisc.edu:32778 > vero.cs.wisc.edu:postgresqlESTABLISHED > tcp 0 0 newton.cs.wisc.edu:32777 > vero.cs.wisc.edu:postgresqlESTABLISHED > > should be this: > [koczan@ator] koczan $ ssh newton netstat | grep vero > tcp 64530 0 newton.cs.wisc.edu:32778 > vero.cs.wisc.edu:postgresqlESTABLISHED > tcp 0 0 newton.cs.wisc.edu:32777 > vero.cs.wisc.edu:postgresqlESTABLISHED > > Also, the Send-Q column on the server side doesn't change, even after more > async notifies. They may not have happened at the same time, but it's a bit > perplexing that they all have the exact same amount of data in the queue. > > Peter > > On 8/2/07, Peter Koczan <pjkoczan@gmail.com> wrote: >> I found something pretty interesting when running netstat's: >> >> Before a series of 3 async notifies (second column is Recv-Q): >> >> OK connections: >> [koczan@ator] koczan $ netstat | grep vero >> tcp 180 0 ator.cs.wisc.edu:32785 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> tcp 0 0 ator.cs.wisc.edu:32784 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> [koczan@ator] koczan $ ssh newton netstat | grep vero >> tcp 64260 0 newton.cs.wisc.edu:32778 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> tcp 0 0 newton.cs.wisc.edu:32777 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> >> "notify interrupt" connections: >> [koczan@ator] koczan $ ssh brian netstat | grep vero >> tcp 0 0 brian.cs.wisc.edu:40365 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> tcp 77800 0 brian.cs.wisc.edu:40366 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> [koczan@ator] koczan $ ssh zim netstat | grep vero >> tcp 73402 0 zim.cs.wisc.edu:32777 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> tcp 0 0 zim.cs.wisc.edu:32776 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> >> and after said notifies: >> >> OK connections: >> [koczan@ator] koczan $ netstat | grep vero >> tcp 450 0 ator.cs.wisc.edu:32785 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> tcp 0 0 ator.cs.wisc.edu:32784 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> [koczan@ator] koczan $ ssh newton netstat | grep vero >> tcp 64260 0 newton.cs.wisc.edu:32778 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> tcp 0 0 newton.cs.wisc.edu:32777 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> >> "notify interrupt" connections: >> [koczan@ator] koczan $ ssh brian netstat | grep vero >> tcp 0 0 brian.cs.wisc.edu:40365 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> tcp 77800 0 brian.cs.wisc.edu:40366 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> [koczan@ator] koczan $ ssh zim netstat | grep vero >> tcp 73402 0 zim.cs.wisc.edu:32777 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> tcp 0 0 zim.cs.wisc.edu:32776 >> vero.cs.wisc.edu:postgresql ESTABLISHED >> >> And on the server side things get a little more interesting (Send-Q is the >> 3rd column)... >> >> OK: >> [koczan@vero] ~ $ netstat | grep ator >> tcp 0 0 vero.cs.wisc.edu:postgresql ator.cs.wisc.edu:32785 >> ESTABLISHED >> tcp 0 0 vero.cs.wisc.edu:postgresql ator.cs.wisc.edu:32784 >> ESTABLISHED >> [koczan@vero] ~ $ netstat | grep newton >> tcp 0 0 vero.cs.wisc.edu:postgresql newton.cs.wisc.edu:32778 >> ESTABLISHED >> tcp 0 0 vero.cs.wisc.edu:postgresql newton.cs.wisc.edu:32777 >> ESTABLISHED >> >> "notify interrupt": >> [koczan@vero] ~ $ netstat | grep brian >> tcp 0 0 vero.cs.wisc.edu:postgresql brian.cs.wisc.edu:40365 >> ESTABLISHED >> tcp 0 13032 vero.cs.wisc.edu:postgresql brian.cs.wisc.edu:40366 >> ESTABLISHED >> [koczan@vero] ~ $ netstat | grep zim >> tcp 0 13032 vero.cs.wisc.edu:postgresql zim.cs.wisc.edu:32777 >> ESTABLISHED >> tcp 0 0 vero.cs.wisc.edu:postgresql zim.cs.wisc.edu:32776 >> ESTABLISHED >> >> A quick perusal of the other "notify interrupt" connections shows 13032 in >> the Send-Q column. They all got into this state at the same time. >> >> Peter >> >> P.S. Thanks for the help, I really appreciate it. >> >> On 8/2/07, Tom Lane <tgl@sss.pgh.pa.us > wrote: >>> Peter Koczan <pjkoczan@gmail.com > writes: >>>> Heikki Linnakangas wrote: >>>>> Does the client read the async notifies? The write in the server will >>>>> block if the client doesn't keep up with reading the data. >>>>> >>>> Well, the client updates it's GUI properly no matter whether the >>>> listening session is blocked or not (it's an ongoing tail of >>> requests). >>>> Overtly from the client side, it's completely transparent. Is there >>> any >>>> way I can tell definitively from the client side whether or not it's >>>> actually reading the async notifies? >>> netstat's Recv-Q column would probably reflect it if the client is >>> failing to consume messages. The send queue on the server side would be >>> worth checking too. >>> >>> regards, tom lane >>> >> > -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
Tom Lane
Date:
"Peter Koczan" <pjkoczan@gmail.com> writes: > A quick perusal of the other "notify interrupt" connections shows 13032 in > the Send-Q column. They all got into this state at the same time. That's far too much data to be a notify message, or even a small number of notify messages. Is it possible that the client's failed to collect hundreds of notify events? But in any case this clearly lets the backend off the hook. The problem is on the client side: it's not reading the data. regards, tom lane
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
"Peter Koczan"
Date:
Either it's not reading the notifies or it's not telling the server that it's read them and flushing out the pipe. This app uses Perl/DBD::Pg, and the code is basically: while ($dbh->func('pg_notifies')) { # do stuff } The corresponding C code in DBD::Pg appears to be: /* ================================================================== */ SV * dbd_db_pg_notifies (dbh, imp_dbh) SV *dbh; imp_dbh_t *imp_dbh; { PGnotify *notify; AV *ret; SV *retsv; int status; if (dbis->debug >= 3) { (void)PerlIO_printf(DBILOGFP, "dbdpg: dbd_db_pg_notifies\n"); } status = PQconsumeInput(imp_dbh->conn); if (0 == status) { status = PQstatus(imp_dbh->conn); pg_error(dbh, status, PQerrorMessage(imp_dbh->conn)); return &sv_undef; } notify = PQnotifies(imp_dbh->conn); if (!notify) return &sv_undef; ret=newAV(); av_push(ret, newSVpv(notify->relname,0) ); av_push(ret, newSViv(notify->be_pid) ); #if PGLIBVERSION >= 70400 PQfreemem(notify); #else Safefree(notify); #endif retsv = newRV(sv_2mortal((SV*)ret)); return retsv; } /* end of dbd_db_pg_notifies */ On 8/2/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > "Peter Koczan" <pjkoczan@gmail.com> writes: > > A quick perusal of the other "notify interrupt" connections shows 13032 > in > > the Send-Q column. They all got into this state at the same time. > > That's far too much data to be a notify message, or even a small number > of notify messages. Is it possible that the client's failed to collect > hundreds of notify events? > > But in any case this clearly lets the backend off the hook. The problem > is on the client side: it's not reading the data. > > regards, tom lane >
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
Tom Lane
Date:
"Peter Koczan" <pjkoczan@gmail.com> writes: > This app uses Perl/DBD::Pg, and the code is basically: > while ($dbh->func('pg_notifies')) { > # do stuff > } If I'm not mistaken, that loop would process all the currently available notify messages and then fall out. So the real question is what's around this --- what conditions will make the program come back to this loop periodically? Or not do so? regards, tom lane
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
Peter Koczan
Date:
I think the code is somehow interrupt based, but I don't know. I've asked the primary maintainer of the program to look into this with me. We should have more insight by early next week. Peter Tom Lane wrote: > "Peter Koczan" <pjkoczan@gmail.com> writes: > >> This app uses Perl/DBD::Pg, and the code is basically: >> > > >> while ($dbh->func('pg_notifies')) { >> # do stuff >> } >> > > If I'm not mistaken, that loop would process all the currently available > notify messages and then fall out. So the real question is what's > around this --- what conditions will make the program come back to this > loop periodically? Or not do so? >
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
Tom Lane
Date:
"Peter Koczan" <pjkoczan@gmail.com> writes: > So, it looks like the app polls for async notifies every second. Hopefully > you can glean some insight from that. My guess is that some condition causes that to stop happening. One way to monitor this from outside is to use strace to watch the app. According to the above you should see a recv() attempt on the libpq socket every second. I suspect that you won't see any more once the app gets into the state where the recv-q is backing up. Looking at what happens around the time of the last recv() might give some clues. regards, tom lane
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
Tom Lane
Date:
"Peter Koczan" <pjkoczan@gmail.com> writes: > Here's my theory (and feel free to tell me that I'm full of it)...somehow, a > lot of notifies happened at once, or in a very short period of time, to the > point where the app was still processing notifies when the timer clicked off > another second. The connection (or app, or perl module) never marked those > notifies as being processed, or never updated its timestamp of when it > finished, so when the next notify came around, it tried to reprocess the old > data (or data since the last time it finished), and yet again couldn't > finish. Lather, rinse, repeat. In sum, it might be that trying to call > pg_notifies while processing notifies tickles a race condition and tricks > the connection into thinking its in a bad state. Hmm. Is the app trying to do this processing inside an interrupt service routine (a/k/a signal handler)? If so, and if the ISR can interrupt itself, then you've got a problem because you'll be doing reentrant calls of libpq, which it doesn't support. You can only make that work if the handler blocks further occurrences of its signal until it finishes. regards, tom lane
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
"Peter Koczan"
Date:
Here's what the primary maintainer of the app had to say about how notifies get read. This is in update_reqs in Gtkreq, which gets called by a Gtk timer every second. Then, as long as there are notifies, it will continue to read from the req log. There's a timer for every instance of the main window, so it looks like if you have more than one main window open, update_reqs would be called more often. That might have to be changed. So, it looks like the app polls for async notifies every second. Hopefully you can glean some insight from that. Peter On 8/3/07, Peter Koczan <pjkoczan@gmail.com> wrote: > > I think the code is somehow interrupt based, but I don't know. I've > asked the primary maintainer of the program to look into this with me. > We should have more insight by early next week. > > Peter > > Tom Lane wrote: > > "Peter Koczan" <pjkoczan@gmail.com> writes: > > > >> This app uses Perl/DBD::Pg, and the code is basically: > >> > > > > > >> while ($dbh->func('pg_notifies')) { > >> # do stuff > >> } > >> > > > > If I'm not mistaken, that loop would process all the currently available > > notify messages and then fall out. So the real question is what's > > around this --- what conditions will make the program come back to this > > loop periodically? Or not do so? > > >
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
"Peter Koczan"
Date:
Not quite, here's what happens every second when idling, for both good and bad connections (with a different values to each syscall). It is trying to read data, I am a little concerned about the EAGAIN return value, though. [koczan@ator] ~ $ strace -p 6987 poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}, {fd=8, events=POLLIN}], 3, 869) = 0 gettimeofday({1186445557, 631779}, NULL) = 0 read(5, 0x9651540, 5) = -1 EAGAIN (Resource temporarily unavailable) gettimeofday({1186445557, 632197}, NULL) = 0 ioctl(3, FIONREAD, [0]) = 0 poll(<unfinished ...> Process 6987 detached However, looking at straces when an async notify actually happens, I notice something interesting. The number of syscalls over the same interrupt on the bad connection is nearly 5 times that of the good connection. By the time the good connection is finished, the bad connection is still mucking with data. I can send you the actual straces if you think it would help. Here's my theory (and feel free to tell me that I'm full of it)...somehow, a lot of notifies happened at once, or in a very short period of time, to the point where the app was still processing notifies when the timer clicked off another second. The connection (or app, or perl module) never marked those notifies as being processed, or never updated its timestamp of when it finished, so when the next notify came around, it tried to reprocess the old data (or data since the last time it finished), and yet again couldn't finish. Lather, rinse, repeat. In sum, it might be that trying to call pg_notifies while processing notifies tickles a race condition and tricks the connection into thinking its in a bad state. The holes in my theory are that the app doesn't try to read everything every second for bad connections (so notifies are being processed somehow), and it doesn't explain why the recv-q is filling up even for good connections. Hopefully this will help somehow, though. Peter P.S. Just in case its useful, the contents of pg_listener for bad connections (I verified that notification = 0 for good connections). csdb=# select * from pg_listener where notification > 0; relname | listenerpid | notification ---------+-------------+-------------- req | 3292 | 19141 req | 25848 | 16308 req | 17004 | 23815 req | 19583 | 1703 (4 rows) On 8/6/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > "Peter Koczan" <pjkoczan@gmail.com> writes: > > So, it looks like the app polls for async notifies every second. > Hopefully > > you can glean some insight from that. > > My guess is that some condition causes that to stop happening. > > One way to monitor this from outside is to use strace to watch the app. > According to the above you should see a recv() attempt on the libpq > socket every second. I suspect that you won't see any more once the app > gets into the state where the recv-q is backing up. Looking at what > happens around the time of the last recv() might give some clues. > > regards, tom lane >
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
"Peter Koczan"
Date:
On 8/6/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Peter Koczan" <pjkoczan@gmail.com> writes: > > Here's my theory (and feel free to tell me that I'm full of it)...somehow, a > > lot of notifies happened at once, or in a very short period of time, to the > > point where the app was still processing notifies when the timer clicked off > > another second. The connection (or app, or perl module) never marked those > > notifies as being processed, or never updated its timestamp of when it > > finished, so when the next notify came around, it tried to reprocess the old > > data (or data since the last time it finished), and yet again couldn't > > finish. Lather, rinse, repeat. In sum, it might be that trying to call > > pg_notifies while processing notifies tickles a race condition and tricks > > the connection into thinking its in a bad state. > > Hmm. Is the app trying to do this processing inside an interrupt > service routine (a/k/a signal handler)? If so, and if the ISR can > interrupt itself, then you've got a problem because you'll be doing > reentrant calls of libpq, which it doesn't support. You can only make > that work if the handler blocks further occurrences of its signal until > it finishes. > I'm not entirely sure if this answers your question, but here's what I found out from the primary maintainer of the app. Note that update_reqs is the function calling pg_notifies. If there's more information I can provide or another test we can run, please let me know. ------- BEGIN MESSAGE ------- I just checked and the timer won't interrupt update_reqs, so we'll have to look for another solution. Anyway, update_reqs doesn't do anything with the database except for checking for a notify, so I don't see where it can be interrupted to cause DB problems. ------- END MESSAGE ------- I also found out that one notify gets sent per action (not per batch of actions), so if n requests get resolved at once, n notifies are sent, not 1. In theory this could mitigate this problem, but I don't know how easy it is at this point. Still, it doesn't explain how or why the client's recv-q isn't getting cleared. Hope this helps. Peter
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
"Peter Koczan"
Date:
On 8/9/07, Peter Koczan <pjkoczan@gmail.com> wrote: > On 8/6/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > "Peter Koczan" <pjkoczan@gmail.com> writes: > > > Here's my theory (and feel free to tell me that I'm full of it)...somehow, a > > > lot of notifies happened at once, or in a very short period of time, to the > > > point where the app was still processing notifies when the timer clicked off > > > another second. The connection (or app, or perl module) never marked those > > > notifies as being processed, or never updated its timestamp of when it > > > finished, so when the next notify came around, it tried to reprocess the old > > > data (or data since the last time it finished), and yet again couldn't > > > finish. Lather, rinse, repeat. In sum, it might be that trying to call > > > pg_notifies while processing notifies tickles a race condition and tricks > > > the connection into thinking its in a bad state. > > > > Hmm. Is the app trying to do this processing inside an interrupt > > service routine (a/k/a signal handler)? If so, and if the ISR can > > interrupt itself, then you've got a problem because you'll be doing > > reentrant calls of libpq, which it doesn't support. You can only make > > that work if the handler blocks further occurrences of its signal until > > it finishes. > > > > I'm not entirely sure if this answers your question, but here's what I > found out from the primary maintainer of the app. Note that > update_reqs is the function calling pg_notifies. If there's more > information I can provide or another test we can run, please let me > know. > > ------- BEGIN MESSAGE ------- > I just checked and the timer won't interrupt update_reqs, so we'll > have to look for another solution. Anyway, update_reqs doesn't do > anything with the database except for checking for a notify, so I > don't see where it can be interrupted to cause DB problems. > ------- END MESSAGE ------- > > I also found out that one notify gets sent per action (not per batch > of actions), so if n requests get resolved at once, n notifies are > sent, not 1. In theory this could mitigate this problem, but I don't > know how easy it is at this point. Still, it doesn't explain how or > why the client's recv-q isn't getting cleared. > > Hope this helps. > On our end, we changed the the code in the function calling pg_notifies to use an if statement rather than a while (that way it only updates once per second instead of continuously as long as there are pending async notifies). I looked more closely at the docs for DBD::Pg, and the pg_notifies call grabs *all* pending async notifies and returns them in a hash, not just one at a time. So, what was happening before was that if a new notify came through while processing the previous notifies, the code would reprocess. Lather, rinse, repeat. I think that if the program is waiting for pg_notifies when the timer interrupts it again, causing the client to call pg_notifies while still waiting for something. I think this is what gets the listening connection into the bad state. In theory this change should mitigate the "notify interrupt" behavior on our end, but, again, why the client's recv-q is filling up is as yet unexplained. Peter P.S. In src/backend/commands/async.c, somewhere between lines 910 and 981 (set_ps_display calls) is where the code gets interrupted. How and why, I don't know.
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
Tom Lane
Date:
"Peter Koczan" <pjkoczan@gmail.com> writes: > For instance, I can open up two connections on my test server and > illustrate the problem. Well, the fact that psql acts that way is not unexpected, because it doesn't monitor the database connection constantly --- when it's waiting for console input, it just sleeps. It's not so much "send a NOTIFY" that wakes it up, as "do anything at all with the database connection". What you've told us so far about your own application doesn't suggest that it would need to be told to send a notify before it remembered to check for input, but maybe there's some interaction there that you missed. regards, tom lane
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
"Peter Koczan"
Date:
Hey, I found something that finally clears netstat's recv-q with async notifies. I don't quite understand what's going on, so any enlightenment would be greatly appreciated. As near as I can figure, the client doesn't appear to read the notifies until it sends a notify of its own. Then it reads all pending notifies at once and clears the recv-q. For instance, I can open up two connections on my test server and illustrate the problem. Here's the netstat before everything. [koczan@ator] ~ $ netstat | grep mitchell tcp 0 0 ator.cs.wisc.edu:34279 mitchell.cs.wisc:postgresql ESTABLISHED tcp 0 0 ator.cs.wisc.edu:34280 mitchell.cs.wisc:postgresql ESTABLISHED Connection 1: csdb=> listen req; LISTEN Connection 2: csdb=> notify req; NOTIFY (repeat 5 times) Netstat as of now (after 5 notifies): [koczan@ator] ~ $ netstat | grep mitchell tcp 70 0 ator.cs.wisc.edu:34279 mitchell.cs.wisc:postgresql ESTABLISHED tcp 0 0 ator.cs.wisc.edu:34280 mitchell.cs.wisc:postgresql ESTABLISHED Connection 1: csdb=> notify req; NOTIFY Asynchronous notification "req" received from server process with PID 7268. Asynchronous notification "req" received from server process with PID 7268. Asynchronous notification "req" received from server process with PID 7268. Asynchronous notification "req" received from server process with PID 7268. Asynchronous notification "req" received from server process with PID 7268. Asynchronous notification "req" received from server process with PID 7267. Netstat as of now: [koczan@ator] ~ $ netstat | grep mitchell tcp 0 0 ator.cs.wisc.edu:34279 mitchell.cs.wisc:postgresql ESTABLISHED tcp 0 0 ator.cs.wisc.edu:34280 mitchell.cs.wisc:postgresql ESTABLISHED This does make some sense given our code (the listening connection only listens, it never sends out notifies of its own). However, this doesn't seem like expected behavior. I found nothing in the docs to suggest this being the norm, and it seems odd to require a connection to issue a notify itself before being able to read pending notifies from another connection. Any ideas? Peter
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
tomas@tuxteam.de
Date:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Thu, Aug 16, 2007 at 05:41:32PM -0500, Peter Koczan wrote: > Hey, I found something that finally clears netstat's recv-q with async > notifies. I don't quite understand what's going on, so any > enlightenment would be greatly appreciated. As near as I can figure, > the client doesn't appear to read the notifies until it sends a notify > of its own. Then it reads all pending notifies at once and clears the > recv-q. > > For instance, I can open up two connections on my test server and > illustrate the problem. > > Here's the netstat before everything. > > [koczan@ator] ~ $ netstat | grep mitchell > tcp 0 0 ator.cs.wisc.edu:34279 > mitchell.cs.wisc:postgresql ESTABLISHED > tcp 0 0 ator.cs.wisc.edu:34280 > mitchell.cs.wisc:postgresql ESTABLISHED > > Connection 1: > csdb=> listen req; > LISTEN > > Connection 2: > csdb=> notify req; > NOTIFY > (repeat 5 times) > > Netstat as of now (after 5 notifies): > [koczan@ator] ~ $ netstat | grep mitchell > tcp 70 0 ator.cs.wisc.edu:34279 > mitchell.cs.wisc:postgresql ESTABLISHED > tcp 0 0 ator.cs.wisc.edu:34280 > mitchell.cs.wisc:postgresql ESTABLISHED > > Connection 1: > csdb=> notify req; > NOTIFY > Asynchronous notification "req" received from server process with PID 7268. > Asynchronous notification "req" received from server process with PID 7268. [...] Hm. To me it looks like the first psql session isn't prepared to gather async notifies (checking...) indeed, psql only checks for async notifies when issuing commands (you don't have to issue a NOTIFY in "Connection 1" to receive the pending notifies -- *any* request directed to the server would do). > Any ideas? If you see the queue building up as above, the problem lies almost certainly on the client not gathering the notifies. The backend is obviously willing to talk to you :-) I'd look for the problem in the client library (It's DBD::Pg in your "real" case, isn't it?) Regards - -- tomás -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFGxZCABcgs9XrR2kYRAkAsAJ9lvE86sUMQstU0S08/eWT64XqgpwCfZzjt XVQrPm6b82m47QDFxcGyFTg= =Omtb -----END PGP SIGNATURE-----
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
Heikki Linnakangas
Date:
Peter Koczan wrote: >> Hm. To me it looks like the first psql session isn't prepared to gather >> async notifies (checking...) indeed, psql only checks for async notifies when >> issuing commands (you don't have to issue a NOTIFY in "Connection 1" to >> receive the pending notifies -- *any* request directed to the server would do). >> >> If you see the queue building up as above, the problem lies almost >> certainly on the client not gathering the notifies. The backend is >> obviously willing to talk to you :-) >> >> I'd look for the problem in the client library (It's DBD::Pg in your >> "real" case, isn't it?) > > *Light bulb lights up* Ah....I see what's going on now. > > I think I have enough information to test and fix the problem on our > end. Thank you all for your help and patience. This can be considered > resolved. Please share the ending with us, I'm really curious after following this thread :). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
"Peter Koczan"
Date:
> Hm. To me it looks like the first psql session isn't prepared to gather > async notifies (checking...) indeed, psql only checks for async notifies when > issuing commands (you don't have to issue a NOTIFY in "Connection 1" to > receive the pending notifies -- *any* request directed to the server would do). > > If you see the queue building up as above, the problem lies almost > certainly on the client not gathering the notifies. The backend is > obviously willing to talk to you :-) > > I'd look for the problem in the client library (It's DBD::Pg in your > "real" case, isn't it?) *Light bulb lights up* Ah....I see what's going on now. I think I have enough information to test and fix the problem on our end. Thank you all for your help and patience. This can be considered resolved. Peter
Re: BUG #3504: Some listening sessions never return from writing, problems ensue
From
"Peter Koczan"
Date:
> Please share the ending with us, I'm really curious after following this > thread :). It turns out that it was part DBD::Pg and part our app (a big part). A plugin for the app creates a separate db handle which listens, but never does anything with the notifications. Apparently, the recv-q cleared whenever a request was made via this plugin (which is rare for most people who use this), but nothing was processed, so notifications would build up. If they're left running for that long, then they get into the bad state. We fixed our app to not have the plugin's db handle listen (it was pointless to have it listen anyway) and all seems well. While it may technically be a bug that a database handle will get into this state after ~800 unprocessed notifications, I'd say that's much more the fault of the application programmers. Anyway, thanks for all your help. Peter