Thread: BUG #3504: Some listening sessions never return from writing, problems ensue

BUG #3504: Some listening sessions never return from writing, problems ensue

From
"Peter Koczan"
Date:
The following bug has been logged online:

Bug reference:      3504
Logged by:          Peter Koczan
Email address:      pjkoczan@gmail.com
PostgreSQL version: 8.2.4
Operating system:   CentOS 4.5 Linux (RHEL 4), kernel 2.6.9-55.ELsmp
Description:        Some listening sessions never return from writing,
problems ensue
Details:

There is a problem where connections listening for asynchronous notifies
occasionally block for writing on the server side and never finish,
resulting in connections that always have the status "notify interrupt".
Apparently, this causes no ill effects for the application that's listening
(i.e. it still updates somehow), but no data is written to the connection
when it should be. It also becomes a problem for restarting the server since
postgres cannot kill the connections when they're in that state. I either
have to explicitly kill the connections, kill the client apps, or reboot the
server. If I try to restart postgres, it kills all but the "notify
interrupt" connections, but it doesn't shut down so I can't restart it short
of rebooting.

Here are stack traces from listening sessions, the first is ok, the second
is blocked.

-----------------------------------------------------

OK:
[root@vero ~]# ps axvw | grep 30728
30728 ?        Ss     0:00      0  3265 42526 29240  0.7 postgres: koczan
csdb ator.cs.wisc.edu(38753) idle

[root@vero ~]# gdb -p 30728
...Standard loading messages...
(gdb) bt
#0  0x003bc7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x001bac93 in __read_nocancel () from /lib/tls/libc.so.6
#2  0x00e81e2e in sock_read () from
/s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e
#3  0x00000008 in ?? ()
#4  0x08c40068 in ?? ()
#5  0x00000005 in ?? ()
#6  0x00f160e8 in ?? () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e
#7  0x08c3a310 in ?? ()
#8  0x00000000 in ?? ()


(This stack trace is of the last 3 updates, and is fairly representative of
what I've seen).
[root@vero ~]# strace -p 30728
read(8, 0x8c40068, 5)                   = ? ERESTARTSYS (To be restarted)
--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0
semop(1081358, 0xbfe20b9c, 1)           = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
semop(1081358, 0xbfe20b6c, 1)           = 0
_llseek(6, 0, [6078464], SEEK_END)      = 0
time(NULL)                              = 1185998905
_llseek(5, 6569984, [6569984], SEEK_SET) = 0
write(5, "^\320\1\0\1\0\0\0!\0\0\0\0@d\225\34\0\0\0\10\0\0\0(\372"..., 8192)
= 8192
fdatasync(5)                            = 0
semop(1048589, 0xbfe20adc, 1)           = 0
semop(1048589, 0xbfe20b2c, 1)           = 0
write(8, "\27\3\1\0 \221\363{!\235\355\317\223\260\253\tY\242\350"..., 90) =
90
sigreturn()                             = ? (mask now [])
read(8, 0x8c40068, 5)                   = ? ERESTARTSYS (To be restarted)
--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0
semop(1081358, 0xbfe20b9c, 1)           = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
semop(1081358, 0xbfe20b6c, 1)           = 0
_llseek(6, 0, [6078464], SEEK_END)      = 0
time(NULL)                              = 1185998905
write(5, "^\320\1\0\1\0\0\0!\0\0\0\0`d\225H\0\0\0req\0 \272\301\10"...,
8192) = 8192
fdatasync(5)                            = 0
semop(1212434, 0xbfe20adc, 1)           = 0
write(8, "\27\3\1\0 \'\6\235\323\2\33\321\216R\370\277i\304\217h"..., 90) =
90
sigreturn()                             = ? (mask now [])
read(8, 0x8c40068, 5)                   = ? ERESTARTSYS (To be restarted)
--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0
semop(1081358, 0xbfe20b9c, 1)           = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
_llseek(6, 0, [6078464], SEEK_END)      = 0
time(NULL)                              = 1186000276
_llseek(5, 9453568, [9453568], SEEK_SET) = 0
write(5, "^\320\1\0\1\0\0\0!\0\0\0\0@\220\225r\4\0\0\0\0\0\0\0\0"..., 8192)
= 8192
fdatasync(5)                            = 0
semop(1081358, 0xbfe20adc, 1)           = 0
write(8, "\27\3\1\0 \301\307_\375=\0032\243\212\215\362\3457\335"..., 90) =
90
sigreturn()                             = ? (mask now [])
read(8, 0x8c40068, 5)                   = ? ERESTARTSYS (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
semop(1081358, 0xbfe20d2c, 1)           = 0
semop(1081358, 0xbfe20d2c, 1)           = 0
semop(1081358, 0xbfe20d2c, 1)           = 0
sigreturn()                             = ? (mask now [])
read(8,

-----------------------------------------------------

Blocked process:
[root@vero ~]# ps axvw | grep 19583
19583 ?        Ss     0:02      0  3265 42518 29368  0.7 postgres: craft
csdb zim.cs.wisc.edu(32777) notify interrupt

[root@vero ~]# gdb -p 19583
...Standard loading messages...
(gdb) bt
#0  0x003bc7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x001bad13 in __write_nocancel () from /lib/tls/libc.so.6
#2  0x00e81ec4 in sock_write () from
/s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e
#3  0x00000008 in ?? ()
#4  0x08c4484a in ?? ()
#5  0x00000038 in ?? ()
#6  0x00000030 in ?? ()
#7  0x00f160e8 in ?? () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e
#8  0x08c3a7f0 in ?? ()
#9  0x00000000 in ?? ()

(Stack trace from the same amount of time, same 3 updates)
[root@vero ~]# strace -p 19583
Process 19583 attached - interrupt to quit
write(8, "\224\337\252\27\3\1\0000\247W\250\2-\307;\204w\276\242"..., 56) =
? ERESTARTSYS (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
sigreturn()                             = ? (mask now [USR2])
write(8, "\224\337\252\27\3\1\0000\247W\250\2-\307;\204w\276\242"..., 56

My best guess is that something's weird with semop, but I don't
know.

Note: I'm compiling from source with PAM, krb5 (1.5.3), and openssl
(0.9.8e).

Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
Heikki Linnakangas
Date:
Peter Koczan wrote:
> There is a problem where connections listening for asynchronous notifies
> occasionally block for writing on the server side and never finish,
> resulting in connections that always have the status "notify interrupt".
> Apparently, this causes no ill effects for the application that's listening
> (i.e. it still updates somehow), but no data is written to the connection
> when it should be. It also becomes a problem for restarting the server since
> postgres cannot kill the connections when they're in that state. I either
> have to explicitly kill the connections, kill the client apps, or reboot the
> server. If I try to restart postgres, it kills all but the "notify
> interrupt" connections, but it doesn't shut down so I can't restart it short
> of rebooting.

Does the client read the async notifies? The write in the server will
block if the client doesn't keep up with reading the data.

> ...
>
> (This stack trace is of the last 3 updates, and is fairly representative of
> what I've seen).
> [root@vero ~]# strace -p 30728

That's actually not a stack trace, but a system call trace.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
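Heikki's point, that the server-side write blocks once the client stops draining its end of the socket, can be seen with a toy sketch in plain Python sockets. Nothing here is PostgreSQL-specific; it just shows the kernel socket buffers filling up when the peer never reads:

```python
import socket

# Toy demonstration (plain sockets, not PostgreSQL): once the reader stops
# consuming, the kernel buffers fill and the writer can make no progress.
server, client = socket.socketpair()
server.setblocking(False)          # so the demo errors out instead of hanging
sent = 0
try:
    while True:
        sent += server.send(b"x" * 4096)   # `client` never reads a byte
except BlockingIOError:
    pass  # buffers full: a *blocking* write here would hang, like the backend's
server.close()
client.close()
print(sent > 0)  # True; `sent` is roughly the kernel's socket buffer capacity
```

A backend writing a notify to such a socket sits in `write()` indefinitely, which matches the `__write_nocancel`/`sock_write` frames in the blocked process's backtrace.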
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Peter Koczan wrote:
>> There is a problem where connections listening for asynchronous notifies
>> occasionally block for writing on the server side and never finish,

> Does the client read the async notifies? The write in the server will
> block if the client doesn't keep up with reading the data.

Loss of network connectivity to the client (eg, timeout of a NAT
mapping) seems like another plausible explanation.

            regards, tom lane
Heikki Linnakangas wrote:
> Does the client read the async notifies? The write in the server will
> block if the client doesn't keep up with reading the data.

Well, the client updates its GUI properly no matter whether the
listening session is blocked or not (it's an ongoing tail of requests).
Outwardly, from the client side, it's completely transparent. Is there any
way I can tell definitively from the client side whether or not it's
actually reading the async notifies?
>> (This stack trace is of the last 3 updates, and is fairly representative of
>> what I've seen).
>> [root@vero ~]# strace -p 30728
>
> That's actually not a stack trace, but a system call trace.

D'oh, sorry; still a little new at using these tools for debugging.
Tom Lane wrote:
> Loss of network connectivity to the client (eg, timeout of a NAT
> mapping) seems like another plausible explanation.

It's plausible, but unlikely for our network setup. We rarely use NAT
and if the client lost its network, this particular application (it's
Perl with GTK and DBD::Pg, by the way) tends to hang since it doesn't
try to re-establish a connection to the DB short of killing and
restarting it.

Peter
Peter Koczan <pjkoczan@gmail.com> writes:
> Heikki Linnakangas wrote:
>> Does the client read the async notifies? The write in the server will
>> block if the client doesn't keep up with reading the data.
>>
> Well, the client updates its GUI properly no matter whether the
> listening session is blocked or not (it's an ongoing tail of requests).
> Overtly from the client side, it's completely transparent. Is there any
> way I can tell definitively from the client side whether or not it's
> actually reading the async notifies?

netstat's Recv-Q column would probably reflect it if the client is
failing to consume messages.  The send queue on the server side would be
worth checking too.

            regards, tom lane
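Tom's check can be scripted. The sketch below is illustrative (Python; the `stuck_sockets` helper is hypothetical, and the column layout assumed is the Linux net-tools netstat: Proto, Recv-Q, Send-Q, Local, Foreign, State); the sample lines are modeled on the output later in this thread:

```python
# Hypothetical helper: flag netstat lines whose Recv-Q or Send-Q is nonzero,
# i.e. sockets where one side has stopped consuming or delivering data.
sample = """\
tcp    73402      0 zim.cs.wisc.edu:32777 vero.cs.wisc.edu:postgresql ESTABLISHED
tcp        0      0 zim.cs.wisc.edu:32776 vero.cs.wisc.edu:postgresql ESTABLISHED"""

def stuck_sockets(netstat_output):
    """Return the lines whose Recv-Q or Send-Q column is nonzero."""
    bad = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].startswith("tcp") and \
           (int(fields[1]) > 0 or int(fields[2]) > 0):
            bad.append(line)
    return bad

print(len(stuck_sockets(sample)))  # 1 -- the zim:32777 session has bytes unread
```

Against a live system the same filter would be fed from `netstat -tn` output instead of a captured sample.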

Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
"Peter Koczan"
Date:
I found something pretty interesting when running netstat:

Before a series of 3 async notifies (second column is Recv-Q):

OK connections:
[koczan@ator] koczan $ netstat | grep vero
tcp      180      0 ator.cs.wisc.edu:32785
vero.cs.wisc.edu:postgresql ESTABLISHED
tcp        0      0 ator.cs.wisc.edu:32784
vero.cs.wisc.edu:postgresql ESTABLISHED
[koczan@ator] koczan $ ssh newton netstat | grep vero
tcp    64260      0 newton.cs.wisc.edu:32778
vero.cs.wisc.edu:postgresql ESTABLISHED
tcp        0      0 newton.cs.wisc.edu:32777
vero.cs.wisc.edu:postgresql ESTABLISHED

"notify interrupt" connections:
[koczan@ator] koczan $ ssh brian netstat | grep vero
tcp        0      0 brian.cs.wisc.edu:40365
vero.cs.wisc.edu:postgresql ESTABLISHED
tcp    77800      0 brian.cs.wisc.edu:40366
vero.cs.wisc.edu:postgresql ESTABLISHED
[koczan@ator] koczan $ ssh zim netstat | grep vero
tcp    73402      0 zim.cs.wisc.edu:32777
vero.cs.wisc.edu:postgresql ESTABLISHED
tcp        0      0 zim.cs.wisc.edu:32776
vero.cs.wisc.edu:postgresql ESTABLISHED

and after said notifies:

OK connections:
[koczan@ator] koczan $ netstat | grep vero
tcp      450      0 ator.cs.wisc.edu:32785
vero.cs.wisc.edu:postgresql ESTABLISHED
tcp        0      0 ator.cs.wisc.edu:32784
vero.cs.wisc.edu:postgresql ESTABLISHED
[koczan@ator] koczan $ ssh newton netstat | grep vero
tcp    64260      0 newton.cs.wisc.edu:32778
vero.cs.wisc.edu:postgresql ESTABLISHED
tcp        0      0 newton.cs.wisc.edu:32777
vero.cs.wisc.edu:postgresql ESTABLISHED

"notify interrupt" connections:
[koczan@ator] koczan $ ssh brian netstat | grep vero
tcp        0      0 brian.cs.wisc.edu:40365
vero.cs.wisc.edu:postgresql ESTABLISHED
tcp    77800      0 brian.cs.wisc.edu:40366
vero.cs.wisc.edu:postgresql ESTABLISHED
[koczan@ator] koczan $ ssh zim netstat | grep vero
tcp    73402      0 zim.cs.wisc.edu:32777
vero.cs.wisc.edu:postgresql ESTABLISHED
tcp        0      0 zim.cs.wisc.edu:32776
vero.cs.wisc.edu:postgresql ESTABLISHED

And on the server side things get a little more interesting (Send-Q is the
3rd column)...

OK:
[koczan@vero] ~ $ netstat | grep ator
tcp        0      0 vero.cs.wisc.edu:postgresql ator.cs.wisc.edu:32785
ESTABLISHED
tcp        0      0 vero.cs.wisc.edu:postgresql ator.cs.wisc.edu:32784
ESTABLISHED
[koczan@vero] ~ $ netstat | grep newton
tcp        0      0 vero.cs.wisc.edu:postgresql newton.cs.wisc.edu:32778
ESTABLISHED
tcp        0      0 vero.cs.wisc.edu:postgresql newton.cs.wisc.edu:32777
ESTABLISHED

"notify interrupt":
[koczan@vero] ~ $ netstat | grep brian
tcp        0      0 vero.cs.wisc.edu:postgresql brian.cs.wisc.edu:40365
ESTABLISHED
tcp        0  13032 vero.cs.wisc.edu:postgresql brian.cs.wisc.edu:40366
ESTABLISHED
[koczan@vero] ~ $ netstat | grep zim
tcp        0  13032 vero.cs.wisc.edu:postgresql zim.cs.wisc.edu:32777
ESTABLISHED
tcp        0      0 vero.cs.wisc.edu:postgresql zim.cs.wisc.edu:32776
ESTABLISHED

A quick perusal of the other "notify interrupt" connections shows 13032 in
the Send-Q column. They all got into this state at the same time.

Peter

P.S. Thanks for the help, I really appreciate it.


Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
"Peter Koczan"
Date:
Whoops, a couple quick changes. This:
[koczan@ator] koczan $ ssh newton netstat | grep vero
tcp    64260      0 newton.cs.wisc.edu:32778
vero.cs.wisc.edu:postgresql ESTABLISHED
tcp        0      0 newton.cs.wisc.edu:32777
vero.cs.wisc.edu:postgresql ESTABLISHED

should be this:
[koczan@ator] koczan $ ssh newton netstat | grep vero
tcp    64530      0 newton.cs.wisc.edu:32778
vero.cs.wisc.edu:postgresql ESTABLISHED
tcp        0      0 newton.cs.wisc.edu:32777
vero.cs.wisc.edu:postgresql ESTABLISHED

Also, the Send-Q column on the server side doesn't change, even after more
async notifies. They may not have happened at the same time, but it's a bit
perplexing that they all have the exact same amount of data in the queue.

Peter


Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
Heikki Linnakangas
Date:
It looks to me that the server is trying to send stuff to the client,
but the client is not reading it for some reason. There's nothing in the
queues in the direction client->server.

What exactly is the client doing? How does it connect to the server,
with libpq? Is it possible that the client is just leaking the
connection, keeping it open but not using it for anything?

Peter Koczan wrote:
> Whoops, a couple quick changes. This:
> [koczan@ator] koczan $ ssh newton netstat | grep vero
> tcp    64260      0 newton.cs.wisc.edu:32778
> vero.cs.wisc.edu:postgresql ESTABLISHED
> tcp        0      0 newton.cs.wisc.edu:32777
> vero.cs.wisc.edu:postgresql ESTABLISHED
>
> should be this:
> [koczan@ator] koczan $ ssh newton netstat | grep vero
> tcp    64530      0 newton.cs.wisc.edu:32778
> vero.cs.wisc.edu:postgresql ESTABLISHED
> tcp        0      0 newton.cs.wisc.edu:32777
> vero.cs.wisc.edu:postgresql ESTABLISHED
>
> Also, the Send-Q column on the server side doesn't change, even after more
> async notifies. They may not have happened at the same time, but it's a bit
> perplexing that they all have the exact same amount of data in the queue.
>
> Peter


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
"Peter Koczan" <pjkoczan@gmail.com> writes:
> A quick perusal of the other "notify interrupt" connections shows 13032 in
> the Send-Q column. They all got into this state at the same time.

That's far too much data to be a notify message, or even a small number
of notify messages.  Is it possible that the client's failed to collect
hundreds of notify events?

But in any case this clearly lets the backend off the hook.  The problem
is on the client side: it's not reading the data.

            regards, tom lane

Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
"Peter Koczan"
Date:
Either it's not reading the notifies or it's not telling the server that
it's read them and flushing out the pipe.

This app uses Perl/DBD::Pg, and the code is basically:

while ($dbh->func('pg_notifies')) {
  # do stuff
}

The corresponding C code in DBD::Pg appears to be:
/* ================================================================== */
SV * dbd_db_pg_notifies (dbh, imp_dbh)
                 SV *dbh;
                 imp_dbh_t *imp_dbh;
{
        PGnotify *notify;
        AV *ret;
        SV *retsv;
        int status;

        if (dbis->debug >= 3) { (void)PerlIO_printf(DBILOGFP, "dbdpg:
dbd_db_pg_notifies\n"); }

        status = PQconsumeInput(imp_dbh->conn);
        if (0 == status) {
                status = PQstatus(imp_dbh->conn);
                pg_error(dbh, status, PQerrorMessage(imp_dbh->conn));
                return &sv_undef;
        }

        notify = PQnotifies(imp_dbh->conn);

        if (!notify)
                return &sv_undef;

        ret=newAV();

        av_push(ret, newSVpv(notify->relname,0) );
        av_push(ret, newSViv(notify->be_pid) );

#if PGLIBVERSION >= 70400
        PQfreemem(notify);
#else
        Safefree(notify);
#endif

        retsv = newRV(sv_2mortal((SV*)ret));

        return retsv;

} /* end of dbd_db_pg_notifies */


"Peter Koczan" <pjkoczan@gmail.com> writes:
> This app uses Perl/DBD::Pg, and the code is basically:

> while ($dbh->func('pg_notifies')) {
>   # do stuff
> }

If I'm not mistaken, that loop would process all the currently available
notify messages and then fall out.  So the real question is what's
around this --- what conditions will make the program come back to this
loop periodically?  Or not do so?

            regards, tom lane
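Tom's description of the loop's behavior can be modeled with a small sketch (Python; `notifies` and `drain` are illustrative stand-ins for the libpq notify queue and the Perl loop, not the real app): the drain processes everything currently queued and then falls out, so some outer mechanism must keep calling it.

```python
import queue

# Toy model of the app's polling structure: `notifies` stands in for
# pg_notifies; names here are illustrative.
notifies = queue.Queue()

def drain(q):
    """Process every currently available notify, then fall out --
    mirroring `while ($dbh->func('pg_notifies')) { ... }`."""
    handled = 0
    while True:
        try:
            q.get_nowait()       # in the real app: pg_notifies / "do stuff"
        except queue.Empty:
            return handled       # nothing left: caller must poll again later
        handled += 1

for _ in range(3):
    notifies.put("req_log_updated")
print(drain(notifies))  # 3 -- and a later call returns 0 until new notifies arrive
```

If whatever drives the outer calls ever stops firing, nothing consumes the socket and the server's send queue backs up, which is exactly the symptom seen in netstat.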
I think the code is somehow interrupt based, but I don't know. I've
asked the primary maintainer of the program to look into this with me.
We should have more insight by early next week.

Peter

Tom Lane wrote:
> If I'm not mistaken, that loop would process all the currently available
> notify messages and then fall out.  So the real question is what's
> around this --- what conditions will make the program come back to this
> loop periodically?  Or not do so?
"Peter Koczan" <pjkoczan@gmail.com> writes:
> So, it looks like the app polls for async notifies every second. Hopefully
> you can glean some insight from that.

My guess is that some condition causes that to stop happening.

One way to monitor this from outside is to use strace to watch the app.
According to the above you should see a recv() attempt on the libpq
socket every second.  I suspect that you won't see any more once the app
gets into the state where the recv-q is backing up.  Looking at what
happens around the time of the last recv() might give some clues.

            regards, tom lane
"Peter Koczan" <pjkoczan@gmail.com> writes:
> Here's my theory (and feel free to tell me that I'm full of it)...somehow, a
> lot of notifies happened at once, or in a very short period of time, to the
> point where the app was still processing notifies when the timer clicked off
> another second. The connection (or app, or perl module) never marked those
> notifies as being processed, or never updated its timestamp of when it
> finished, so when the next notify came around, it tried to reprocess the old
> data (or data since the last time it finished), and yet again couldn't
> finish. Lather, rinse, repeat. In sum, it might be that trying to call
> pg_notifies while processing notifies tickles a race condition and tricks
> the connection into thinking it's in a bad state.

Hmm.  Is the app trying to do this processing inside an interrupt
service routine (a/k/a signal handler)?  If so, and if the ISR can
interrupt itself, then you've got a problem because you'll be doing
reentrant calls of libpq, which it doesn't support.  You can only make
that work if the handler blocks further occurrences of its signal until
it finishes.

            regards, tom lane
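The discipline Tom describes, not letting the handler interrupt itself, can be sketched as a simple re-entrancy guard (Python; all names here are illustrative, and a real signal-based fix would block the signal itself rather than use a flag):

```python
# Sketch of the non-reentrancy discipline: if the periodic handler fires
# while a previous invocation is still processing notifies, refuse to
# re-enter the non-reentrant client code (libpq does not support it).
in_handler = False
deferred = 0

def on_timer(process):
    global in_handler, deferred
    if in_handler:        # a previous invocation is still running:
        deferred += 1     # don't re-enter; note it and bail out
        return
    in_handler = True
    try:
        process()         # the actual notify-draining work
    finally:
        in_handler = False

on_timer(lambda: on_timer(lambda: None))  # the nested call is refused
print(deferred)  # 1
```

With multiple GTK timers (one per main window, as described below in the thread), two timers firing close together would hit exactly this guard, or, without one, re-enter the driver.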

Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
"Peter Koczan"
Date:
Here's what the primary maintainer of the app had to say about how notifies
get read.

This is in update_reqs in Gtkreq, which gets called by a Gtk timer
every second.  Then, as long as there are notifies, it will
continue to read from the req log.

There's a timer for every instance of the main window, so it looks
like if you have more than one main window open, update_reqs would
be called more often.  That might have to be changed.

So, it looks like the app polls for async notifies every second. Hopefully
you can glean some insight from that.

Peter


Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
"Peter Koczan"
Date:
Not quite. Here's what happens every second when idling, for both good and
bad connections (with different values for each syscall). It is trying to
read data; I am a little concerned about the EAGAIN return value, though.

[koczan@ator] ~ $ strace -p 6987
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}, {fd=8,
events=POLLIN}], 3, 869) = 0
gettimeofday({1186445557, 631779}, NULL) = 0
read(5, 0x9651540, 5)                   = -1 EAGAIN (Resource temporarily
unavailable)
gettimeofday({1186445557, 632197}, NULL) = 0
ioctl(3, FIONREAD, [0])                 = 0
poll(<unfinished ...>
Process 6987 detached

However, looking at straces when an async notify actually happens, I notice
something interesting. The number of syscalls over the same interrupt on the
bad connection is nearly 5 times that of the good connection. By the time
the good connection is finished, the bad connection is still mucking with
data. I can send you the actual straces if you think it would help.

Here's my theory (and feel free to tell me that I'm full of it)...somehow, a
lot of notifies happened at once, or in a very short period of time, to the
point where the app was still processing notifies when the timer clicked off
another second. The connection (or app, or perl module) never marked those
notifies as being processed, or never updated its timestamp of when it
finished, so when the next notify came around, it tried to reprocess the old
data (or data since the last time it finished), and yet again couldn't
finish. Lather, rinse, repeat. In sum, it might be that trying to call
pg_notifies while processing notifies tickles a race condition and tricks
the connection into thinking it's in a bad state.

The holes in my theory are that the app doesn't try to read everything every
second for bad connections (so notifies are being processed somehow), and it
doesn't explain why the recv-q is filling up even for good connections.
Hopefully this will help somehow, though.

Peter

P.S. Just in case it's useful, here are the contents of pg_listener for bad
connections (I verified that notification = 0 for good connections).
csdb=# select * from pg_listener where notification > 0;
 relname | listenerpid | notification
---------+-------------+--------------
 req     |        3292 |        19141
 req     |       25848 |        16308
 req     |       17004 |        23815
 req     |       19583 |         1703
(4 rows)

On 8/6/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> "Peter Koczan" <pjkoczan@gmail.com> writes:
> > So, it looks like the app polls for async notifies every second.
> Hopefully
> > you can glean some insight from that.
>
> My guess is that some condition causes that to stop happening.
>
> One way to monitor this from outside is to use strace to watch the app.
> According to the above you should see a recv() attempt on the libpq
> socket every second.  I suspect that you won't see any more once the app
> gets into the state where the recv-q is backing up.  Looking at what
> happens around the time of the last recv() might give some clues.
>
>                         regards, tom lane
>

Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
"Peter Koczan"
Date:
On 8/6/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Peter Koczan" <pjkoczan@gmail.com> writes:
> > Here's my theory (and feel free to tell me that I'm full of it)...somehow, a
> > lot of notifies happened at once, or in a very short period of time, to the
> > point where the app was still processing notifies when the timer clicked off
> > another second. The connection (or app, or perl module) never marked those
> > notifies as being processed, or never updated its timestamp of when it
> > finished, so when the next notify came around, it tried to reprocess the old
> > data (or data since the last time it finished), and yet again couldn't
> > finish. Lather, rinse, repeat. In sum, it might be that trying to call
> > pg_notifies while processing notifies tickles a race condition and tricks
> > the connection into thinking its in a bad state.
>
> Hmm.  Is the app trying to do this processing inside an interrupt
> service routine (a/k/a signal handler)?  If so, and if the ISR can
> interrupt itself, then you've got a problem because you'll be doing
> reentrant calls of libpq, which it doesn't support.  You can only make
> that work if the handler blocks further occurrences of its signal until
> it finishes.
>

I'm not entirely sure if this answers your question, but here's what I
found out from the primary maintainer of the app. Note that
update_reqs is the function calling pg_notifies. If there's more
information I can provide or another test we can run, please let me
know.

------- BEGIN MESSAGE -------
I just checked and the timer won't interrupt update_reqs, so we'll
have to look for another solution.  Anyway, update_reqs doesn't do
anything with the database except for checking for a notify, so I
don't see where it can be interrupted to cause DB problems.
------- END MESSAGE -------

I also found out that one notify gets sent per action (not per batch
of actions), so if n requests get resolved at once, n notifies are
sent, not 1. In theory, batching those into a single notify could
mitigate this problem, but I don't know how easy that change would be
at this point. Still, it doesn't explain how or why the client's
recv-q isn't getting cleared.

Hope this helps.

Peter

Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
"Peter Koczan"
Date:
On 8/9/07, Peter Koczan <pjkoczan@gmail.com> wrote:
> On 8/6/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > "Peter Koczan" <pjkoczan@gmail.com> writes:
> > > Here's my theory (and feel free to tell me that I'm full of it)...somehow, a
> > > lot of notifies happened at once, or in a very short period of time, to the
> > > point where the app was still processing notifies when the timer clicked off
> > > another second. The connection (or app, or perl module) never marked those
> > > notifies as being processed, or never updated its timestamp of when it
> > > finished, so when the next notify came around, it tried to reprocess the old
> > > data (or data since the last time it finished), and yet again couldn't
> > > finish. Lather, rinse, repeat. In sum, it might be that trying to call
> > > pg_notifies while processing notifies tickles a race condition and tricks
> > > the connection into thinking its in a bad state.
> >
> > Hmm.  Is the app trying to do this processing inside an interrupt
> > service routine (a/k/a signal handler)?  If so, and if the ISR can
> > interrupt itself, then you've got a problem because you'll be doing
> > reentrant calls of libpq, which it doesn't support.  You can only make
> > that work if the handler blocks further occurrences of its signal until
> > it finishes.
> >
>
> I'm not entirely sure if this answers your question, but here's what I
> found out from the primary maintainer of the app. Note that
> update_reqs is the function calling pg_notifies. If there's more
> information I can provide or another test we can run, please let me
> know.
>
> ------- BEGIN MESSAGE -------
> I just checked and the timer won't interrupt update_reqs, so we'll
> have to look for another solution.  Anyway, update_reqs doesn't do
> anything with the database except for checking for a notify, so I
> don't see where it can be interrupted to cause DB problems.
> ------- END MESSAGE -------
>
> I also found out that one notify gets sent per action (not per batch
> of actions), so if n requests get resolved at once, n notifies are
> sent, not 1. In theory this could mitigate this problem, but I don't
> know how easy it is at this point. Still, it doesn't explain how or
> why the client's recv-q isn't getting cleared.
>
> Hope this helps.
>

On our end, we changed the code in the function calling pg_notifies
to use an if statement rather than a while loop (that way it only
updates once per second instead of continuously for as long as there
are pending async notifies).

I looked more closely at the docs for DBD::Pg, and the pg_notifies
call grabs *all* pending async notifies and returns them in a hash,
not just one at a time. So, what was happening before was that if a
new notify came through while processing the previous notifies, the
code would reprocess. Lather, rinse, repeat. I think that if the
program is still waiting on pg_notifies when the timer fires again,
the client ends up calling pg_notifies while it is still waiting for
something, and that is what gets the listening connection into the
bad state.

In theory this change should mitigate the "notify interrupt" behavior
on our end, but, again, why the client's recv-q is filling up is as
yet unexplained.

Peter

P.S. In src/backend/commands/async.c, somewhere between lines 910 and
981 (the set_ps_display calls) is where the code gets interrupted. How
and why, I don't know.

Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
Tom Lane
Date:
"Peter Koczan" <pjkoczan@gmail.com> writes:
> For instance, I can open up two connections on my test server and
> illustrate the problem.

Well, the fact that psql acts that way is not unexpected, because it
doesn't monitor the database connection constantly --- when it's waiting
for console input, it just sleeps.  It's not so much "send a NOTIFY"
that wakes it up, as "do anything at all with the database connection".

What you've told us so far about your own application doesn't suggest
that it would need to be told to send a notify before it remembered to
check for input, but maybe there's some interaction there that you
missed.

            regards, tom lane

Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
"Peter Koczan"
Date:
Hey, I found something that finally clears netstat's recv-q with async
notifies. I don't quite understand what's going on, so any
enlightenment would be greatly appreciated. As near as I can figure,
the client doesn't appear to read the notifies until it sends a notify
of its own. Then it reads all pending notifies at once and clears the
recv-q.

For instance, I can open up two connections on my test server and
illustrate the problem.

Here's the netstat before everything.

[koczan@ator] ~ $ netstat | grep mitchell
tcp        0      0 ator.cs.wisc.edu:34279
mitchell.cs.wisc:postgresql ESTABLISHED
tcp        0      0 ator.cs.wisc.edu:34280
mitchell.cs.wisc:postgresql ESTABLISHED

Connection 1:
csdb=> listen req;
LISTEN

Connection 2:
csdb=> notify req;
NOTIFY
(repeat 5 times)

Netstat as of now (after 5 notifies):
[koczan@ator] ~ $ netstat | grep mitchell
tcp       70      0 ator.cs.wisc.edu:34279
mitchell.cs.wisc:postgresql ESTABLISHED
tcp        0      0 ator.cs.wisc.edu:34280
mitchell.cs.wisc:postgresql ESTABLISHED

Connection 1:
csdb=> notify req;
NOTIFY
Asynchronous notification "req" received from server process with PID 7268.
Asynchronous notification "req" received from server process with PID 7268.
Asynchronous notification "req" received from server process with PID 7268.
Asynchronous notification "req" received from server process with PID 7268.
Asynchronous notification "req" received from server process with PID 7268.
Asynchronous notification "req" received from server process with PID 7267.

Netstat as of now:
[koczan@ator] ~ $ netstat | grep mitchell
tcp        0      0 ator.cs.wisc.edu:34279
mitchell.cs.wisc:postgresql ESTABLISHED
tcp        0      0 ator.cs.wisc.edu:34280
mitchell.cs.wisc:postgresql ESTABLISHED

This does make some sense given our code (the listening connection
only listens, it never sends out notifies of its own). However, this
doesn't seem like expected behavior. I found nothing in the docs to
suggest this being the norm, and it seems odd to require a connection
to issue a notify itself before being able to read pending notifies
from another connection.
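The recv-q behavior above is ordinary TCP buffering rather than anything PostgreSQL-specific, which can be sketched with a plain socket pair (a hedged illustration; nothing here touches libpq): the kernel holds the server's bytes until the client actually reads the socket, which psql only does as a side effect of its next interaction with the server.

```python
import select
import socket

# A connected pair of sockets stands in for the client/server link.
server, client = socket.socketpair()
server.sendall(b"notify req\n" * 5)  # pretend these are notification messages

# The data sits in the client's receive queue the whole time; select()
# reports it as readable immediately.
readable, _, _ = select.select([client], [], [], 0)
print(bool(readable))                # True: bytes parked in the recv-q

# The queue only drains when the client finally reads.
data = client.recv(4096)
print(data.startswith(b"notify req"))  # True: queue cleared once we read

server.close()
client.close()
```

So "issue a NOTIFY on connection 1" works not because NOTIFY is special, but because it forces psql to read from the socket, at which point all the buffered notifications are delivered at once.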

Any ideas?

Peter

Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
tomas@tuxteam.de
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, Aug 16, 2007 at 05:41:32PM -0500, Peter Koczan wrote:
> Hey, I found something that finally clears netstat's recv-q with async
> notifies. I don't quite understand what's going on, so any
> enlightenment would be greatly appreciated. As near as I can figure,
> the client doesn't appear to read the notifies until it sends a notify
> of its own. Then it reads all pending notifies at once and clears the
> recv-q.
>
> For instance, I can open up two connections on my test server and
> illustrate the problem.
>
> Here's the netstat before everything.
>
> [koczan@ator] ~ $ netstat | grep mitchell
> tcp        0      0 ator.cs.wisc.edu:34279
> mitchell.cs.wisc:postgresql ESTABLISHED
> tcp        0      0 ator.cs.wisc.edu:34280
> mitchell.cs.wisc:postgresql ESTABLISHED
>
> Connection 1:
> csdb=> listen req;
> LISTEN
>
> Connection 2:
> csdb=> notify req;
> NOTIFY
> (repeat 5 times)
>
> Netstat as of now (after 5 notifies):
> [koczan@ator] ~ $ netstat | grep mitchell
> tcp       70      0 ator.cs.wisc.edu:34279
> mitchell.cs.wisc:postgresql ESTABLISHED
> tcp        0      0 ator.cs.wisc.edu:34280
> mitchell.cs.wisc:postgresql ESTABLISHED
>
> Connection 1:
> csdb=> notify req;
> NOTIFY
> Asynchronous notification "req" received from server process with PID 7268.
> Asynchronous notification "req" received from server process with PID 7268.
[...]

Hm. To me it looks like the first psql session isn't prepared to gather
async notifies. (Checking...) Indeed, psql only checks for async notifies
when issuing commands (you don't have to issue a NOTIFY in "Connection 1" to
receive the pending notifies -- *any* request directed to the server would do).

> Any ideas?

If you see the queue building up as above, the problem almost certainly
lies with the client not gathering the notifies. The backend is
obviously willing to talk to you :-)

I'd look for the problem in the client library (It's DBD::Pg in your
"real" case, isn't it?)
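The implication for a listen-only client can be sketched with a stub connection (hypothetical throughout; `Conn` is not DBD::Pg's real API): notifies pile up in the receive queue until the client does *something* with the connection, at which point they all come through at once.

```python
class Conn:
    """Stub connection object; models queue behavior only, not a real driver."""
    def __init__(self):
        self.recv_q = []      # stands in for the kernel receive queue
        self.notifies = []    # notifies the client library has parsed

    def deliver(self, name):
        # Server side: a NOTIFY lands in the client's recv-q.
        self.recv_q.append(name)

    def pg_notifies(self):
        # Client side: any interaction with the connection drains the
        # queue; all pending notifies come through at once.
        self.notifies.extend(self.recv_q)
        self.recv_q.clear()
        return self.notifies.pop(0) if self.notifies else None

conn = Conn()
for _ in range(5):
    conn.deliver("req")
print(len(conn.recv_q))       # 5: builds up while the client never reads

conn.pg_notifies()            # one periodic check drains the socket
print(len(conn.recv_q))       # 0
```

The fix on the client side is simply to make sure every listening handle is polled periodically; a handle that listens but is never touched again accumulates exactly the backlog seen in the netstat output.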

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFGxZCABcgs9XrR2kYRAkAsAJ9lvE86sUMQstU0S08/eWT64XqgpwCfZzjt
XVQrPm6b82m47QDFxcGyFTg=
=Omtb
-----END PGP SIGNATURE-----

Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
Heikki Linnakangas
Date:
Peter Koczan wrote:
>> Hm. To me it looks like the first psql session isn't prepared to gather
>> async notifies (checking...) indeed, psql only checks for async notifies when
>> issuing commands (you don't have to issue a NOTIFY in "Connection 1" to
>> receive the pending notifies -- *any* request directed to the server would do).
>>
>> If you see the queue building up as above, the problem lies almost
>> certainly on the client not gathering the notifies. The backend is
>> obviously willing to talk to you :-)
>>
>> I'd look for the problem in the client library (It's DBD::Pg in your
>> "real" case, isn't it?)
>
> *Light bulb lights up* Ah....I see what's going on now.
>
> I think I have enough information to test and fix the problem on our
> end. Thank you all for your help and patience. This can be considered
> resolved.

Please share the ending with us, I'm really curious after following this
thread :).

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
"Peter Koczan"
Date:
> Hm. To me it looks like the first psql session isn't prepared to gather
> async notifies (checking...) indeed, psql only checks for async notifies when
> issuing commands (you don't have to issue a NOTIFY in "Connection 1" to
> receive the pending notifies -- *any* request directed to the server would do).
>
> If you see the queue building up as above, the problem lies almost
> certainly on the client not gathering the notifies. The backend is
> obviously willing to talk to you :-)
>
> I'd look for the problem in the client library (It's DBD::Pg in your
> "real" case, isn't it?)

*Light bulb lights up* Ah....I see what's going on now.

I think I have enough information to test and fix the problem on our
end. Thank you all for your help and patience. This can be considered
resolved.

Peter

Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From
"Peter Koczan"
Date:
> Please share the ending with us, I'm really curious after following this
> thread :).

It turns out that it was part DBD::Pg and part our app (a big part). A
plugin for the app creates a separate db handle which listens, but
never does anything with the notifications. Apparently, the recv-q
cleared whenever a request was made via this plugin (which is rare for
most people who use it), but nothing was processed, so notifications
would build up. If a connection is left running long enough in that
state, it gets into the bad state.

We fixed our app to not have the plugin's db handle listen (it was
pointless to have it listen anyway) and all seems well.

While it may technically be a bug that a database handle will get into
this state after ~800 unprocessed notifications, I'd say that's much
more the fault of the application programmers.

Anyway, thanks for all your help.

Peter