Re: BUG #3504: Some listening sessions never return from writing, problems ensue - Mailing list pgsql-bugs
From | Peter Koczan |
---|---|
Subject | Re: BUG #3504: Some listening sessions never return from writing, problems ensue |
Date | |
Msg-id | 4544e0330708100859ibd71a7brddf224669bd58eab@mail.gmail.com |
In response to | Re: BUG #3504: Some listening sessions never return from writing, problems ensue ("Peter Koczan" <pjkoczan@gmail.com>) |
Responses | Re: BUG #3504: Some listening sessions never return from writing, problems ensue |
List | pgsql-bugs |
On 8/9/07, Peter Koczan <pjkoczan@gmail.com> wrote:
> On 8/6/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > "Peter Koczan" <pjkoczan@gmail.com> writes:
> > > Here's my theory (and feel free to tell me that I'm full of it)...somehow, a
> > > lot of notifies happened at once, or in a very short period of time, to the
> > > point where the app was still processing notifies when the timer clicked off
> > > another second. The connection (or app, or perl module) never marked those
> > > notifies as being processed, or never updated its timestamp of when it
> > > finished, so when the next notify came around, it tried to reprocess the old
> > > data (or data since the last time it finished), and yet again couldn't
> > > finish. Lather, rinse, repeat. In sum, it might be that trying to call
> > > pg_notifies while processing notifies tickles a race condition and tricks
> > > the connection into thinking its in a bad state.
> >
> > Hmm. Is the app trying to do this processing inside an interrupt
> > service routine (a/k/a signal handler)? If so, and if the ISR can
> > interrupt itself, then you've got a problem because you'll be doing
> > reentrant calls of libpq, which it doesn't support. You can only make
> > that work if the handler blocks further occurrences of its signal until
> > it finishes.
> >
>
> I'm not entirely sure if this answers your question, but here's what I
> found out from the primary maintainer of the app. Note that
> update_reqs is the function calling pg_notifies. If there's more
> information I can provide or another test we can run, please let me
> know.
>
> ------- BEGIN MESSAGE -------
> I just checked and the timer won't interrupt update_reqs, so we'll
> have to look for another solution. Anyway, update_reqs doesn't do
> anything with the database except for checking for a notify, so I
> don't see where it can be interrupted to cause DB problems.
> ------- END MESSAGE -------
>
> I also found out that one notify gets sent per action (not per batch
> of actions), so if n requests get resolved at once, n notifies are
> sent, not 1. In theory this could mitigate this problem, but I don't
> know how easy it is at this point. Still, it doesn't explain how or
> why the client's recv-q isn't getting cleared.
>
> Hope this helps.
>

On our end, we changed the code in the function calling pg_notifies to
use an if statement rather than a while loop. That way it updates only
once per second instead of continuously for as long as there are pending
async notifies.

I looked more closely at the docs for DBD::Pg, and the pg_notifies call
grabs *all* pending async notifies and returns them in a hash, not just
one at a time. So, what was happening before was that if a new notify
came through while the previous notifies were being processed, the code
would reprocess. Lather, rinse, repeat.

I think the program may be waiting on pg_notifies when the timer
interrupts it again, causing the client to call pg_notifies while it is
still waiting for something. I think this is what gets the listening
connection into the bad state.

In theory this change should mitigate the "notify interrupt" behavior on
our end, but, again, why the client's recv-q is filling up is as yet
unexplained.

Peter

P.S. In src/backend/commands/async.c, somewhere between lines 910 and
981 (the set_ps_display calls) is where the code gets interrupted. How
and why, I don't know.
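
For illustration only, the while-to-if change described above can be
sketched as a toy model. This is Python, not the Perl app, and
FakeConnection, poll_all, tick_with_while and tick_with_if are
hypothetical names, not DBD::Pg or libpq calls. It assumes, as the
message states, that one poll returns everything pending at that moment,
and that new notifies can arrive between polls.

```python
from collections import deque


class FakeConnection:
    """Hypothetical stand-in for a listening connection (not a real driver API)."""

    def __init__(self, arrivals):
        # arrivals: batches of notifications that show up between successive polls
        self._arrivals = deque(arrivals)

    def poll_all(self):
        """Return every notification pending right now, or an empty list."""
        return self._arrivals.popleft() if self._arrivals else []


def tick_with_while(conn, refresh):
    """Old behavior: keep refreshing for as long as a poll returns anything."""
    refreshes = 0
    while conn.poll_all():
        refresh()
        refreshes += 1
    return refreshes


def tick_with_if(conn, refresh):
    """New behavior: at most one refresh per timer tick."""
    if conn.poll_all():
        refresh()
        return 1
    return 0
```

With three batches arriving back to back, the while version refreshes
once per batch inside a single timer tick, and the if version refreshes
once and leaves the remaining batches for later ticks. The sketch models
only the polling loop; it says nothing about why the client's recv-q
fills up.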