Thread: Clients disconnect but query still runs
Hi. I noticed that when clients (both psql and pgAdmin) disconnect or cancel, queries are often still running on the server. A few questions:
1) Is there a way to reconnect and get the results?
2) Is there a way to tell postgres to automatically stop all queries when the client who queried them disconnects?
3) Is there a way to see all queries whose clients have disconnected?
4) And finally: Why is this the behavior? Doesn't this keep some very long queries running which drain performance but don't seem to benefit anyone?
Robert James <srobertjames@gmail.com> writes:
> Hi. I noticed that when clients (both psql and pgAdmin) disconnect or
> cancel, queries are often still running on the server. A few questions:

> 1) Is there a way to reconnect and get the results?

No.

> 2) Is there a way to tell postgres to automatically stop all queries when
> the client who queried them disconnects?

No.

> 3) Is there a way to see all queries whose clients have disconnected?

No.

> 4) And finally: Why is this the behavior?

It's not easy to tell whether a client has disconnected (particularly if the network stack is unhelpful, which is depressingly often true). Postgres will cancel a query if it gets told that the connection's been dropped, but it will only discover this when an attempt to output to the client fails. It does not spend cycles looking aside to see if the connection has dropped when it is doing something that doesn't involve output to the client.

If your client code is polite enough to send a cancel request before disconnecting, that should terminate the query reasonably promptly. But just "yanking the plug" doesn't do that.

regards, tom lane
I see - thanks, Tom, for the informative explanation.
In my experience admining high volume servers, I found this to be a major failure pattern: Client tries a query which seems to go on forever (either due to contention or resource exhaustion or some other problem), client gives up / fails / gets shut down or rebooted, yet the database is left hanging, working on the sloooow query, which is probably consuming all of its resources. Perhaps the client restarts and tries again, now making the problem much worse, and the vicious cycle continues until the server is rebooted.
Is there no way to have the OS interrupt the postgres process when a TCP/IP disconnect happens? Or is the OS also in the dark that the TCP/IP connection was dropped? I believe that there is a way to monitor this using TCP/IP keep alives.
Or perhaps Postgres could check once every minute? Either way, in my experience, solving this would be a major boon to high volume servers, at least in the usage patterns I've worked with.
Robert James wrote:
> Is there no way to have the OS interrupt the postgres process
> when a TCP/IP disconnect happens? Or is the OS also in the
> dark that the TCP/IP connection was dropped? I believe that
> there is a way to monitor this using TCP/IP keep alives.
> Or perhaps Postgres could check once every minute? Either
> way, in my experience, solving this would be a major boon to
> high volume servers, at least in the usage patterns I've worked with.

The server machine has no way of knowing that the client died unless the client closes the connection gracefully.

There are server configuration parameters "tcp_keepalives_idle", "tcp_keepalives_interval" and "tcp_keepalives_count" which, when used, will make the operating system check idle connections regularly. They are not supported on all operating systems (only on those whose socket options include TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT). Maybe they can help you.

Yours, Laurenz Albe
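[For the archive: the three server parameters Laurenz mentions map directly onto per-socket options. A minimal sketch of what the server does under the hood, in Python for brevity (Linux-only constants; the values 60/10/5 are illustrative, not defaults):]

```python
import socket

# Enable keepalive probing on a TCP socket, mirroring what the backend
# does when tcp_keepalives_idle/interval/count are set in postgresql.conf.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific socket options; each corresponds to one of the GUCs.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # tcp_keepalives_idle
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # tcp_keepalives_interval
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # tcp_keepalives_count

print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))  # 60
```

With these settings the kernel starts probing after 60 idle seconds and declares the peer dead after 5 unanswered probes 10 seconds apart, with no involvement from the backend process at all.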
Robert James wrote:
> I see - thanks, Tom, for the informative explanation.
> In my experience admining high volume servers, I found this to be a major
> failure pattern: Client tries a query which seems to go on forever (either
> due to contention or resource exhaustion or some other problem), client
> gives up / fails / gets shut down or rebooted

The client should always make its best effort to notify the server if it's disconnecting. How it's done depends on client OS, client program language, etc, but it generally ends up meaning AT LEAST that the client sends a TCP RST to the server to close the client <-> server socket.

I don't know off the top of my head if the server backend will immediately notice an RST on the socket and terminate. If it doesn't, then that's certainly something that'd be desirable.

If the client doesn't send an RST and just "vanishes" then of course the server has no way to know anything's changed. As you say, you'd need to have tcp keepalives in use to find out.

> , yet the database is left
> hanging, working on the sloooow query, which is probably consuming all of
> its resources. Perhaps the client restarts and tries again, now making
> the problem much worse, and the vicious cycle continues until the server
> is rebooted.

The server should never need to be rebooted. What about pg_cancel_backend() ? What about killing the backends with SIGTERM (not SIGKILL, -9) or similar?

-- Craig Ringer
On 2009-07-28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert James <srobertjames@gmail.com> writes:
>> Hi. I noticed that when clients (both psql and pgAdmin) disconnect or
>> cancel, queries are often still running on the server. A few questions:
>> 1) Is there a way to reconnect and get the results?
>
> No.
>
>> 2) Is there a way to tell postgres to automatically stop all queries when
>> the client who queried them disconnects?
>
> No.
>
>> 3) Is there a way to see all queries whose clients have disconnected?
>
> No.
>
>> 4) And finally: Why is this the behavior?
>
> It's not easy to tell whether a client has disconnected (particularly if
> the network stack is unhelpful, which is depressingly often true).
> Postgres will cancel a query if it gets told that the connection's been
> dropped, but it will only discover this when an attempt to output to the
> client fails. It does not spend cycles looking aside to see if the
> connection has dropped when it is doing something that doesn't involve
> output to the client.
>
> If your client code is polite enough to send a cancel request before
> disconnecting, that should terminate the query reasonably promptly.
> But just "yanking the plug" doesn't do that.

Can't we coerce a signal from the network stack? The Linux socket(2) manpage is full of promise (SIGPIPE, SIGURG, SIGIO).
On Wed, Jul 29, 2009 at 1:58 PM, Jasen Betts<jasen@xnet.co.nz> wrote:
> can't coerce a signal from the network stack? the linux socket(2)
> manpage is full of promise (SIGPIPE, SIGURG, SIGIO)

[please don't quote the entire message back, just the part you're responding to]

Well, SIGPIPE is no help since it would only fire if we tried to write to the socket anyway.

SIGIO, on the other hand, looks like exactly what we would need. I'm not sure if it can be set to fire a signal only when the connection is disconnected and not for other state changes, but if so it would be interesting.

SIGURG might be useful, but it would be more complex to use and less widely useful since it would only work if the client disconnects gracefully (though it might be worth checking into as an alternative to our existing query cancel method).

-- greg
http://mit.edu/~gsstark/resume.pdf
On Mon, Jul 27, 2009 at 09:49:04PM -0400, Tom Lane wrote:
> It does not spend cycles looking aside to see if the
> connection has dropped when it is doing something that doesn't involve
> output to the client.

Is this ever an interesting case? It would seem possible for something to test the client connections every once in a while to see if they're still valid. The postmaster seems like a reasonable place to do this to me; it has all the descriptors, it just discards them at the moment.

-- Sam
http://samason.me.uk/
Greg Stark <gsstark@mit.edu> writes:
> On Wed, Jul 29, 2009 at 1:58 PM, Jasen Betts<jasen@xnet.co.nz> wrote:
>> can't coerce a signal from the network stack? the linux socket(2)
>> manpage is full of promise (SIGPIPE, SIGURG, SIGIO)

> SIGIO on the other hand looks like exactly what we would need. I'm not
> sure if it can be set to fire a signal only when the connection is
> disconnected and not for other state changes but if so it would be
> interesting.

And the other question is how much of what you read in the Linux manpage is portable to any other system...

regards, tom lane
On Wed, Jul 29, 2009 at 3:17 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> Greg Stark <gsstark@mit.edu> writes:
>> On Wed, Jul 29, 2009 at 1:58 PM, Jasen Betts<jasen@xnet.co.nz> wrote:
>>> can't coerce a signal from the network stack? the linux socket(2)
>>> manpage is full of promise (SIGPIPE, SIGURG, SIGIO)
>
> And the other question is how much of what you read in the Linux manpage
> is portable to any other system...

That is a question. But actually I think SIGIO might be fairly portable -- at least the first hit I found was for someone complaining that it wasn't working on Linux (due to a bug) and this broke their app, which worked everywhere else.

In any case this would be a feature which, if it didn't work, would leave us just where we are today. That's another advantage over trying to do something with SIGURG, which would be far more likely to cause headaches if it behaved incorrectly.

-- greg
http://mit.edu/~gsstark/resume.pdf
Greg Stark <gsstark@mit.edu> writes:
> That is a question. But actually I think SIGIO might be fairly
> portable -- at least the first hit I found was for someone complaining
> that it wasn't working on Linux (due to a bug) and this broke their
> app which worked everywhere else.

> In any case this would be a feature which if it didn't work would
> leave us just where we are today. That's another advantage over trying
> to do something with SIGURG which would be far more likely to cause
> headaches if it behaved incorrectly.

[ reads man pages for awhile... ]

It looks to me like SIGIO is sent whenever the socket comes ready for either reading or writing, which makes it pretty nearly useless for detecting a broken-connection condition. You'd be too busy filtering out uninteresting signals --- and the signal handler itself can't do very much of that work.

regards, tom lane
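[For the archive: Tom's objection is easy to demonstrate. The sketch below (Python on Linux; an AF_UNIX socketpair stands in for the client connection) requests SIGIO delivery with F_SETOWN + O_ASYNC, and the signal fires for perfectly ordinary incoming data, not just for disconnects:]

```python
import fcntl, os, signal, socket, time

hits = []
signal.signal(signal.SIGIO, lambda signum, frame: hits.append(signum))

server, client = socket.socketpair()

# Ask the kernel to deliver SIGIO to this process for I/O readiness
# changes on the 'server' end of the connection.
fcntl.fcntl(server.fileno(), fcntl.F_SETOWN, os.getpid())
flags = fcntl.fcntl(server.fileno(), fcntl.F_GETFL)
fcntl.fcntl(server.fileno(), fcntl.F_SETFL, flags | os.O_ASYNC)

client.send(b"hello")  # perfectly ordinary traffic, no disconnect at all...
time.sleep(0.2)        # ...yet SIGIO is delivered just the same

print(len(hits))
```

So a backend using this would take a SIGIO for every packet the client sends, and would still have to probe the socket to find out whether anything actually broke, which is exactly the filtering problem Tom describes.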
On Wed, 2009-07-29 at 14:56 +0100, Greg Stark wrote:
> SIGURG might be useful but it would be more complex to use and less
> widely useful since it would only work if the client disconnects
> gracefully (though it might be worth checking into as an alternative
> to our existing query cancel method).

Might it not also fire if the client disconnects without notice, but tcp keepalives are enabled? I might have to write a little test program and see.

[much later] My test program did not appear to receive SIGURG, even after registering for it with fcntl(sockfd, F_SETOWN, ...) and setting a signal handler for it. This was the case whether the connection was dropped due to a tcp keepalive failure, the dropping of a network interface, or a normal disconnect. The next read() or recv() returned zero bytes read, but no asynchronous notification appeared to occur. I'm under the impression it's really for use with asynchronous sockets, but haven't tested this yet.

What does work well is occasionally poking the socket with recv(..., MSG_DONTWAIT) while doing other work. Program attached. TCP keepalives seem to work very well, at least on my Linux test system, and it's easy to test for a dud connection using recv(...) with the MSG_DONTWAIT and (if desired) MSG_PEEK flags. If the connection has exited cleanly it'll return a zero-size read; if the connection has dropped due to keepalive failure it'll return ETIMEDOUT. Pg's backend code already supports keepalives.

I guess what'd be helpful would be a periodic recv(..., MSG_DONTWAIT) on the client<->server socket while the backend is working hard on a query. A SIGALRM would be handy for that, though I guess Pg isn't used to having to test for EINTR on syscalls...

-- Craig Ringer
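[For the archive, since the attachment may not survive: Craig's probe can be sketched in a few lines. Python stand-in for the C program, with a local socketpair in place of a real client connection; `peer_is_alive` is an illustrative name, not anything from the backend:]

```python
import errno, socket

def peer_is_alive(sock):
    """Probe the connection without blocking and without consuming data."""
    try:
        data = sock.recv(1, socket.MSG_DONTWAIT | socket.MSG_PEEK)
    except BlockingIOError:
        return True          # nothing to read, but the peer is still there
    except OSError as e:
        if e.errno == errno.ETIMEDOUT:
            return False     # keepalive probes went unanswered
        raise
    return data != b""       # b"" means the peer closed the connection cleanly

server, client = socket.socketpair()
print(peer_is_alive(server))  # True: connection open, no data pending
client.close()
print(peer_is_alive(server))  # False: zero-byte read, the peer has gone away
```

MSG_PEEK matters: without it the probe would eat bytes the query-processing code expects to read later.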
> Well SIGPIPE is no help since it would only fire if we tried to write
> to the socket anyways.

Right. For this purpose, pgpool sends a param packet to the client periodically while waiting for a reply from the backend, to detect if the connection to the client is broken. If it's broken, pgpool sends a cancel packet to the backend so as not to waste the backend machine's CPU cycles.

-- Tatsuo Ishii
SRA OSS, Inc. Japan
On Thu, Jul 30, 2009 at 8:41 AM, Tatsuo Ishii<ishii@postgresql.org> wrote:
>> Well SIGPIPE is no help since it would only fire if we tried to write
>> to the socket anyways.
>
> Right. For this purpose, pgpool sends a param packet to the client
> periodically while waiting for a reply from the backend, to detect if the
> connection to the client is broken. If it's broken, pgpool sends a cancel
> packet to the backend so as not to waste the backend machine's CPU cycles.

The downside to this is that it will cause spurious failures for transient network failures, even if the network comes back before it's actually needed.

-- greg
http://mit.edu/~gsstark/resume.pdf
On Thu, Jul 30, 2009 at 7:43 AM, Craig Ringer<craig@postnewspapers.com.au> wrote:
> On Wed, 2009-07-29 at 14:56 +0100, Greg Stark wrote:
>> SIGURG might be useful but it would be more complex to use and less
>> widely useful since it would only work if the client disconnects
>> gracefully (though it might be worth checking into as an alternative
>> to our existing query cancel method).
>
> Might it not also fire if the client disconnects without notice, but tcp
> keepalives are enabled?
>
> I might have to write a little test program and see.
>
> [much later] My test program did not appear to receive SIGURG, even
> after registering for it with fcntl(sockfd, F_SETOWN, ...) and setting a
> signal handler for it. This was the case whether the connection was
> dropped due to a tcp keepalive failure, the dropping of a network
> interface, or a normal disconnect. The next read() or recv() returned
> zero bytes read but no asynchronous notification appeared to occur. I'm
> under the impression it's really for use with asynchronous sockets, but
> haven't tested this yet.

Right, you'll only get SIGURG if there's actually any urgent data received. The client would have to actively send such data periodically. That would make this a portability headache, since it wouldn't just be an add-on which would fail gracefully if it's unsupported. The server and client would both have to be sure they understood whether they both supported this feature.

> What does work well is occasionally poking the socket with recv(...,
> MSG_DONTWAIT) while doing other work. Program attached. TCP keepalives
> seem to work very well at least on my Linux test system, and it's easy
> to test for a dud connection using recv(...) with the MSG_DONTWAIT and
> (if desired) MSG_PEEK flags. If the connection has exited cleanly it'll
> return a zero-size read; if the connection has dropped due to keepalive
> failure it'll return ETIMEDOUT.

The problem with this is that it introduces spurious failures for transient network failures. Also, it requires the server to periodically take time out from processing the query to do this. I think we want a zero-cost method which will interrupt processing if the client actively disconnects. If there's a network failure we'll find out about it in the normal course of events.

-- greg
http://mit.edu/~gsstark/resume.pdf
On Thu, Jul 30, 2009 at 10:27 AM, Csaba Nagy<nagy@ecircle-ag.com> wrote:
> Sorry, I have to disagree here. If there's a spurious network error, you
> usually have bigger problems. I prefer to have the connection killed
> even if the network recovers

I know this is a popular feeling. But you're throwing away decades of work in making TCP reliable. You would change feelings quickly if you ever faced this scenario too. All it takes is some bad memory or a bad wire, and you would be turning a performance drain into random connection drops.

> than risk an idle-in-transaction connection
> living forever when the client/network crashes for any reason. In case
> of network failure the connection will probably be cleaned up eventually,
> but it did happen to me that a client machine crashed in the middle of a
> transaction while not executing any SQL, and that connection stayed
> until I killed it manually.

Well, it ought to have eventually died. Your patience may have run out before the keep-alive timeouts fired, though.

-- greg
http://mit.edu/~gsstark/resume.pdf
Hi all,

On Thu, 2009-07-30 at 11:02 +0200, Greg Stark wrote:
> On Thu, Jul 30, 2009 at 7:43 AM, Craig Ringer<craig@postnewspapers.com.au> wrote:
>> What does work well is occasionally poking the socket with recv(...,
>> MSG_DONTWAIT) while doing other work. Program attached. TCP keepalives
>> seem to work very well at least on my Linux test system, and it's easy
>> to test for a dud connection using recv(...) with the MSG_DONTWAIT and
>> (if desired) MSG_PEEK flags. If the connection has exited cleanly it'll
>> return a zero-size read; if the connection has dropped due to keepalive
>> failure it'll return ETIMEDOUT.
>
> The problem with this is that it introduces spurious failures for
> transient network failures. Also it requires the server to
> periodically take time out from processing the query to do this. I
> think we want a zero-cost method which will interrupt processing if
> the client actively disconnects. If there's a network failure we'll
> find out about it in the normal course of events.

Sorry, I have to disagree here. If there's a spurious network error, you usually have bigger problems. I prefer to have the connection killed even if the network recovers than risk an idle-in-transaction connection living forever when the client/network crashes for any reason. In case of network failure the connection will probably be cleaned up eventually, but it did happen to me that a client machine crashed in the middle of a transaction while not executing any SQL, and that connection stayed until I killed it manually. A simple ping to the client would have cleared up the fact that the client is not there anymore.

I would also be happy to pay the cost of pinging the clients, say, once a minute (or a configurable interval). Considering that the connections are one to one with a client, it's enough to have a single timer which periodically signals each backend to ping its client, but these are implementation details for which I have no clue what would be best. The main thing is: I would love to have this functionality. It's extremely hard to secure all clients against crash, and a crash of one of the clients in the middle of a transaction can have very bad consequences (think indefinitely stuck open transaction).

Cheers,
Csaba.
On Thu, 2009-07-30 at 11:41 +0200, Greg Stark wrote:
> I know this is a popular feeling. But you're throwing away decades of
> work in making TCP reliable. You would change feelings quickly if you
> ever faced this scenario too. All it takes is some bad memory or a bad
> wire and you would be turning a performance drain into random
> connection drops.

But if I get bad memory or a bad wire I'll get much worse problems already, and don't tell me it will work more reliably if you don't kill the connection. It's a lot better to find out sooner that you have those problems and fix them than to have spurious errors, which you'll get even if you don't kill the connection in case of such problems.

> Well it ought to have eventually died. Your patience may have run out
> before the keep-alive timeouts fired though.

Well, it lived for at least one hour (could be more, I don't remember for sure), keeping vacuum from doing its job on a heavily updated DB. It was not so much about my patience as about starting to have abysmal performance, AFTER we fixed the initial cause of the crash, and without any warning. Of course I did find out immediately that bloat happens and found the idle transactions and killed them, but I imagine the hair-pulling for a less experienced postgres DBA. I would have also preferred that postgres solve this issue on its own - the network stack is clearly not fast enough in resolving it.

Cheers,
Csaba.
On Thu, Jul 30, 2009 at 10:59 AM, Csaba Nagy<nagy@ecircle-ag.com> wrote:
> But if I get bad memory or a bad wire I'll get much worse problems
> already, and don't tell me it will work more reliably if you don't kill
> the connection. It's a lot better to find out sooner that you have those
> problems and fix them than to have spurious errors, which you'll get even
> if you don't kill the connection in case of such problems.

Are you sure? Do you know how many times you haven't even found out you had a problem because TCP just silently kept working despite the problem?

Having had to use protocols which imposed their own timeouts on lame hotel networks, buggy wireless drivers, and bad DSL connections, and found my connections dying every few minutes, I can say it's maddeningly frustrating. Especially knowing that TCP was *supposed* to work in this scenario and they had broken it by trying to be clever.

> Well, it lived for at least one hour (could be more, I don't remember for
> sure), keeping vacuum from doing its job on a heavily updated DB. It was
> not so much about my patience as about starting to have abysmal
> performance, AFTER we fixed the initial cause of the crash, and without
> any warning. [...] I would have also
> preferred that postgres solve this issue on its own - the network
> stack is clearly not fast enough in resolving it.

Indeed, TCP keepalives left at their defaults don't time out for over 2 hours. But that's configurable in postgresql.conf.

-- greg
http://mit.edu/~gsstark/resume.pdf
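[For the archive: the "over 2 hours" figure follows directly from the usual Linux kernel defaults. Worst-case detection time is idle time plus interval times probe count (values below are the common Linux defaults; other systems differ):]

```python
# Common Linux defaults (see /proc/sys/net/ipv4/tcp_keepalive_*):
tcp_keepalive_time = 7200    # seconds idle before the first probe is sent
tcp_keepalive_intvl = 75     # seconds between unanswered probes
tcp_keepalive_probes = 9     # unanswered probes before declaring the peer dead

worst_case = tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
print(worst_case, round(worst_case / 3600, 2))   # 7875 seconds, ~2.19 hours
```

Dropping tcp_keepalives_idle to 60 and tcp_keepalives_interval to 10 with 5 probes brings the same worst case down to under two minutes, at the cost of a little probe traffic on idle connections.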
Greg Stark wrote:
> Right, you'll only get SIGURG if there's actually any urgent data
> received. The client would have to actively send such data
> periodically. That would make this a portability headache since it
> wouldn't just be an add-on which would fail gracefully if it's
> unsupported.

It'd also have the same issue as relying on keepalives - ie transient network drop-outs would be much more likely to cause an unnecessary query cancel. Worse, the client wouldn't even know about it, because it was unreachable at the time the server sent it a RST, so it'd be waiting for an answer from the server that'd never come...

>> What does work well is occasionally poking the socket with recv(...,
>> MSG_DONTWAIT) while doing other work.
>
> The problem with this is that it introduces spurious failures for
> transient network failures.

Yep, and often failures where only one side notices (unless _both_ sides are relying on keepalives and are checking the connection status). Ick.

> Also it requires the server to
> periodically take time out from processing the query to do this.

This aspect I'm not too bothered about. I doubt it'd cost anything detectable if done a few times a minute - unless it required restructuring of query processing to accommodate it. In that case, no way.

> I think we want a zero-cost method which will interrupt processing if
> the client actively disconnects. If there's a network failure we'll
> find out about it in the normal course of events.

Personally, I'm with you. I think the _real_ problem here is clients that're just giving up and vanishing without issuing a query cancel or, apparently, even closing the connection. It's _not_ hard to time out on a query and use another connection to cancel the backend. If you've set a server-side timeout, of course, there's no need to do anything, but if you're timing out client-side or your user has cancelled the request, it's still not exactly hard to clean up after yourself.

It was perhaps twenty minutes' work to implement a generic query cancel command with Java/JDBC - just spawn another thread or grab one from a worker pool, open a new Pg connection, issue the backend cancel, and close the connection. In the thread that just had its backend cancelled you receive an informative SQLException.

In fact, I'm not even sure _how_ one goes about exiting without sending an RST. A quick check shows that when I `kill -9' a process with an open client socket (ssh, in this case) the OS sends a FIN, and responds to the server's FIN,ACK with its own ACK. So the OS is closing the socket for the dead process. If I try this with a `psql' process, the server cleans up the orphaned backend promptly.

So, barring network breaks (wifi down / out of range, ethernet cable fell out, etc etc), how is the OP managing to leave backends running queries? Hard-resetting the machine?

-- Craig Ringer
Csaba Nagy wrote:
> On Thu, 2009-07-30 at 11:41 +0200, Greg Stark wrote:
>> I know this is a popular feeling. But you're throwing away decades of
>> work in making TCP reliable. You would change feelings quickly if you
>> ever faced this scenario too. All it takes is some bad memory or a bad
>> wire and you would be turning a performance drain into random
>> connection drops.
>
> But if I get bad memory or a bad wire I'll get much worse problems
> already, and don't tell me it will work more reliably if you don't kill
> the connection. It's a lot better to find out sooner that you have those
> problems and fix them than to have spurious errors, which you'll get even
> if you don't kill the connection in case of such problems.

Transient connection issues are not infrequent, and shouldn't promptly kill connections. A user's wifi might drop and then re-establish service. They might bump the Ethernet cable out (and it's inevitably lost its retaining clip). A router _somewhere_ along the route might reboot. Etc.

That said, TCP keepalives are designed to allow for this, and only consider the connection dead if it's failed to respond for a reasonable period and hasn't acknowledged several requests.

> Well, it lived for at least one hour (could be more, I don't remember for
> sure), keeping vacuum from doing its job on a heavily updated DB.

Unless you've changed the defaults, TCP keepalives will take several hours to notice a dead connection - if they're enabled at all.

> It was not so much about my patience as about starting to have abysmal
> performance, AFTER we fixed the initial cause of the crash, and without
> any warning. Of course I did find out immediately that bloat
> happens and found the idle transactions

Idle? I thought your issue was _active_ queries running, servicing requests from clients that'd since ceased to care? How did you manage to kill the client in such a way that the OS on the client didn't send a FIN to the server anyway? Hard-reset the client machine(s)?

> and killed them, but I imagine
> the hair-pulling for a less experienced postgres DBA. I would have also
> preferred that postgres solve this issue on its own - the network
> stack is clearly not fast enough in resolving it.

It's not really meant to happen in the first place. I do think that if you have a lot of connections from unreliable machines (say hosts with intermittent connectivity) then you'd want to make sure tcp keepalives are active and that you've tuned the keepalive params to be much more aggressive.

I thought your issue was the backend not terminating a query when the client died while the backend was in the middle of a long-running query. Keepalives alone won't solve that one.

-- Craig Ringer
On Thu, 2009-07-30 at 13:22 +0200, Craig Ringer wrote:
> So, barring network breaks (wifi down / out of range, ethernet cable
> fell out, etc etc), how is the OP managing to leave backends running
> queries? Hard-resetting the machine?

It happened to us when a client box ran out of memory and started swapping, up to the point where it was inaccessible even for console login. The connections of that machine were still live but unusable, as the client box will never get out of that state until hard resetting... which we would promptly do, but the connections on the postgres side would still live on.

I will probably have to check out now the network connection parameters in the postgres configuration, never had a look at them before... in any case, the >2 hours mentioned in an earlier post seems a bad default to me.

Cheers,
Csaba.
Csaba Nagy wrote:
> A simple ping to the client would have
> cleared up the fact that the client is not there anymore.

Yep. It'd also stop PostgreSQL working for clients with software firewalls, since most of them drop ICMP ECHO ("ping"). TCP keepalives are designed to do the same thing, but do it reliably and properly. Why not configure your tcp keepalive intervals instead?

Note that tcp keepalives won't kill the backend if it's in the middle of working on an I/O or CPU intensive query. It'll die when it tries to send the client data, or tries to read something more from the client. One trick you can use if you're really worried about this is `RAISE NOTICE' periodically during a big complicated PL/PgSQL procedure. The NOTICE should be sent to the client immediately, which will cause Pg to notice the socket is dead. It won't help you with long running normal SQL queries, though.

I do think it'd be kind of nice to occasionally poke the connection to see if it's alive while executing a query. Doing it in a SIGALRM handler, so as not to require the query execution path to be scattered with "checkConnStatus()" or whatever calls, might be one approach. It'd have _very_ minimal cost, would only cost anything while the backend was working, and would help avoid wasted work when a client vanishes. When the backend is idle (including "idle in transaction") it can disable any timer. It'll be poll()ing or select()ing or read()ing or whatever on the socket and will immediately notice if it breaks.

> I would also be
> happy to pay the cost of pinging the clients, say, once a minute
> (or a configurable interval). Considering that the connections are one to
> one with a client, it's enough to have a single timer which periodically
> signals each backend to ping its client, but these are implementation
> details for which I have no clue what would be best,

That applies in your situation, but for many others that's not the case. Not that it matters a great deal.

> the main thing
> is: I would love to have this functionality. It's extremely hard to
> secure all clients against crash, and a crash of one of the clients in
> the middle of a transaction can have very bad consequences (think
> indefinitely stuck open transaction).

Nope. Just tune your keepalives if you have hopelessly flakey clients.

Even if the client _program_ crashes, though, you shouldn't have anything left lying around. It's only if the client _OS_ crashes or the machine is hard-reset that you should be left with a backend lying around until tcp keepalives notice.

-- Craig Ringer
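[For the archive: Craig's SIGALRM idea can be sketched end to end. A periodic timer fires while the "query" runs and its handler probes the client socket with a non-blocking peek; when the peek returns zero bytes, the work loop is interrupted. Python stand-in on Linux (a real backend would do this in C on the client socket); `ClientGone` and `check_client` are illustrative names:]

```python
import signal, socket

server, client = socket.socketpair()
client.close()               # simulate the client vanishing mid-query

class ClientGone(Exception):
    pass

def check_client(signum, frame):
    # Peek at the socket without blocking or consuming data; a zero-byte
    # read means the client closed the connection, so stop wasting work.
    try:
        if server.recv(1, socket.MSG_DONTWAIT | socket.MSG_PEEK) == b"":
            signal.setitimer(signal.ITIMER_REAL, 0, 0)  # stop further checks
            raise ClientGone("client went away")
    except BlockingIOError:
        pass                 # no data pending, connection still alive

signal.signal(signal.SIGALRM, check_client)
signal.setitimer(signal.ITIMER_REAL, 0.05, 0.05)   # probe every 50 ms

try:
    while True:              # stand-in for a long-running query
        sum(range(10000))
except ClientGone as e:
    print("query cancelled:", e)
finally:
    signal.setitimer(signal.ITIMER_REAL, 0, 0)     # make sure the timer is off
```

The cost while the connection is healthy is one failed non-blocking recv() per timer tick; the timer is simply disabled when the backend goes idle.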
Csaba Nagy wrote:
> It happened to us when a client box ran out of memory and started
> swapping, up to the point where it was inaccessible even for console
> login. The connections from that machine were still live but unusable,
> as the client box would never get out of that state until a hard
> reset... which we would promptly do

You know, you can prevent that on any sane operating system by setting
resource limits on at-risk processes, limiting total available swap to a
reasonable amount, etc. On Linux you may want to turn off memory
overcommit or dramatically reduce how much swap you allocate.

A host with a runaway process hogging memory shouldn't be dying. It
should be killing off the problem process, or the problem process should
be dying by itself after failing to allocate the memory it requested. If
this isn't the case, YOU HAVE TOO MUCH SWAP. After all, swap is useless
if there's so much of it that using it brings the system to a halt.

> I will probably have to check out the network connection parameters in
> the postgres configuration now; I never had a look at them before...
> in any case, the >2 hours mentioned in an earlier post seems a bad
> default to me.

It's the OS's default. PostgreSQL just doesn't change it.

--
Craig Ringer
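[Editorial note: the Linux overcommit advice above corresponds to sysctl settings along these lines — a sketch only; the ratio value is an example, not a recommendation from the thread:]

```
# /etc/sysctl.conf -- fail allocations instead of overcommitting into swap hell
vm.overcommit_memory = 2    # strict accounting: no heuristic overcommit
vm.overcommit_ratio = 80    # commit limit = swap + 80% of RAM
```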
On Thu, 2009-07-30 at 13:40 +0200, Craig Ringer wrote:
> > A simple ping to the client would have cleared the fact that the
> > client is not there anymore.
>
> Yep. It'd also stop PostgreSQL working for clients with software
> firewalls, since most of them drop ICMP ECHO ("ping").

I didn't mean an ICMP 'ping', but a higher-level one...

> TCP keepalives are designed to do the same thing, but do it reliably
> and properly. Why not configure your tcp keepalive intervals instead?

Will do. Normally we have good networking, so I never had to touch them
before (and have no experience with network problems anyway)...

> > the main thing is: I would love to have this functionality. It's
> > extremely hard to secure all clients against crashes, and a crash of
> > one of the clients in the middle of a transaction can have very bad
> > consequences (think of an indefinitely stuck open transaction).
>
> Nope. Just tune your keepalives if you have hopelessly flaky clients.

On the contrary, we have very stable networking here; the problem was
never a networking one...

> Even if the client _program_ crashes, though, you shouldn't have
> anything left lying around. It's only if the client _OS_ crashes or
> the machine is hard-reset that you should be left with a backend lying
> around until tcp keepalives notice.

As explained in an earlier email, the client box's OS went down in swap
hell.

Cheers,
Csaba.
[Just to make things clear: I'm not the one who brought up this
discussion, only someone who was also bitten once by zombie connections.]

On Thu, 2009-07-30 at 13:29 +0200, Craig Ringer wrote:
> Idle? I thought your issue was _active_ queries running, servicing
> requests from clients that'd since ceased to care?

No, the problem was zombie connections idle in transaction.

> How did you manage to kill the client in such a way that the OS on the
> client didn't send a FIN to the server anyway? Hard-reset the client
> machine(s)?

It was a runaway client box in swapping hell. It had plenty of open
connections, most of them idle in transaction (because it started to
take veeery long to do its processing between the queries).

> I thought your issue was the backend not terminating a query when the
> client died while the backend was in the middle of a long-running
> query. Keepalives alone won't solve that one.

No, that is not so dangerous for us; we usually chunk the queries coming
from applications so that they can't run for too long at all. But it
seems that if I lower the network timeout, then even the crashed-client
scenario could be OK. I will have to test that...

Cheers,
Csaba.
On Thu, Jul 30, 2009 at 12:22 PM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
> In fact, I'm not even sure _how_ one goes about exiting without
> sending an RST. A quick check shows that when I `kill -9' a process
> with an open client socket (ssh, in this case) the OS sends a FIN, and
> responds to the server's FIN,ACK with its own ACK. So the OS is
> closing the socket for the dead process. If I try this with a `psql'
> process, the server cleans up the orphaned backend promptly.

Fwiw this description sounds like you're leaving out part of the
picture. TCP connections consist of two independent flows, one in each
direction. Each flow can be closed by a FIN independently. If the client
program dies and the OS sends a FIN to close the client->server flow,
that doesn't imply the server will close the server->client flow.

For the situation you describe to arise, the server would have to notice
the EOF on the client connection and explicitly call close. That happens
if the connection is idle or idle-in-transaction. If the server is in
the middle of running a query, it won't notice anything until it tries
to write to the socket and gets an RST in response, because the
listening process has died.

So there are really two different cases here. Case 1: the server is idle
and reading from a connection, and the client has disappeared completely
without closing it. In that case keepalives, or having the server try to
send data, are the only ways to notice the problem. Case 2: the server
is busy, and the client has either cleanly closed the connection or
died, but the server doesn't notice for a long time because it isn't
sending any data and isn't trying to read any either.

--
greg
http://mit.edu/~gsstark/resume.pdf
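[Editorial note: the two-independent-flows point above is easy to demonstrate with a connected socket pair — a sketch, with a Unix-domain socketpair standing in for the TCP link; its shutdown semantics mirror TCP's half-close:]

```python
import socket

server, client = socket.socketpair()  # connected pair standing in for a TCP link

client.shutdown(socket.SHUT_WR)       # client closes its outgoing flow (the "FIN")

eof = server.recv(16)                 # server sees EOF: the client->server flow is gone
server.sendall(b"result row")         # ...but server->client is still open, so a busy
reply = client.recv(16)               # backend that only writes sees nothing wrong yet
```

`eof` comes back empty (the read side is closed), yet the server's write and the client's read still succeed.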
[This is getting off-topic.]

On Thu, 2009-07-30 at 13:44 +0200, Craig Ringer wrote:
> A host with a runaway process hogging memory shouldn't be dying. It
> should really be killing off the problem process, or the problem
> process should be dying by itself after failing to allocate requested
> memory. If this isn't the case, YOU HAVE TOO MUCH SWAP.
>
> After all, swap is useless if there's so much that using it brings the
> system to a halt.

In theory you're right; in practice I can't control any of this. Those
are the client boxes, and I control only the DB. The most I can do is
kindly ask the colleagues in charge of them to make sure it won't happen
again, and even then there will still be cases like a virtual machine
simply crashing.

> > I will probably have to check out the network connection parameters
> > in the postgres configuration now; I never had a look at them
> > before... in any case, the >2 hours mentioned in an earlier post
> > seems a bad default to me.
>
> It's the OS's default. PostgreSQL just doesn't change it.

Well, then it looks like I will have to learn a bit about TCP keepalive
and how Linux handles it...

Cheers,
Csaba.
Craig Ringer <craig@postnewspapers.com.au> writes:
> Greg Stark wrote:
>> Also it requires the server to periodically take time out from
>> processing the query to do this.

> This aspect I'm not too bothered about. I doubt it'd cost anything
> detectable if done a few times a minute - unless it required
> restructuring of query processing to accommodate it. In that case, no
> way.

Well, it would. Your suggestion of doing it in a signal handler is a
nonstarter --- signal handlers can't really do anything very useful, for
fear of messing up whatever processing they interrupted.

The earlier part of the discussion was focused on getting the kernel to
actively tell us when the connection had dropped. That would be workable
if we found a way to request it, but I think we've run out of options
:-(

regards, tom lane
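[Editorial note: the "poke the connection" idea being debated can be sketched in user space with a zero-timeout select() plus a MSG_PEEK read — an illustration of the concept only, not how the backend works; `connection_alive` is a hypothetical helper, and it would mistake pending client data for liveness:]

```python
import select
import socket

def connection_alive(sock):
    """Cheap best-effort check for a vanished peer."""
    readable, _, _ = select.select([sock], [], [], 0)  # zero timeout: never blocks
    if not readable:
        return True                       # nothing pending; assume the peer is alive
    # Readable while we expect no traffic: a zero-byte peek means EOF.
    return sock.recv(1, socket.MSG_PEEK) != b""

server, client = socket.socketpair()
alive_before = connection_alive(server)   # client still connected
client.close()                            # client goes away
alive_after = connection_alive(server)    # EOF is now pending on the server side
```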
On Thu, Jul 30, 2009 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The earlier part of the discussion was focused on getting the kernel
> to actively tell us when the connection had dropped. That would be
> workable if we found a way to request it, but I think we've run out of
> options :-(

Yeah, everything I've found on SIGIO indicates it would fire whenever
the socket was readable, which isn't what we want. Does anyone have a
pointer to full documentation on how to use it?

--
greg
http://mit.edu/~gsstark/resume.pdf
> > Right. For this purpose, pgpool sends a param packet to the client
> > periodically while waiting for a reply from the backend, to detect
> > whether the connection to the client is broken. If it is broken,
> > pgpool sends a cancel packet to the backend so as not to waste the
> > backend machine's CPU cycles.
>
> The downside to this is that it will cause spurious failures for
> transient network failures, even if the network comes back before it's
> actually needed.

Can't we expect such a transient error to be fixed by the underlying
TCP/IP stack?

--
Tatsuo Ishii
SRA OSS, Inc. Japan
On 27 Jul, 21:49, t...@sss.pgh.pa.us (Tom Lane) wrote:
> Robert James <srobertja...@gmail.com> writes:
> > Hi. I noticed that when clients (both psql and pgAdmin) disconnect
> > or cancel, queries are often still running on the server. A few
> > questions:
> > 1) Is there a way to reconnect and get the results?
>
> No.
>
> > 2) Is there a way to tell postgres to automatically stop all queries
> > when the client who queried them disconnects?
>
> No.
>
> > 3) Is there a way to see all queries whose clients have
> > disconnected?
>
> No.

OK then. Disconnection detection is hard.

Now I'm in a situation where I have many "IDLE in transaction"
connections. I spotted them myself. I could kill them, but that wouldn't
prevent them from happening again; in fact, it happens regularly.

Is there a way to get the list of SQL statements that were previously
executed as part of a given transaction? If I could get that, it would
be a hell of a lot easier to figure out the misbehaving client.

Thanks for any pointers,
Nicolas
On Thu, Aug 06, 2009 at 07:23:41AM -0700, Nicolas wrote:
> Now I'm in a situation where I have many "IDLE in transaction"
> connections. I spotted them myself. I could kill them, but that
> wouldn't prevent them from happening again; in fact, it happens
> regularly.
>
> Is there a way to get the list of SQL statements that were previously
> executed as part of a given transaction?

I'd play with logging: set your "log_line_prefix"[1] to include the
process id, and set "log_statement"[2] to "all". It's then a simple
matter of searching back through the logs to find out what was going on.

--
Sam  http://samason.me.uk/

[1] http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-LINE-PREFIX
[2] http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-STATEMENT
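[Editorial note: the logging suggestion above corresponds to a couple of lines in postgresql.conf — a sketch; the prefix format is an example, but %p and %t are the documented escapes for process ID and timestamp:]

```
log_line_prefix = '%t [%p] '   # timestamp and backend PID on every log line
log_statement = 'all'          # log every statement, so a stuck PID's history is greppable
```

Both settings take effect on a configuration reload; no server restart is needed.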