Thread: Clients disconnect but query still runs
Hi. I noticed that when clients (both psql and pgAdmin) disconnect or cancel, queries are often still running on the server. A few questions:
1) Is there a way to reconnect and get the results?
2) Is there a way to tell postgres to automatically stop all queries when the client who queried them disconnects?
3) Is there a way to see all queries whose clients have disconnected?
4) And finally: Why is this the behavior? Doesn't this keep some very long queries running which drain performance but don't seem to benefit anyone?
Robert James <srobertjames@gmail.com> writes:
> Hi. I noticed that when clients (both psql and pgAdmin) disconnect or
> cancel, queries are often still running on the server. A few questions:

> 1) Is there a way to reconnect and get the results?

No.

> 2) Is there a way to tell postgres to automatically stop all queries when
> the client who queried them disconnects?

No.

> 3) Is there a way to see all queries whose clients have disconnected?

No.

> 4) And finally: Why is this the behavior?

It's not easy to tell whether a client has disconnected (particularly if the network stack is unhelpful, which is depressingly often true). Postgres will cancel a query if it gets told that the connection's been dropped, but it will only discover this when an attempt to output to the client fails. It does not spend cycles looking aside to see if the connection has dropped when it is doing something that doesn't involve output to the client.

If your client code is polite enough to send a cancel request before disconnecting, that should terminate the query reasonably promptly. But just "yanking the plug" doesn't do that.

regards, tom lane
I see - thanks, Tom, for the informative explanation.
In my experience admining high volume servers, I found this to be a major failure pattern: Client tries a query which seems to go on forever (either due to contention or resource exhaustion or some other problem), client gives up / fails / gets shut down or rebooted, yet the database is left hanging, working on the sloooow query, which is probably consuming all of its resources. Perhaps the client restarts and tries again, now making the problem much worse, and the vicious cycle continues until the server is rebooted.
Is there no way to have the OS interrupt the postgres process when a TCP/IP disconnect happens? Or is the OS also in the dark that the TCP/IP connection was dropped? I believe that there is a way to monitor this using TCP/IP keep alives.
Or perhaps Postgres could check once every minute? Either way, in my experience, solving this would be a major boon to high volume servers, at least in the usage patterns I've worked with.
Robert James wrote:
> Is there no way to have the OS interrupt the postgres process
> when a TCP/IP disconnect happens? Or is the OS also in the
> dark that the TCP/IP connection was dropped? I believe that
> there is a way to monitor this using TCP/IP keep alives.
> Or perhaps Postgres could check once every minute? Either
> way, in my experience, solving this would be a major boon to
> high volume servers, at least in the usage patterns I've worked with.

The server machine has no way of knowing that the client died unless the client closes the connection gracefully.

There are server configuration parameters "tcp_keepalives_idle", "tcp_keepalives_interval" and "tcp_keepalives_count" which, when used, will make the operating system check idle connections regularly. They are not supported on all operating systems (only on those whose socket options include TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT). Maybe they can help you.

Yours, Laurenz Albe
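[For the archive: the three server parameters Laurenz mentions map directly onto per-socket options. A minimal sketch of what the server does under the hood, in Python for brevity (Linux-only constants; the values 60/10/5 are illustrative, not defaults):]

```python
import socket

# Enable keepalive probing on a TCP socket, mirroring what the backend
# does when tcp_keepalives_idle/interval/count are set in postgresql.conf.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific socket options; each corresponds to one of the GUCs.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # tcp_keepalives_idle
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # tcp_keepalives_interval
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # tcp_keepalives_count

print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))  # 60
```

With these settings the kernel starts probing after 60 idle seconds and declares the peer dead after 5 unanswered probes 10 seconds apart, with no involvement from the backend process at all.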
Robert James wrote:
> I see - thanks, Tom, for the informative explanation.
> In my experience admining high volume servers, I found this to be a major
> failure pattern: Client tries a query which seems to go on forever (either
> due to contention or resource exhaustion or some other problem), client
> gives up / fails / gets shut down or rebooted

The client should always make its best effort to notify the server if it's disconnecting. How it's done depends on client OS, client program language, etc, but it generally ends up meaning AT LEAST that the client sends a TCP RST to the server to close the client <-> server socket.

I don't know off the top of my head if the server backend will immediately notice an RST on the socket and terminate. If it doesn't, then that's certainly something that'd be desirable.

If the client doesn't send an RST and just "vanishes" then of course the server has no way to know anything's changed. As you say, you'd need to have tcp keepalives in use to find out.

> , yet the database is left
> hanging, working on the sloooow query, which is probably consuming all of
> its resources. Perhaps the client restarts and tries again, now making
> the problem much worse, and the vicious cycle continues until the server
> is rebooted.

The server should never need to be rebooted. What about pg_cancel_backend() ? What about killing the backends with SIGTERM (not SIGKILL, -9) or similar?

-- Craig Ringer
On 2009-07-28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert James <srobertjames@gmail.com> writes:
>> Hi. I noticed that when clients (both psql and pgAdmin) disconnect or
>> cancel, queries are often still running on the server. A few questions:
>> 1) Is there a way to reconnect and get the results?
>
> No.
>
>> 2) Is there a way to tell postgres to automatically stop all queries when
>> the client who queried them disconnects?
>
> No.
>
>> 3) Is there a way to see all queries whose clients have disconnected?
>
> No.
>
>> 4) And finally: Why is this the behavior?
>
> It's not easy to tell whether a client has disconnected (particularly if
> the network stack is unhelpful, which is depressingly often true).
> Postgres will cancel a query if it gets told that the connection's been
> dropped, but it will only discover this when an attempt to output to the
> client fails. It does not spend cycles looking aside to see if the
> connection has dropped when it is doing something that doesn't involve
> output to the client.
>
> If your client code is polite enough to send a cancel request before
> disconnecting, that should terminate the query reasonably promptly.
> But just "yanking the plug" doesn't do that.

Can't we coerce a signal from the network stack? The Linux socket(2) manpage is full of promise (SIGPIPE, SIGURG, SIGIO).
On Wed, Jul 29, 2009 at 1:58 PM, Jasen Betts<jasen@xnet.co.nz> wrote:
> can't coerce a signal from the network stack? the linux socket(2)
> manpage is full of promise (SIGPIPE, SIGURG, SIGIO)

[please don't quote the entire message back, just the part you're responding to]

Well, SIGPIPE is no help since it would only fire if we tried to write to the socket anyway.

SIGIO, on the other hand, looks like exactly what we would need. I'm not sure if it can be set to fire a signal only when the connection is disconnected and not for other state changes, but if so it would be interesting.

SIGURG might be useful, but it would be more complex to use and less widely useful since it would only work if the client disconnects gracefully (though it might be worth checking into as an alternative to our existing query cancel method).

-- greg
http://mit.edu/~gsstark/resume.pdf
On Mon, Jul 27, 2009 at 09:49:04PM -0400, Tom Lane wrote:
> It does not spend cycles looking aside to see if the
> connection has dropped when it is doing something that doesn't involve
> output to the client.

Is this ever an interesting case? It would seem possible for something to test the client connections every once in a while to see if they're still valid. The postmaster seems like a reasonable place to do this to me; it has all the descriptors, it just discards them at the moment.

-- Sam
http://samason.me.uk/
Greg Stark <gsstark@mit.edu> writes:
> On Wed, Jul 29, 2009 at 1:58 PM, Jasen Betts<jasen@xnet.co.nz> wrote:
>> can't coerce a signal from the network stack? the linux socket(2)
>> manpage is full of promise (SIGPIPE, SIGURG, SIGIO)

> SIGIO on the other hand looks like exactly what we would need. I'm not
> sure if it can be set to fire a signal only when the connection is
> disconnected and not for other state changes but if so it would be
> interesting.

And the other question is how much of what you read in the Linux manpage is portable to any other system...

regards, tom lane
On Wed, Jul 29, 2009 at 3:17 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> Greg Stark <gsstark@mit.edu> writes:
>> On Wed, Jul 29, 2009 at 1:58 PM, Jasen Betts<jasen@xnet.co.nz> wrote:
>>> can't coerce a signal from the network stack? the linux socket(2)
>>> manpage is full of promise (SIGPIPE, SIGURG, SIGIO)
>
> And the other question is how much of what you read in the Linux manpage
> is portable to any other system...

That is a question. But actually I think SIGIO might be fairly portable -- at least the first hit I found was for someone complaining that it wasn't working on Linux (due to a bug) and this broke their app, which worked everywhere else.

In any case this would be a feature which, if it didn't work, would leave us just where we are today. That's another advantage over trying to do something with SIGURG, which would be far more likely to cause headaches if it behaved incorrectly.

-- greg
http://mit.edu/~gsstark/resume.pdf
Greg Stark <gsstark@mit.edu> writes:
> That is a question. But actually I think SIGIO might be fairly
> portable -- at least the first hit I found was for someone complaining
> that it wasn't working on Linux (due to a bug) and this broke their
> app which worked everywhere else.

> In any case this would be a feature which if it didn't work would
> leave us just where we are today. That's another advantage over trying
> to do something with SIGURG which would be far more likely to cause
> headaches if it behaved incorrectly.

[ reads man pages for awhile... ]

It looks to me like SIGIO is sent whenever the socket comes ready for either reading or writing, which makes it pretty nearly useless for detecting a broken-connection condition. You'd be too busy filtering out uninteresting signals --- and the signal handler itself can't do very much of that work.

regards, tom lane
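[For the archive: Tom's objection is easy to demonstrate. The sketch below (Python on Linux; an AF_UNIX socketpair stands in for the client connection) requests SIGIO delivery with F_SETOWN + O_ASYNC, and the signal fires for perfectly ordinary incoming data, not just for disconnects:]

```python
import fcntl, os, signal, socket, time

hits = []
signal.signal(signal.SIGIO, lambda signum, frame: hits.append(signum))

server, client = socket.socketpair()

# Ask the kernel to deliver SIGIO to this process for I/O readiness
# changes on the 'server' end of the connection.
fcntl.fcntl(server.fileno(), fcntl.F_SETOWN, os.getpid())
flags = fcntl.fcntl(server.fileno(), fcntl.F_GETFL)
fcntl.fcntl(server.fileno(), fcntl.F_SETFL, flags | os.O_ASYNC)

client.send(b"hello")  # perfectly ordinary traffic, no disconnect at all...
time.sleep(0.2)        # ...yet SIGIO is delivered just the same

print(len(hits))
```

So a backend using this would take a SIGIO for every packet the client sends, and would still have to probe the socket to find out whether anything actually broke, which is exactly the filtering problem Tom describes.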
On Wed, 2009-07-29 at 14:56 +0100, Greg Stark wrote:
> SIGURG might be useful but it would be more complex to use and less
> widely useful since it would only work if the client disconnects
> gracefully (though it might be worth checking into as an alternative
> to our existing query cancel method).

Might it not also fire if the client disconnects without notice, but tcp keepalives are enabled? I might have to write a little test program and see.

[much later] My test program did not appear to receive SIGURG, even after registering for it with fcntl(sockfd, F_SETOWN, ...) and setting a signal handler for it. This was the case whether the connection was dropped due to a tcp keepalive failure, the dropping of a network interface, or a normal disconnect. The next read() or recv() returned zero bytes read, but no asynchronous notification appeared to occur. I'm under the impression it's really for use with asynchronous sockets, but haven't tested this yet.

What does work well is occasionally poking the socket with recv(..., MSG_DONTWAIT) while doing other work. Program attached. TCP keepalives seem to work very well, at least on my Linux test system, and it's easy to test for a dud connection using recv(...) with the MSG_DONTWAIT and (if desired) MSG_PEEK flags. If the connection has exited cleanly it'll return a zero-size read; if the connection has dropped due to keepalive failure it'll return ETIMEDOUT. Pg's backend code already supports keepalives.

I guess what'd be helpful would be a periodic recv(..., MSG_DONTWAIT) on the client<->server socket while the backend is working hard on a query. A SIGALRM would be handy for that, though I guess Pg isn't used to having to test for EINTR on syscalls...

-- Craig Ringer
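[For the archive, since the attachment may not survive: Craig's probe can be sketched in a few lines. Python stand-in for the C program, with a local socketpair in place of a real client connection; `peer_is_alive` is an illustrative name, not anything from the backend:]

```python
import errno, socket

def peer_is_alive(sock):
    """Probe the connection without blocking and without consuming data."""
    try:
        data = sock.recv(1, socket.MSG_DONTWAIT | socket.MSG_PEEK)
    except BlockingIOError:
        return True          # nothing to read, but the peer is still there
    except OSError as e:
        if e.errno == errno.ETIMEDOUT:
            return False     # keepalive probes went unanswered
        raise
    return data != b""       # b"" means the peer closed the connection cleanly

server, client = socket.socketpair()
print(peer_is_alive(server))  # True: connection open, no data pending
client.close()
print(peer_is_alive(server))  # False: zero-byte read, the peer has gone away
```

MSG_PEEK matters: without it the probe would eat bytes the query-processing code expects to read later.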
> Well SIGPIPE is no help since it would only fire if we tried to write
> to the socket anyways.

Right. For this purpose, pgpool sends a param packet to the client periodically while waiting for a reply from the backend, to detect if the connection to the client is broken. If it's broken, pgpool sends a cancel packet to the backend so as not to waste the backend machine's CPU cycles.

-- Tatsuo Ishii
SRA OSS, Inc. Japan
On Thu, Jul 30, 2009 at 8:41 AM, Tatsuo Ishii<ishii@postgresql.org> wrote:
>> Well SIGPIPE is no help since it would only fire if we tried to write
>> to the socket anyways.
>
> Right. For this purpose, pgpool sends a param packet to the client
> periodically while waiting for a reply from the backend, to detect if the
> connection to the client is broken. If it's broken, pgpool sends a cancel
> packet to the backend so as not to waste the backend machine's CPU cycles.

The downside to this is that it will cause spurious failures for transient network failures, even if the network comes back before it's actually needed.

-- greg
http://mit.edu/~gsstark/resume.pdf
On Thu, Jul 30, 2009 at 7:43 AM, Craig Ringer<craig@postnewspapers.com.au> wrote:
> On Wed, 2009-07-29 at 14:56 +0100, Greg Stark wrote:
>> SIGURG might be useful but it would be more complex to use and less
>> widely useful since it would only work if the client disconnects
>> gracefully (though it might be worth checking into as an alternative
>> to our existing query cancel method).
>
> Might it not also fire if the client disconnects without notice, but tcp
> keepalives are enabled?
>
> I might have to write a little test program and see.
>
> [much later] My test program did not appear to receive SIGURG, even
> after registering for it with fcntl(sockfd, F_SETOWN, ...) and setting a
> signal handler for it. This was the case whether the connection was
> dropped due to a tcp keepalive failure, the dropping of a network
> interface, or a normal disconnect. The next read() or recv() returned
> zero bytes read but no asynchronous notification appeared to occur. I'm
> under the impression it's really for use with asynchronous sockets, but
> haven't tested this yet.

Right, you'll only get SIGURG if there's actually any urgent data received. The client would have to actively send such data periodically. That would make this a portability headache, since it wouldn't just be an add-on which would fail gracefully if it's unsupported. The server and client would both have to be sure they understood whether they both supported this feature.

> What does work well is occasionally poking the socket with recv(...,
> MSG_DONTWAIT) while doing other work. Program attached. TCP keepalives
> seem to work very well at least on my Linux test system, and it's easy
> to test for a dud connection using recv(...) with the MSG_DONTWAIT and
> (if desired) MSG_PEEK flags. If the connection has exited cleanly it'll
> return a zero-size read; if the connection has dropped due to keepalive
> failure it'll return ETIMEDOUT.

The problem with this is that it introduces spurious failures for transient network failures. Also, it requires the server to periodically take time out from processing the query to do this. I think we want a zero-cost method which will interrupt processing if the client actively disconnects. If there's a network failure we'll find out about it in the normal course of events.

-- greg
http://mit.edu/~gsstark/resume.pdf
On Thu, Jul 30, 2009 at 10:27 AM, Csaba Nagy<nagy@ecircle-ag.com> wrote:
> Sorry, I have to disagree here. If there's a spurious network error, you
> usually have bigger problems. I prefer to have the connection killed
> even if the network recovers

I know this is a popular feeling. But you're throwing away decades of work in making TCP reliable. You would change feelings quickly if you ever faced this scenario too. All it takes is some bad memory or a bad wire, and you would be turning a performance drain into random connection drops.

> than risk an idle-in-transaction connection
> living forever when the client/network crashes for any reason. In case
> of network failure the connection will probably be cleaned up eventually,
> but it did happen to me that a client machine crashed in the middle of a
> transaction while not executing any SQL, and that connection stayed
> until I killed it manually.

Well, it ought to have eventually died. Your patience may have run out before the keep-alive timeouts fired, though.

-- greg
http://mit.edu/~gsstark/resume.pdf
Hi all,

On Thu, 2009-07-30 at 11:02 +0200, Greg Stark wrote:
> On Thu, Jul 30, 2009 at 7:43 AM, Craig Ringer<craig@postnewspapers.com.au> wrote:
>> What does work well is occasionally poking the socket with recv(...,
>> MSG_DONTWAIT) while doing other work. Program attached. TCP keepalives
>> seem to work very well at least on my Linux test system, and it's easy
>> to test for a dud connection using recv(...) with the MSG_DONTWAIT and
>> (if desired) MSG_PEEK flags. If the connection has exited cleanly it'll
>> return a zero-size read; if the connection has dropped due to keepalive
>> failure it'll return ETIMEDOUT.
>
> The problem with this is that it introduces spurious failures for
> transient network failures. Also it requires the server to
> periodically take time out from processing the query to do this. I
> think we want a zero-cost method which will interrupt processing if
> the client actively disconnects. If there's a network failure we'll
> find out about it in the normal course of events.

Sorry, I have to disagree here. If there's a spurious network error, you usually have bigger problems. I prefer to have the connection killed even if the network recovers than risk an idle-in-transaction connection living forever when the client/network crashes for any reason. In case of network failure the connection will probably be cleaned up eventually, but it did happen to me that a client machine crashed in the middle of a transaction while not executing any SQL, and that connection stayed until I killed it manually. A simple ping to the client would have cleared up the fact that the client is not there anymore.

I would also be happy to pay the cost of pinging the clients, say, once a minute (or a configurable interval). Considering that the connections are one to one with a client, it's enough to have a single timer which periodically signals each backend to ping its client, but these are implementation details for which I have no clue what would be best. The main thing is: I would love to have this functionality. It's extremely hard to secure all clients against crash, and a crash of one of the clients in the middle of a transaction can have very bad consequences (think indefinitely stuck open transaction).

Cheers,
Csaba.
On Thu, 2009-07-30 at 11:41 +0200, Greg Stark wrote:
> I know this is a popular feeling. But you're throwing away decades of
> work in making TCP reliable. You would change feelings quickly if you
> ever faced this scenario too. All it takes is some bad memory or a bad
> wire and you would be turning a performance drain into random
> connection drops.

But if I get bad memory or a bad wire I'll get much worse problems already, and don't tell me it will work more reliably if you don't kill the connection. It's a lot better to find out sooner that you have those problems and fix them than to have spurious errors, which you'll get even if you don't kill the connection in case of such problems.

> Well it ought to have eventually died. Your patience may have run out
> before the keep-alive timeouts fired though.

Well, it lived for at least one hour (could be more, I don't remember for sure), keeping vacuum from doing its job on a heavily updated DB. It was not so much about my patience as about starting to have abysmal performance, AFTER we fixed the initial cause of the crash, and without any warning. Of course I did find out immediately that bloat happens and found the idle transactions and killed them, but I imagine the hair-pulling for a less experienced postgres DBA. I would have also preferred that postgres solve this issue on its own - the network stack is clearly not fast enough in resolving it.

Cheers,
Csaba.
On Thu, Jul 30, 2009 at 10:59 AM, Csaba Nagy<nagy@ecircle-ag.com> wrote:
> But if I get bad memory or a bad wire I'll get much worse problems
> already, and don't tell me it will work more reliably if you don't kill
> the connection. It's a lot better to find out sooner that you have those
> problems and fix them than to have spurious errors, which you'll get even
> if you don't kill the connection in case of such problems.

Are you sure? Do you know how many times you haven't even found out you had a problem because TCP just silently kept working despite the problem?

Having had to use protocols which imposed their own timeouts on lame hotel networks, buggy wireless drivers, and bad DSL connections, and found my connections dying every few minutes, I can say it's maddeningly frustrating. Especially knowing that TCP was *supposed* to work in this scenario and they had broken it by trying to be clever.

> Well, it lived for at least one hour (could be more, I don't remember for
> sure), keeping vacuum from doing its job on a heavily updated DB. It was
> not so much about my patience as about starting to have abysmal
> performance, AFTER we fixed the initial cause of the crash, and without
> any warning. [...] I would have also
> preferred that postgres solve this issue on its own - the network
> stack is clearly not fast enough in resolving it.

Indeed, TCP keepalives left at their defaults don't time out for over 2 hours. But that's configurable in postgresql.conf.

-- greg
http://mit.edu/~gsstark/resume.pdf
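[For the archive: the "over 2 hours" figure follows directly from the usual Linux kernel defaults. Worst-case detection time is idle time plus interval times probe count (values below are the common Linux defaults; other systems differ):]

```python
# Common Linux defaults (see /proc/sys/net/ipv4/tcp_keepalive_*):
tcp_keepalive_time = 7200    # seconds idle before the first probe is sent
tcp_keepalive_intvl = 75     # seconds between unanswered probes
tcp_keepalive_probes = 9     # unanswered probes before declaring the peer dead

worst_case = tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
print(worst_case, round(worst_case / 3600, 2))   # 7875 seconds, ~2.19 hours
```

Dropping tcp_keepalives_idle to 60 and tcp_keepalives_interval to 10 with 5 probes brings the same worst case down to under two minutes, at the cost of a little probe traffic on idle connections.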
Greg Stark wrote:
> Right, you'll only get SIGURG if there's actually any urgent data
> received. The client would have to actively send such data
> periodically. That would make this a portability headache since it
> wouldn't just be an add-on which would fail gracefully if it's
> unsupported.

It'd also have the same issue as relying on keepalives - ie transient network drop-outs would be much more likely to cause an unnecessary query cancel. Worse, the client wouldn't even know about it, because it was unreachable at the time the server sent it a RST, so it'd be waiting for an answer from the server that'd never come...

>> What does work well is occasionally poking the socket with recv(...,
>> MSG_DONTWAIT) while doing other work.
>
> The problem with this is that it introduces spurious failures for
> transient network failures.

Yep, and often failures where only one side notices (unless _both_ sides are relying on keepalives and are checking the connection status). Ick.

> Also it requires the server to
> periodically take time out from processing the query to do this.

This aspect I'm not too bothered about. I doubt it'd cost anything detectable if done a few times a minute - unless it required restructuring of query processing to accommodate it. In that case, no way.

> I think we want a zero-cost method which will interrupt processing if
> the client actively disconnects. If there's a network failure we'll
> find out about it in the normal course of events.

Personally, I'm with you. I think the _real_ problem here is clients that're just giving up and vanishing without issuing a query cancel or, apparently, even closing the connection. It's _not_ hard to time out on a query and use another connection to cancel the backend. If you've set a server-side timeout, of course, there's no need to do anything, but if you're timing out client-side or your user has cancelled the request, it's still not exactly hard to clean up after yourself.

It was perhaps twenty minutes' work to implement a generic query cancel command with Java/JDBC - just spawn another thread or grab one from a worker pool, open a new Pg connection, issue the backend cancel, and close the connection. In the thread that just had its backend cancelled you receive an informative SQLException.

In fact, I'm not even sure _how_ one goes about exiting without sending an RST. A quick check shows that when I `kill -9' a process with an open client socket (ssh, in this case) the OS sends a FIN, and responds to the server's FIN,ACK with its own ACK. So the OS is closing the socket for the dead process. If I try this with a `psql' process, the server cleans up the orphaned backend promptly.

So, barring network breaks (wifi down / out of range, ethernet cable fell out, etc etc), how is the OP managing to leave backends running queries? Hard-resetting the machine?

-- Craig Ringer
Csaba Nagy wrote:
> On Thu, 2009-07-30 at 11:41 +0200, Greg Stark wrote:
>> I know this is a popular feeling. But you're throwing away decades of
>> work in making TCP reliable. You would change feelings quickly if you
>> ever faced this scenario too. All it takes is some bad memory or a bad
>> wire and you would be turning a performance drain into random
>> connection drops.
>
> But if I get bad memory or a bad wire I'll get much worse problems
> already, and don't tell me it will work more reliably if you don't kill
> the connection. It's a lot better to find out sooner that you have those
> problems and fix them than to have spurious errors, which you'll get even
> if you don't kill the connection in case of such problems.

Transient connection issues are not infrequent, and shouldn't promptly kill connections. A user's wifi might drop and then re-establish service. They might bump the Ethernet cable out (and it's inevitably lost its retaining clip). A router _somewhere_ along the route might reboot. Etc.

That said, TCP keepalives are designed to allow for this, and only consider the connection dead if it's failed to respond for a reasonable period and hasn't acknowledged several requests.

> Well, it lived for at least one hour (could be more, I don't remember for
> sure), keeping vacuum from doing its job on a heavily updated DB.

Unless you've changed the defaults, TCP keepalives will take several hours to notice a dead connection - if they're enabled at all.

> It was not so much about my patience as about starting to have abysmal
> performance, AFTER we fixed the initial cause of the crash, and without
> any warning. Of course I did find out immediately that bloat
> happens and found the idle transactions

Idle? I thought your issue was _active_ queries running, servicing requests from clients that'd since ceased to care? How did you manage to kill the client in such a way that the OS on the client didn't send a FIN to the server anyway? Hard-reset the client machine(s)?

> and killed them, but I imagine
> the hair-pulling for a less experienced postgres DBA. I would have also
> preferred that postgres solve this issue on its own - the network
> stack is clearly not fast enough in resolving it.

It's not really meant to happen in the first place. I do think that if you have a lot of connections from unreliable machines (say hosts with intermittent connectivity) then you'd want to make sure tcp keepalives are active and that you've tuned the keepalive params to be much more aggressive.

I thought your issue was the backend not terminating a query when the client died while the backend was in the middle of a long-running query. Keepalives alone won't solve that one.

-- Craig Ringer
On Thu, 2009-07-30 at 13:22 +0200, Craig Ringer wrote:
> So, barring network breaks (wifi down / out of range, ethernet cable
> fell out, etc etc), how is the OP managing to leave backends running
> queries? Hard-resetting the machine?

It happened to us when a client box ran out of memory and started swapping, up to the point where it was inaccessible even for console login. The connections of that machine were still live but unusable, as the client box will never get out of that state until hard resetting... which we would promptly do, but the connections on the postgres side would still live on.

I will probably have to check out now the network connection parameters in the postgres configuration, never had a look at them before... in any case, the >2 hours mentioned in an earlier post seems a bad default to me.

Cheers,
Csaba.
Csaba Nagy wrote:
> A simple ping to the client would have
> cleared up the fact that the client is not there anymore.

Yep. It'd also stop PostgreSQL working for clients with software firewalls, since most of them drop ICMP ECHO ("ping"). TCP keepalives are designed to do the same thing, but do it reliably and properly. Why not configure your tcp keepalive intervals instead?

Note that tcp keepalives won't kill the backend if it's in the middle of working on an I/O or CPU intensive query. It'll die when it tries to send the client data, or tries to read something more from the client. One trick you can use if you're really worried about this is `RAISE NOTICE' periodically during a big complicated PL/PgSQL procedure. The NOTICE should be sent to the client immediately, which will cause Pg to notice the socket is dead. It won't help you with long running normal SQL queries, though.

I do think it'd be kind of nice to occasionally poke the connection to see if it's alive while executing a query. Doing it in a SIGALRM handler, so as not to require the query execution path to be scattered with "checkConnStatus()" or whatever calls, might be one approach. It'd have _very_ minimal cost, would only cost anything while the backend was working, and would help avoid wasted work when a client vanishes. When the backend is idle (including "idle in transaction") it can disable any timer. It'll be poll()ing or select()ing or read()ing or whatever on the socket and will immediately notice if it breaks.

> I would also be
> happy to pay the cost of pinging the clients, say, once a minute
> (or a configurable interval). Considering that the connections are one to
> one with a client, it's enough to have a single timer which periodically
> signals each backend to ping its client, but these are implementation
> details for which I have no clue what would be best,

That applies in your situation, but for many others that's not the case. Not that it matters a great deal.

> the main thing
> is: I would love to have this functionality. It's extremely hard to
> secure all clients against crash, and a crash of one of the clients in
> the middle of a transaction can have very bad consequences (think
> indefinitely stuck open transaction).

Nope. Just tune your keepalives if you have hopelessly flakey clients.

Even if the client _program_ crashes, though, you shouldn't have anything left lying around. It's only if the client _OS_ crashes or the machine is hard-reset that you should be left with a backend lying around until tcp keepalives notice.

-- Craig Ringer
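[For the archive: Craig's SIGALRM idea can be sketched end to end. A periodic timer fires while the "query" runs and its handler probes the client socket with a non-blocking peek; when the peek returns zero bytes, the work loop is interrupted. Python stand-in on Linux (a real backend would do this in C on the client socket); `ClientGone` and `check_client` are illustrative names:]

```python
import signal, socket

server, client = socket.socketpair()
client.close()               # simulate the client vanishing mid-query

class ClientGone(Exception):
    pass

def check_client(signum, frame):
    # Peek at the socket without blocking or consuming data; a zero-byte
    # read means the client closed the connection, so stop wasting work.
    try:
        if server.recv(1, socket.MSG_DONTWAIT | socket.MSG_PEEK) == b"":
            signal.setitimer(signal.ITIMER_REAL, 0, 0)  # stop further checks
            raise ClientGone("client went away")
    except BlockingIOError:
        pass                 # no data pending, connection still alive

signal.signal(signal.SIGALRM, check_client)
signal.setitimer(signal.ITIMER_REAL, 0.05, 0.05)   # probe every 50 ms

try:
    while True:              # stand-in for a long-running query
        sum(range(10000))
except ClientGone as e:
    print("query cancelled:", e)
finally:
    signal.setitimer(signal.ITIMER_REAL, 0, 0)     # make sure the timer is off
```

The cost while the connection is healthy is one failed non-blocking recv() per timer tick; the timer is simply disabled when the backend goes idle.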
Csaba Nagy wrote:
> It happened to us when a client box ran out of memory and started
> swapping, up to the point where it was inaccessible even for console
> login. The connections from that machine were still live but unusable,
> as the client box would never get out of that state until a hard
> reset... which we would promptly do

You know, you can prevent that on any sane operating system by setting
resource limits on at-risk processes, limiting total available swap to a
reasonable amount, etc. On Linux you may want to turn off memory
overcommit or dramatically reduce how much swap you allocate.

A host with a runaway process hogging memory shouldn't be dying. It
should be killing off the problem process, or the problem process should
be dying by itself after failing to allocate the memory it requested. If
this isn't the case, YOU HAVE TOO MUCH SWAP. After all, swap is useless
if there's so much of it that using it brings the system to a halt.

> I will probably have to check out the network connection parameters in
> the postgres configuration now; I never had a look at them before...
> in any case, the >2 hours mentioned in an earlier post seems a bad
> default to me.

It's the OS's default. PostgreSQL just doesn't change it.

--
Craig Ringer
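[Editorial note: the Linux overcommit advice above corresponds to sysctl settings along these lines — a sketch only; the ratio value is an example, not a recommendation from the thread:]

```
# /etc/sysctl.conf -- fail allocations instead of overcommitting into swap hell
vm.overcommit_memory = 2    # strict accounting: no heuristic overcommit
vm.overcommit_ratio = 80    # commit limit = swap + 80% of RAM
```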
On Thu, 2009-07-30 at 13:40 +0200, Craig Ringer wrote:
> > A simple ping to the client would have cleared the fact that the
> > client is not there anymore.
>
> Yep. It'd also stop PostgreSQL working for clients with software
> firewalls, since most of them drop ICMP ECHO ("ping").

I didn't mean an ICMP 'ping', but a higher-level one...

> TCP keepalives are designed to do the same thing, but do it reliably
> and properly. Why not configure your tcp keepalive intervals instead?

Will do. Normally we have good networking, so I never had to touch them
before (and have no experience with network problems anyway)...

> > the main thing is: I would love to have this functionality. It's
> > extremely hard to secure all clients against crashes, and a crash of
> > one of the clients in the middle of a transaction can have very bad
> > consequences (think of an indefinitely stuck open transaction).
>
> Nope. Just tune your keepalives if you have hopelessly flaky clients.

On the contrary, we have very stable networking here; the problem was
never a networking one...

> Even if the client _program_ crashes, though, you shouldn't have
> anything left lying around. It's only if the client _OS_ crashes or
> the machine is hard-reset that you should be left with a backend lying
> around until tcp keepalives notice.

As explained in an earlier email, the client box's OS went down in swap
hell.

Cheers,
Csaba.
[Just to make things clear: I'm not the one who brought up this
discussion, only someone who was also bitten once by zombie connections.]

On Thu, 2009-07-30 at 13:29 +0200, Craig Ringer wrote:
> Idle? I thought your issue was _active_ queries running, servicing
> requests from clients that'd since ceased to care?

No, the problem was zombie connections idle in transaction.

> How did you manage to kill the client in such a way that the OS on the
> client didn't send a FIN to the server anyway? Hard-reset the client
> machine(s)?

It was a runaway client box in swapping hell. It had plenty of open
connections, most of them idle in transaction (because it started to
take veeery long to do its processing between the queries).

> I thought your issue was the backend not terminating a query when the
> client died while the backend was in the middle of a long-running
> query. Keepalives alone won't solve that one.

No, that is not so dangerous for us; we usually chunk the queries coming
from applications so that they can't run for too long at all. But it
seems that if I lower the network timeout, then even the crashed-client
scenario could be OK. I will have to test that...

Cheers,
Csaba.
On Thu, Jul 30, 2009 at 12:22 PM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
> In fact, I'm not even sure _how_ one goes about exiting without
> sending an RST. A quick check shows that when I `kill -9' a process
> with an open client socket (ssh, in this case) the OS sends a FIN, and
> responds to the server's FIN,ACK with its own ACK. So the OS is
> closing the socket for the dead process. If I try this with a `psql'
> process, the server cleans up the orphaned backend promptly.

Fwiw this description sounds like you're leaving out part of the
picture. TCP connections consist of two independent flows, one in each
direction. Each flow can be closed by a FIN independently. If the client
program dies and the OS sends a FIN to close the client->server flow,
that doesn't imply the server will close the server->client flow.

For the situation you describe to arise, the server would have to notice
the EOF on the client connection and explicitly call close. That happens
if the connection is idle or idle-in-transaction. If the server is in
the middle of running a query, it won't notice anything until it tries
to write to the socket and gets an RST in response, because the
listening process has died.

So there are really two different cases here. Case 1: the server is idle
and reading from a connection, and the client has disappeared completely
without closing it. In that case keepalives, or having the server try to
send data, are the only ways to notice the problem. Case 2: the server
is busy, and the client has either cleanly closed the connection or
died, but the server doesn't notice for a long time because it isn't
sending any data and isn't trying to read any either.

--
greg
http://mit.edu/~gsstark/resume.pdf
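[Editorial note: the two-independent-flows point above is easy to demonstrate with a connected socket pair — a sketch, with a Unix-domain socketpair standing in for the TCP link; its shutdown semantics mirror TCP's half-close:]

```python
import socket

server, client = socket.socketpair()  # connected pair standing in for a TCP link

client.shutdown(socket.SHUT_WR)       # client closes its outgoing flow (the "FIN")

eof = server.recv(16)                 # server sees EOF: the client->server flow is gone
server.sendall(b"result row")         # ...but server->client is still open, so a busy
reply = client.recv(16)               # backend that only writes sees nothing wrong yet
```

`eof` comes back empty (the read side is closed), yet the server's write and the client's read still succeed.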
[This is getting off-topic.]

On Thu, 2009-07-30 at 13:44 +0200, Craig Ringer wrote:
> A host with a runaway process hogging memory shouldn't be dying. It
> should really be killing off the problem process, or the problem
> process should be dying by itself after failing to allocate requested
> memory. If this isn't the case, YOU HAVE TOO MUCH SWAP.
>
> After all, swap is useless if there's so much that using it brings the
> system to a halt.

In theory you're right; in practice I can't control any of this. Those
are the client boxes, and I control only the DB. The most I can do is
kindly ask the colleagues in charge of them to make sure it won't happen
again, and even then there will still be cases like a virtual machine
simply crashing.

> > I will probably have to check out the network connection parameters
> > in the postgres configuration now; I never had a look at them
> > before... in any case, the >2 hours mentioned in an earlier post
> > seems a bad default to me.
>
> It's the OS's default. PostgreSQL just doesn't change it.

Well, then it looks like I will have to learn a bit about TCP keepalive
and how Linux handles it...

Cheers,
Csaba.
Craig Ringer <craig@postnewspapers.com.au> writes:
> Greg Stark wrote:
>> Also it requires the server to periodically take time out from
>> processing the query to do this.

> This aspect I'm not too bothered about. I doubt it'd cost anything
> detectable if done a few times a minute - unless it required
> restructuring of query processing to accommodate it. In that case, no
> way.

Well, it would. Your suggestion of doing it in a signal handler is a
nonstarter --- signal handlers can't really do anything very useful, for
fear of messing up whatever processing they interrupted.

The earlier part of the discussion was focused on getting the kernel to
actively tell us when the connection had dropped. That would be workable
if we found a way to request it, but I think we've run out of options
:-(

regards, tom lane
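[Editorial note: the "poke the connection" idea being debated can be sketched in user space with a zero-timeout select() plus a MSG_PEEK read — an illustration of the concept only, not how the backend works; `connection_alive` is a hypothetical helper, and it would mistake pending client data for liveness:]

```python
import select
import socket

def connection_alive(sock):
    """Cheap best-effort check for a vanished peer."""
    readable, _, _ = select.select([sock], [], [], 0)  # zero timeout: never blocks
    if not readable:
        return True                       # nothing pending; assume the peer is alive
    # Readable while we expect no traffic: a zero-byte peek means EOF.
    return sock.recv(1, socket.MSG_PEEK) != b""

server, client = socket.socketpair()
alive_before = connection_alive(server)   # client still connected
client.close()                            # client goes away
alive_after = connection_alive(server)    # EOF is now pending on the server side
```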
On Thu, Jul 30, 2009 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The earlier part of the discussion was focused on getting the kernel
> to actively tell us when the connection had dropped. That would be
> workable if we found a way to request it, but I think we've run out of
> options :-(

Yeah, everything I've found on SIGIO indicates it would fire whenever
the socket was readable, which isn't what we want. Does anyone have a
pointer to full documentation on how to use it?

--
greg
http://mit.edu/~gsstark/resume.pdf
> > Right. For this purpose, pgpool sends a param packet to the client
> > periodically while waiting for a reply from the backend, to detect
> > whether the connection to the client is broken. If it is broken,
> > pgpool sends a cancel packet to the backend so as not to waste the
> > backend machine's CPU cycles.
>
> The downside to this is that it will cause spurious failures for
> transient network failures, even if the network comes back before it's
> actually needed.

Can't we expect such a transient error to be fixed by the underlying
TCP/IP stack?

--
Tatsuo Ishii
SRA OSS, Inc. Japan
On 27 Jul, 21:49, t...@sss.pgh.pa.us (Tom Lane) wrote:
> Robert James <srobertja...@gmail.com> writes:
> > Hi. I noticed that when clients (both psql and pgAdmin) disconnect
> > or cancel, queries are often still running on the server. A few
> > questions:
> > 1) Is there a way to reconnect and get the results?
>
> No.
>
> > 2) Is there a way to tell postgres to automatically stop all queries
> > when the client who queried them disconnects?
>
> No.
>
> > 3) Is there a way to see all queries whose clients have
> > disconnected?
>
> No.

OK then. Disconnection detection is hard.

Now I'm in a situation where I have many "IDLE in transaction"
connections. I spotted them myself. I could kill them, but that wouldn't
prevent them from happening again; in fact, it happens regularly.

Is there a way to get the list of SQL statements that were previously
executed as part of a given transaction? If I could get that, it would
be a hell of a lot easier to figure out the misbehaving client.

Thanks for any pointers,
Nicolas
On Thu, Aug 06, 2009 at 07:23:41AM -0700, Nicolas wrote:
> Now I'm in a situation where I have many "IDLE in transaction"
> connections. I spotted them myself. I could kill them, but that
> wouldn't prevent them from happening again; in fact, it happens
> regularly.
>
> Is there a way to get the list of SQL statements that were previously
> executed as part of a given transaction?

I'd play with logging: set your "log_line_prefix"[1] to include the
process id, and set "log_statement"[2] to "all". It's then a simple
matter of searching back through the logs to find out what was going on.

--
Sam  http://samason.me.uk/

[1] http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-LINE-PREFIX
[2] http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-STATEMENT
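[Editorial note: the logging suggestion above corresponds to a couple of lines in postgresql.conf — a sketch; the prefix format is an example, but %p and %t are the documented escapes for process ID and timestamp:]

```
log_line_prefix = '%t [%p] '   # timestamp and backend PID on every log line
log_statement = 'all'          # log every statement, so a stuck PID's history is greppable
```

Both settings take effect on a configuration reload; no server restart is needed.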