Thread: killing pg_dump leaves backend process

killing pg_dump leaves backend process

From
Tatsuo Ishii
Date:
I noticed pg_dump does not exit gracefully when killed.

start pg_dump
kill pg_dump by ctrl-c
ps x

27246 ?        Ds    96:02 postgres: t-ishii dbt3 [local] COPY    
29920 ?        S      0:00 sshd: ishii@pts/5
29921 pts/5    Ss     0:00 -bash
30172 ?        Ss     0:00 postgres: t-ishii dbt3 [local] LOCK TABLE waiting

As you can see, after killing pg_dump, a backend process (LOCK TABLE
waiting) is left behind. I think this could easily be fixed by adding
a signal handler to pg_dump so that it catches the signal and issues a
query cancel request.
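
Something along these lines, perhaps (just a minimal sketch using
libpq's cancel API; the names and error handling are simplified, this
is not actual pg_dump code):

#include <signal.h>
#include <unistd.h>

#include "libpq-fe.h"

static PGcancel *volatile cancelConn = NULL;

/* SIGINT/SIGTERM handler: ask the backend to cancel, then exit */
static void
exit_nicely_on_signal(int signum)
{
    char        errbuf[256];

    if (cancelConn != NULL)
        PQcancel(cancelConn, errbuf, sizeof(errbuf));   /* signal-safe */
    _exit(1);
}

static void
setup_cancel_handler(PGconn *conn)
{
    cancelConn = PQgetCancel(conn);
    signal(SIGINT, exit_nicely_on_signal);
    signal(SIGTERM, exit_nicely_on_signal);
}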

Thoughts?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: killing pg_dump leaves backend process

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
> I noticed pg_dump does not exit gracefully when killed.
> start pg_dump
> kill pg_dump by ctrl-c
> ps x

> 27246 ?        Ds    96:02 postgres: t-ishii dbt3 [local] COPY    
> 29920 ?        S      0:00 sshd: ishii@pts/5
> 29921 pts/5    Ss     0:00 -bash
> 30172 ?        Ss     0:00 postgres: t-ishii dbt3 [local] LOCK TABLE waiting

> As you can see, after killing pg_dump, a backend process (LOCK TABLE
> waiting) is left behind. I think this could easily be fixed by adding
> a signal handler to pg_dump so that it catches the signal and issues a
> query cancel request.

If we think that's a problem (which I'm not convinced of) then pg_dump
is the wrong place to fix it.  Any other client would behave the same
if it were killed while waiting for some backend query.  So the right
fix would involve figuring out a way for the backend to kill itself
if the client connection goes away while it's waiting.
        regards, tom lane



Re: killing pg_dump leaves backend process

From
Greg Stark
Date:
On Sat, Aug 10, 2013 at 5:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Any other client would behave the same
> if it were killed while waiting for some backend query.  So the right
> fix would involve figuring out a way for the backend to kill itself
> if the client connection goes away while it's waiting.

Well I'm not sure. Maybe every other client should also issue a query
cancel and close the connection if it gets killed. libpq could offer a
function specifically for programs to call from atexit(), signal
handlers, or exception handlers (yes, that might be a bit tricky).
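
For example, something like this (purely hypothetical; PQcancelAndClose
is an invented name, not an existing libpq function, and the tricky
part is that PQfinish is not async-signal-safe, so from a signal
handler you could really only do the cancel half):

#include "libpq-fe.h"

/*
 * Hypothetical helper a client could call from atexit() or other
 * normal-exit paths: cancel whatever is running, then disconnect.
 */
void
PQcancelAndClose(PGconn *conn)
{
    char      errbuf[256];
    PGcancel *cancel = PQgetCancel(conn);

    if (cancel != NULL)
    {
        PQcancel(cancel, errbuf, sizeof(errbuf));
        PQfreeCancel(cancel);
    }
    PQfinish(conn);     /* sends the Terminate message and closes */
}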

But I do see a convincing argument for doing something in the server.
Namely that if you kill -9 the client surely the server should still
detect that the connection has gone away immediately.

The problem is that I don't know of any way to detect eof on a socket
other than trying to read from it (or calling poll or select). So the
server would have to periodically poll the client even when it's not
expecting any data. The inefficiency is annoying enough and it still
won't detect the eof immediately.
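
For what it's worth, the check itself is simple enough (a generic
sketch of the technique, not backend code; MSG_DONTWAIT is a
Linux-style flag):

#include <poll.h>
#include <stdbool.h>
#include <sys/socket.h>

/*
 * Returns true if the peer has closed or broken the connection.
 * poll() plus a MSG_PEEK read means no pending data is consumed.
 */
static bool
peer_has_gone_away(int sock)
{
    struct pollfd pfd = {sock, POLLIN, 0};
    char        c;

    if (poll(&pfd, 1, 0) <= 0)
        return false;           /* nothing to report, or poll failed */
    if (pfd.revents & (POLLHUP | POLLERR))
        return true;
    if (pfd.revents & POLLIN)
        return recv(sock, &c, 1, MSG_PEEK | MSG_DONTWAIT) == 0;  /* EOF */
    return false;
}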

I would actually tend to think libpq should offer a way to easily send
a cancel and disconnect message immediately upon exiting or closing
the connection *and* the server should periodically poll to check for
the connection being cleanly closed to handle kill -9.

I'm surprised this is the first time we're hearing people complain
about this. I know I've seen similar behaviour from MySQL, thought to
myself that it represented pretty poor behaviour, and assumed Postgres
did better.



-- 
greg



Re: killing pg_dump leaves backend process

From
Tatsuo Ishii
Date:
> On Sat, Aug 10, 2013 at 5:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Any other client would behave the same
>> if it were killed while waiting for some backend query.  So the right
>> fix would involve figuring out a way for the backend to kill itself
>> if the client connection goes away while it's waiting.

I am a little bit surprised by the response. I'm talking about one of
the client programs that are part of PostgreSQL. IMO they should meet
a higher standard than other PostgreSQL application programs in error
handling and signal handling.

> Well I'm not sure. Maybe every other client should also issue a query
> cancel and close the connection if it gets killed. libpq could offer a
> function specifically for programs to call from atexit(), signal
> handlers, or exception handlers (yes, that might be a bit tricky).

I'm not sure if that's a duty of libpq. Different applications need to
behave differently when catching signals. I think it would be better
to leave that job to the applications.

> But I do see a convincing argument for doing something in the server.
> Namely that if you kill -9 the client surely the server should still
> detect that the connection has gone away immediately.
> 
> The problem is that I don't know of any way to detect eof on a socket
> other than trying to read from it (or calling poll or select). So the
> server would have to periodically poll the client even when it's not
> expecting any data. The inefficiency is annoying enough and it still
> won't detect the eof immediately.

I think in some cases reading from the socket is not reliable enough
to detect a broken socket. Writing to the socket is more reliable.  For
this purpose Pgpool-II periodically sends a "parameter status" packet
to the frontend while waiting for a response from the backend, to
detect whether the socket is broken. The PostgreSQL backend could
probably do a similar thing.
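
The probe itself is trivial; roughly (a bare sketch, not actual
Pgpool-II code, and buf/len stand in for a real protocol message such
as a ParameterStatus packet):

#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>

/*
 * Send a small, protocol-legal message to the peer.  A failed write is
 * a much stronger signal of a dead peer than a quiet read side.
 * Returns true if the socket looks broken.
 */
static bool
probe_peer(int sock, const char *buf, size_t len)
{
    ssize_t     rc = send(sock, buf, len, MSG_NOSIGNAL);   /* no SIGPIPE */

    return (rc < 0 &&
            (errno == EPIPE || errno == ECONNRESET || errno == ETIMEDOUT));
}

Note that the first write after the peer vanishes can still succeed
(it only fills the local send buffer), so in practice it can take a
couple of probe intervals before the breakage is noticed.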

> I would actually tend to think libpq should offer a way to easily send
> a cancel and disconnect message immediately upon exiting or closing
> the connection *and* the server should periodically poll to check for
> the connection being cleanly closed to handle kill -9.
>
> I'm surprised this is the first time we're hearing people complain
> about this. I know I've seen similar behaviour from MySQL, thought to
> myself that it represented pretty poor behaviour, and assumed Postgres
> did better.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: killing pg_dump leaves backend process

From
Christopher Browne
Date:
On Sat, Aug 10, 2013 at 12:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tatsuo Ishii <ishii@postgresql.org> writes:
>> I noticed pg_dump does not exit gracefully when killed.
>> start pg_dump
>> kill pg_dump by ctrl-c
>> ps x
>
>> 27246 ?        Ds    96:02 postgres: t-ishii dbt3 [local] COPY
>> 29920 ?        S      0:00 sshd: ishii@pts/5
>> 29921 pts/5    Ss     0:00 -bash
>> 30172 ?        Ss     0:00 postgres: t-ishii dbt3 [local] LOCK TABLE waiting
>
>> As you can see, after killing pg_dump, a backend process (LOCK TABLE
>> waiting) is left behind. I think this could easily be fixed by adding
>> a signal handler to pg_dump so that it catches the signal and issues a
>> query cancel request.
>
> If we think that's a problem (which I'm not convinced of) then pg_dump
> is the wrong place to fix it.  Any other client would behave the same
> if it were killed while waiting for some backend query.  So the right
> fix would involve figuring out a way for the backend to kill itself
> if the client connection goes away while it's waiting.

This seems to me to be quite a bit like the TCP keepalive issue.

We noticed with Slony that if something ungraceful happens in the
networking layer (the specific thing noticed was someone shutting off
networking, e.g. "/etc/init.d/networking stop" before shutting down
Postgres+Slony), the usual timeouts are really rather excessive, on
the order of a couple hours.

Probably it would be desirable to reduce the timeout period, so that
the server could figure out that clients are incommunicado "reasonably
quickly."  It's conceivable that it would be apropos to diminish the
timeout values in postgresql.conf, or at least to recommend that users
consider doing so.
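
Concretely, the knobs already exist in postgresql.conf (the defaults
of 0 mean "use the OS settings", which on Linux is that
couple-of-hours figure); values like these are just an example, not a
recommendation:

# Detect dead client connections faster than the OS defaults
tcp_keepalives_idle = 60        # seconds idle before the first probe
tcp_keepalives_interval = 10    # seconds between probes
tcp_keepalives_count = 6        # unanswered probes before giving up

(These only help for TCP connections, of course; a Unix-domain socket
like the [local] case above isn't covered by keepalives at all.)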
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: killing pg_dump leaves backend process

From
Noah Misch
Date:
On Sat, Aug 10, 2013 at 12:26:43PM +0100, Greg Stark wrote:
> On Sat, Aug 10, 2013 at 5:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Any other client would behave the same
> > if it were killed while waiting for some backend query.  So the right
> > fix would involve figuring out a way for the backend to kill itself
> > if the client connection goes away while it's waiting.
> 
> Well I'm not sure. Maybe every other client should also issue a query
> cancel and close the connection if it gets killed. libpq could offer a
> function specifically for programs to call from atexit(), signal
> handlers, or exception handlers (yes, that might be a bit tricky).
> 
> But I do see a convincing argument for doing something in the server.
> Namely that if you kill -9 the client surely the server should still
> detect that the connection has gone away immediately.

I agree that both efforts have value.  A client-side change can't replace the
server-side change, and tightening the client side will be more of a neatness
measure once the server-side mechanism is in place.

> The problem is that I don't know of any way to detect eof on a socket
> other than trying to read from it (or calling poll or select). So the
> server would have to periodically poll the client even when it's not
> expecting any data. The inefficiency is annoying enough and it still
> won't detect the eof immediately.

Yes, I think that is the way to do it.  The check interval could default to
something like 90s, high enough to make the cost disappear into the noise and
yet a dramatic improvement over the current "no fixed time limit".

I bet the utils/timeout.h infrastructure added in 9.3 will make this at least
60% easier to implement than it would have been before.
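
Something like this, presumably (a very rough sketch against the 9.3
timeout API; the GUC and the ClientConnectionCheckPending flag are
invented names, and the re-arming/latch details are glossed over):

#include "postgres.h"

#include "miscadmin.h"
#include "utils/timeout.h"

/* invented GUC: seconds between client-liveness checks; 0 disables */
int         client_check_interval = 90;

static TimeoutId client_check_timeout;
static volatile bool ClientConnectionCheckPending = false;

/* runs in the SIGALRM handler: just flag it for CHECK_FOR_INTERRUPTS */
static void
ClientCheckTimeoutHandler(void)
{
    ClientConnectionCheckPending = true;
    InterruptPending = true;
}

static void
setup_client_check(void)
{
    client_check_timeout = RegisterTimeout(USER_TIMEOUT,
                                           ClientCheckTimeoutHandler);
    if (client_check_interval > 0)
        enable_timeout_after(client_check_timeout,
                             client_check_interval * 1000);
}

ProcessInterrupts() would then notice the flag, do the cheap socket
check, and re-arm the timeout if the client still looks alive.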

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: killing pg_dump leaves backend process

From
Greg Stark
Date:
<p dir="ltr">I think this is utterly the won't way to think about this.<p dir="ltr">TCP is designed to be robust
againsttransient network outages. They are *not* supposed to cause disconnections. The purpose of keepalives is to
detectconnections that are still valid live connections that are stale and the remote end is not longer present for.<p
dir="ltr">Keepalivesthat trigger on the timescale of less than several times the msl are just broken and make TCP
unreliable.That means they cannot trigger in less than many minutes.<p dir="ltr">This case is one that should just work
andshould work immediately. From the users point of view when a client cleanly dies the kernel on the client is fully
awareof the connection being closed and the network is working fine. The server should be aware the client has gone
away*immediately*. There's no excuse for any polling or timeouts.<br /><br /><p dir="ltr">-- <br /> greg<div
class="gmail_quote">On10 Aug 2013 17:30, "Christopher Browne" <<a
href="mailto:cbbrowne@gmail.com">cbbrowne@gmail.com</a>>wrote:<br type="attribution" /><blockquote
class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> On Sat, Aug 10, 2013 at 12:30
AM,Tom Lane <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>> wrote:<br /> > Tatsuo Ishii <<a
href="mailto:ishii@postgresql.org">ishii@postgresql.org</a>>writes:<br /> >> I noticed pg_dump does not exit
gracefullywhen killed.<br /> >> start pg_dump<br /> >> kill pg_dump by ctrl-c<br /> >> ps x<br />
><br/> >> 27246 ?        Ds    96:02 postgres: t-ishii dbt3 [local] COPY<br /> >> 29920 ?        S    
 0:00sshd: ishii@pts/5<br /> >> 29921 pts/5    Ss     0:00 -bash<br /> >> 30172 ?        Ss     0:00
postgres:t-ishii dbt3 [local] LOCK TABLE waiting<br /> ><br /> >> As you can see, after killing pg_dump, a
backendprocess is (LOCK<br /> >> TABLE waiting) left behind. I think this could be easily fixed by<br /> >>
addingsignal handler to pg_dump so that it catches the signal and<br /> >> issues a query cancel request.<br />
><br/> > If we think that's a problem (which I'm not convinced of) then pg_dump<br /> > is the wrong place to
fixit.  Any other client would behave the same<br /> > if it were killed while waiting for some backend query.  So
theright<br /> > fix would involve figuring out a way for the backend to kill itself<br /> > if the client
connectiongoes away while it's waiting.<br /><br /> This seems to me to be quite a bit like the TCP keepalive issue.<br
/><br/> We noticed with Slony that if something ungraceful happens in the<br /> networking layer (the specific thing
noticedwas someone shutting off<br /> networking, e.g. "/etc/init.d/networking stop" before shutting down<br />
Postgres+Slony),the usual timeouts are really rather excessive, on<br /> the order of a couple hours.<br /><br />
Probablyit would be desirable to reduce the timeout period, so that<br /> the server could figure out that clients are
incommunicado"reasonably<br /> quickly."  It's conceivable that it would be apropos to diminish the<br /> timeout
valuesin postgresql.conf, or at least to recommend that users<br /> consider doing so.<br /> --<br /> When confronted
bya difficult problem, solve it by reducing it to the<br /> question, "How would the Lone Ranger handle this?"<br /><br
/><br/> --<br /> Sent via pgsql-hackers mailing list (<a
href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/> To make changes to your
subscription:<br/><a href="http://www.postgresql.org/mailpref/pgsql-hackers"
target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/></blockquote></div> 

Re: killing pg_dump leaves backend process

From
Josh Berkus
Date:
On 08/10/2013 04:26 AM, Greg Stark wrote:
> On Sat, Aug 10, 2013 at 5:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Any other client would behave the same
>> if it were killed while waiting for some backend query.  So the right
>> fix would involve figuring out a way for the backend to kill itself
>> if the client connection goes away while it's waiting.

I've been waiting forever to have something we can justifiably call the
"loner suicide patch".  ;-)

> I'm surprised this is the first time we're hearing people complain
> about this. I know I've seen similar behaviour from MySQL, thought to
> myself that it represented pretty poor behaviour, and assumed Postgres
> did better.

No, it's been a chronic issue since we got SMP support, pretty much
forever.  Why do you think we have pg_terminate_backend()?

The problem, as explored downthread, is that there's no clear way to fix
this.  It's a problem which goes pretty far beyond PostgreSQL; you can
experience the same issue on Apache with stuck downloads.

Our advantage over MySQL is that the idle process isn't likely to crash
anything ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: killing pg_dump leaves backend process

From
Greg Stark
Date:
On Sun, Aug 11, 2013 at 9:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
> No, it's been a chronic issue since we got SMP support, pretty much
> forever.  Why do you think we have pg_terminate_backend()?
>
> The problem, as explored downthread, is that there's no clear way to fix
> this.  It's a problem which goes pretty far beyond PostgreSQL; you can
> experience the same issue on Apache with stuck downloads.

No. There are multiple problems that can cause a stuck orphaned server
process and I think you're conflating different kinds of problems.

a) If the client dies due to C-c or kill or any other normal exit
path. There's really no excuse for not detecting that situation
*immediately*. As suggested in the original post, the client could
notify the server that it's about to exit before it dies.

b) If the client dies via some abnormal path such as kill -9. In that
case we could easily detect the situation as quickly as we want, but
the more often we probe, the more time and CPU wakeups we waste
sending probes. We would only need to react to errors on that
connection (RST packets, which will cause a SIGIO or EOF depending on
what we ask for), not to a lack of response, so it doesn't need to
make things more fragile.

c) If the client goes away, either because it's turned off or because
the network is disconnected. This is the problem Apache faces, because
it's exposed to the internet at large. We're not entirely immune to
it, but we have much less of a problem with it. The problem here is
that there's really no easy solution at all. If you send keep-alives
and time them out, then transient network interruptions become
spurious fatal errors.





> Our advantage over MySQL is that the idle process isn't likely to crash
> anything ...




-- 
greg



Re: killing pg_dump leaves backend process

From
Jeff Janes
Date:
On Sat, Aug 10, 2013 at 4:26 AM, Greg Stark <stark@mit.edu> wrote:
>
> The problem is that I don't know of any way to detect eof on a socket
> other than trying to read from it (or calling poll or select). So the
> server would have to periodically poll the client even when it's not
> expecting any data. The inefficiency is annoying enough and it still
> won't detect the eof immediately.

Do we know how inefficient it is, compared to the baseline work done
by CHECK_FOR_INTERRUPTS() and its affiliated machinery?

...

>
> I'm surprised this is the first time we're hearing people complain
> about this. I know I've seen similar behaviour from MySQL, thought to
> myself that it represented pretty poor behaviour, and assumed Postgres
> did better.

I've seen other complaints about it (and made at least one myself).

Cheers,

Jeff



Re: killing pg_dump leaves backend process

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Sat, Aug 10, 2013 at 4:26 AM, Greg Stark <stark@mit.edu> wrote:
>> The problem is that I don't know of any way to detect eof on a socket
>> other than trying to read from it (or calling poll or select).

> Do we know how inefficient it is, compared to the baseline work done
> by CHECK_FOR_INTERRUPTS() and its affiliated machinery?

CHECK_FOR_INTERRUPTS() is about two instructions (test a global variable
and branch) in the normal case with nothing to do.  Don't even think of
putting a kernel call into it.
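
For reference, the non-Windows definition is roughly this (quoting
from memory):

#define CHECK_FOR_INTERRUPTS() \
do { \
    if (InterruptPending) \
        ProcessInterrupts(); \
} while(0)

So anything added there has to stay behind that InterruptPending test.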
        regards, tom lane



Re: killing pg_dump leaves backend process

From
Greg Stark
Date:
On Mon, Aug 12, 2013 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> On Sat, Aug 10, 2013 at 4:26 AM, Greg Stark <stark@mit.edu> wrote:
>>> The problem is that I don't know of any way to detect eof on a socket
>>> other than trying to read from it (or calling poll or select).
>
>> Do we know how inefficient it is, compared to the baseline work done
>> by CHECK_FOR_INTERRUPTS() and its affiliated machinery?
>
> CHECK_FOR_INTERRUPTS() is about two instructions (test a global variable
> and branch) in the normal case with nothing to do.  Don't even think of
> putting a kernel call into it.

So I poked around a bit. It looks like Linux does send a SIGIO when a
TCP connection is closed (with POLL_HUP if it's closed and POLL_IN if
it's half-closed). So it should be possible to arrange to get a signal
which CHECK_FOR_INTERRUPTS could handle the normal way.
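
The setup is the usual fcntl dance, at least on Linux (a sketch with
error checking omitted; not proposed backend code):

#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t client_sock_event = 0;

static void
sigio_handler(int signum)
{
    client_sock_event = 1;      /* checked later by CHECK_FOR_INTERRUPTS */
}

static void
enable_async_socket_signals(int sock)
{
    int     flags;

    signal(SIGIO, sigio_handler);
    fcntl(sock, F_SETOWN, getpid());        /* deliver SIGIO to this pid */
    flags = fcntl(sock, F_GETFL);
    fcntl(sock, F_SETFL, flags | O_ASYNC);  /* enable readiness signals */
}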

However this would mean getting a signal every time there's data
available from the client. I don't know how inefficient that would be
or how convenient it would be to turn it off and on all the time so we
aren't constantly receiving useless signals.

I'm not sure how portable this behaviour is either. There may well be
platforms where having the socket closed doesn't generate a SIGIO.

I'm not sure this is the end of the story either. OK, so the TCP
stream is closed; does that mean it's safe to end the currently
executing command? There may be a commit buffered up in the stream
that hasn't been processed yet. If you connect and send "vacuum" and
then close the connection, do you expect the vacuum to just cancel
immediately?

It does seem obvious that a SELECT shouldn't keep running, since it
will die as soon as it produces any output. It may well be that
Postgres should just document, as part of the protocol, that if the
TCP connection is closed then whatever command was running may be
terminated at any time. That's effectively true already, since really
any WARNING or INFO would do that anyway, and we don't have any policy
of discouraging those for fear of causing spurious failures.

-- 
greg



Re: killing pg_dump leaves backend process

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> So I poked around a bit. It looks like Linux does send a SIGIO when a
> tcp connection is closed (with POLL_HUP if it's closed and POLL_IN if
> it's half-closed). So it should be possible to arrange to get a signal
> which CHECK_FOR_INTERRUPTS could handle the normal way.

> However this would mean getting a signal every time there's data
> available from the client. I don't know how inefficient that would be
> or how convenient it would be to turn it off and on all the time so we
> aren't constantly receiving useless signals.

That sounds like a mess --- race conditions all over the place,
even aside from efficiency worries.

> I'm not sure how portable this behaviour is either. There may well be
> platforms where having the socket closed doesn't generate a SIGIO.

AFAICS, the POSIX spec doesn't define SIGIO at all, so this worry is
probably very real.

What I *do* see standardized in POSIX is SIGURG (out-of-band data is
available).  If that's delivered upon socket close, which unfortunately
POSIX doesn't say, then it'd avoid the race condition issue.  We don't
use out-of-band data in the protocol and could easily say that we'll
never do so in future.

Of course the elephant in the room is Windows --- does it support
any of this stuff?
        regards, tom lane



Re: killing pg_dump leaves backend process

From
Greg Stark
Date:
On Mon, Aug 12, 2013 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> That sounds like a mess --- race conditions all over the place, even aside from efficiency worries.

This I don't understand. All I'm envisioning is setting a flag in the
signal handler. If that flag is set then the next CHECK_FOR_INTERRUPTS
would check for EOF on the client input anyway (by reading some
additional data into the buffer), so any spurious signals due to races
would just be ignored.

It occurs to me, however, that it can be kind of tricky to arrange for
the communication layer to actually try to read. It may have some data
buffered up and choose not to read anything. It's possibly even going
through OpenSSL, so we may not even know whether the read actually
happened. Still, at least trying is better than not.


> AFAICS, the POSIX spec doesn't define SIGIO at all, so this worry is
> probably very real.
>
> What I *do* see standardized in POSIX is SIGURG (out-of-band data is
> available).  If that's delivered upon socket close

It's not. You're not going to get SIGURG unless data is actually sent
with MSG_OOB. That's not helpful, since if the client was aware it was
about to exit it could have happily done the existing query cancel
dance. (We could use MSG_OOB and SIGURG instead of our existing query
cancel tricks, which might be simpler, but given that we already have
the existing code and it works, I doubt anyone's going to get excited
about experimenting with replacing it with something that's rarely
used and that nobody's familiar with any more.)

I do think it's worth making it easy for clients to send a normal
cancel whenever they exit normally. That would probably cover 90% of
the actual problem cases.

> Of course the elephant in the room is Windows --- does it support
> any of this stuff?

I suspect there are three different competing APIs for doing this on
Windows, none of which is spelled the same as on Unix, but all of
which are better in various subtly different ways.

-- 
greg