Thread: killing pg_dump leaves backend process
I noticed pg_dump does not exit gracefully when killed.

start pg_dump
kill pg_dump by ctrl-c
ps x

27246 ?        Ds    96:02 postgres: t-ishii dbt3 [local] COPY
29920 ?        S      0:00 sshd: ishii@pts/5
29921 pts/5    Ss     0:00 -bash
30172 ?        Ss     0:00 postgres: t-ishii dbt3 [local] LOCK TABLE waiting

As you can see, after killing pg_dump, a backend process (LOCK TABLE waiting) is left behind. I think this could be easily fixed by adding a signal handler to pg_dump so that it catches the signal and issues a query cancel request. Thoughts?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Tatsuo Ishii <ishii@postgresql.org> writes: > I noticed pg_dump does not exit gracefully when killed. > start pg_dump > kill pg_dump by ctrl-c > ps x > 27246 ? Ds 96:02 postgres: t-ishii dbt3 [local] COPY > 29920 ? S 0:00 sshd: ishii@pts/5 > 29921 pts/5 Ss 0:00 -bash > 30172 ? Ss 0:00 postgres: t-ishii dbt3 [local] LOCK TABLE waiting > As you can see, after killing pg_dump, a backend process is (LOCK > TABLE waiting) left behind. I think this could be easily fixed by > adding signal handler to pg_dump so that it catches the signal and > issues a query cancel request. If we think that's a problem (which I'm not convinced of) then pg_dump is the wrong place to fix it. Any other client would behave the same if it were killed while waiting for some backend query. So the right fix would involve figuring out a way for the backend to kill itself if the client connection goes away while it's waiting. regards, tom lane
On Sat, Aug 10, 2013 at 5:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Any other client would behave the same > if it were killed while waiting for some backend query. So the right > fix would involve figuring out a way for the backend to kill itself > if the client connection goes away while it's waiting. Well I'm not sure. Maybe every other client should also issue a query cancel and close the connection if it gets killed. libpq could offer a function specifically for programs to call from atexit(), signal handlers, or exception handlers (yes, that might be a bit tricky). But I do see a convincing argument for doing something in the server. Namely that if you kill -9 the client surely the server should still detect that the connection has gone away immediately. The problem is that I don't know of any way to detect eof on a socket other than trying to read from it (or calling poll or select). So the server would have to periodically poll the client even when it's not expecting any data. The inefficiency is annoying enough and it still won't detect the eof immediately. I would actually tend to think libpq should offer a way to easily send a cancel and disconnect message immediately upon exiting or closing the connection *and* the server should periodically poll to check for the connection being cleanly closed to handle kill -9. I'm surprised this is the first time we're hearing people complain about this. I know I've seen similar behaviour from Mysql and thought to myself that represented pretty poor behaviour and assumed Postgres did better. -- greg
> On Sat, Aug 10, 2013 at 5:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Any other client would behave the same
>> if it were killed while waiting for some backend query.  So the right
>> fix would involve figuring out a way for the backend to kill itself
>> if the client connection goes away while it's waiting.

I am a little bit surprised to hear the response. I'm talking about one of the client programs that are part of PostgreSQL. IMO they should satisfy a higher standard than other PostgreSQL application programs in error control and signal handling.

> Well I'm not sure. Maybe every other client should also issue a query
> cancel and close the connection if it gets killed. libpq could offer a
> function specifically for programs to call from atexit(), signal
> handlers, or exception handlers (yes, that might be a bit tricky).

I'm not sure this is a duty of libpq. Different applications need to behave differently when catching signals, so I think it would be better to leave that job to the applications.

> But I do see a convincing argument for doing something in the server.
> Namely that if you kill -9 the client surely the server should still
> detect that the connection has gone away immediately.
>
> The problem is that I don't know of any way to detect eof on a socket
> other than trying to read from it (or calling poll or select). So the
> server would have to periodically poll the client even when it's not
> expecting any data. The inefficiency is annoying enough and it still
> won't detect the eof immediately.

I think in some cases reading from the socket is not reliable enough to detect a broken connection; writing to the socket is more reliable. For this purpose Pgpool-II periodically sends a "parameter status" packet to the frontend while waiting for a response from the backend, to detect whether the socket is broken. Probably the PostgreSQL backend could do something similar.
> I would actually tend to think libpq should offer a way to easily send > a cancel and disconnect message immediately upon exiting or closing > the connection *and* the server should periodically poll to check for > the connection being cleanly closed to handle kill -9. > > I'm surprised this is the first time we're hearing people complain > about this. I know I've seen similar behaviour from Mysql and thought > to myself that represented pretty poor behaviour and assumed Postgres > did better. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On Sat, Aug 10, 2013 at 12:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Tatsuo Ishii <ishii@postgresql.org> writes: >> I noticed pg_dump does not exit gracefully when killed. >> start pg_dump >> kill pg_dump by ctrl-c >> ps x > >> 27246 ? Ds 96:02 postgres: t-ishii dbt3 [local] COPY >> 29920 ? S 0:00 sshd: ishii@pts/5 >> 29921 pts/5 Ss 0:00 -bash >> 30172 ? Ss 0:00 postgres: t-ishii dbt3 [local] LOCK TABLE waiting > >> As you can see, after killing pg_dump, a backend process is (LOCK >> TABLE waiting) left behind. I think this could be easily fixed by >> adding signal handler to pg_dump so that it catches the signal and >> issues a query cancel request. > > If we think that's a problem (which I'm not convinced of) then pg_dump > is the wrong place to fix it. Any other client would behave the same > if it were killed while waiting for some backend query. So the right > fix would involve figuring out a way for the backend to kill itself > if the client connection goes away while it's waiting. This seems to me to be quite a bit like the TCP keepalive issue. We noticed with Slony that if something ungraceful happens in the networking layer (the specific thing noticed was someone shutting off networking, e.g. "/etc/init.d/networking stop" before shutting down Postgres+Slony), the usual timeouts are really rather excessive, on the order of a couple hours. Probably it would be desirable to reduce the timeout period, so that the server could figure out that clients are incommunicado "reasonably quickly." It's conceivable that it would be apropos to diminish the timeout values in postgresql.conf, or at least to recommend that users consider doing so. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
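For the kernel-level side of this, the server already exposes TCP keepalive knobs as GUCs. A sketch of postgresql.conf settings that shrink the multi-hour default detection window; the numbers below are illustrative examples, not recommendations:

```
# Detect dead client connections faster than the kernel defaults
# (often around two hours of idle before the first probe).
# 0 means "use the operating system default".
tcp_keepalives_idle = 60        # seconds idle before sending keepalives
tcp_keepalives_interval = 10    # seconds between unanswered probes
tcp_keepalives_count = 5        # lost probes before dropping the connection
```

Note these only help for connections the kernel considers established but unreachable; they do nothing extra for a cleanly closed client.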
On Sat, Aug 10, 2013 at 12:26:43PM +0100, Greg Stark wrote: > On Sat, Aug 10, 2013 at 5:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Any other client would behave the same > > if it were killed while waiting for some backend query. So the right > > fix would involve figuring out a way for the backend to kill itself > > if the client connection goes away while it's waiting. > > Well I'm not sure. Maybe every other client should also issue a query > cancel and close the connection if it gets killed. libpq could offer a > function specifically for programs to call from atexit(), signal > handlers, or exception handlers (yes, that might be a bit tricky). > > But I do see a convincing argument for doing something in the server. > Namely that if you kill -9 the client surely the server should still > detect that the connection has gone away immediately. I agree that both efforts have value. A client-side change can't replace the server-side change, and tightening the client side will be more of a neatness measure once the server-side mechanism is in place. > The problem is that I don't know of any way to detect eof on a socket > other than trying to read from it (or calling poll or select). So the > server would have to periodically poll the client even when it's not > expecting any data. The inefficiency is annoying enough and it still > won't detect the eof immediately. Yes, I think that is the way to do it. The check interval could default to something like 90s, high enough to make the cost disappear into the noise and yet a dramatic improvement over the current "no fixed time limit". I bet the utils/timeout.h infrastructure added in 9.3 will make this at least 60% easier to implement than it would have been before. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
I think this is utterly the wrong way to think about this.

TCP is designed to be robust against transient network outages. They are *not* supposed to cause disconnections. The purpose of keepalives is to distinguish connections that are still valid live connections from ones that are stale, where the remote end is no longer present.

Keepalives that trigger on a timescale of less than several times the MSL are just broken and make TCP unreliable. That means they cannot trigger in less than many minutes.

This case is one that should just work, and work immediately. From the user's point of view, when a client cleanly dies the kernel on the client side is fully aware of the connection being closed and the network is working fine. The server should be aware the client has gone away *immediately*. There's no excuse for any polling or timeouts.

--
greg

On 10 Aug 2013 17:30, "Christopher Browne" <cbbrowne@gmail.com> wrote:
> This seems to me to be quite a bit like the TCP keepalive issue.
>
> We noticed with Slony that if something ungraceful happens in the
> networking layer (the specific thing noticed was someone shutting off
> networking, e.g. "/etc/init.d/networking stop" before shutting down
> Postgres+Slony), the usual timeouts are really rather excessive, on
> the order of a couple hours.
>
> Probably it would be desirable to reduce the timeout period, so that
> the server could figure out that clients are incommunicado "reasonably
> quickly."  It's conceivable that it would be apropos to diminish the
> timeout values in postgresql.conf, or at least to recommend that users
> consider doing so.
On 08/10/2013 04:26 AM, Greg Stark wrote: > On Sat, Aug 10, 2013 at 5:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Any other client would behave the same >> if it were killed while waiting for some backend query. So the right >> fix would involve figuring out a way for the backend to kill itself >> if the client connection goes away while it's waiting. I've been waiting forever to have something we can justifiably call the "loner suicide patch". ;-) > I'm surprised this is the first time we're hearing people complain > about this. I know I've seen similar behaviour from Mysql and thought > to myself that represented pretty poor behaviour and assumed Postgres > did better. No, it's been a chronic issue since we got SMP support, pretty much forever. Why do you think we have pg_terminate_backend()? The problem, as explored downthread, is that there's no clear way to fix this. It's a problem which goes pretty far beyond PostgreSQL; you can experience the same issue on Apache with stuck downloads. Our advantage over MySQL is that the idle process isn't likely to crash anything ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Sun, Aug 11, 2013 at 9:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
> No, it's been a chronic issue since we got SMP support, pretty much
> forever.  Why do you think we have pg_terminate_backend()?
>
> The problem, as explored downthread, is that there's no clear way to fix
> this.  It's a problem which goes pretty far beyond PostgreSQL; you can
> experience the same issue on Apache with stuck downloads.

No. There are multiple problems that can cause a stuck orphaned server process, and I think you're conflating different kinds of problems.

a) The client dies via C-c or kill or any other normal exit path. There's really no excuse for not detecting that situation *immediately*. As suggested in the original post, the client could notify the server before it dies that it's about to die.

b) The client dies on some abnormal path such as kill -9. In that case we could easily detect the situation as quickly as we want, but the more often we probe, the more time and cpu wakeups we waste sending probes. We would only need to react to errors on that connection (RST packets, which will cause a SIGIO or eof depending on what we ask for), not to a lack of response, so it doesn't need to make things more fragile.

c) The client goes away either because it's turned off or the network is disconnected. This is the problem Apache faces because it's exposed to the internet at large. We're not entirely immune to it, but we have much less of a problem with it. The problem here is that there's really no easy solution at all. If you send keepalives and time them out, then transient network outages become spurious fatal errors.

> Our advantage over MySQL is that the idle process isn't likely to crash
> anything ...

--
greg
On Sat, Aug 10, 2013 at 4:26 AM, Greg Stark <stark@mit.edu> wrote:
>
> The problem is that I don't know of any way to detect eof on a socket
> other than trying to read from it (or calling poll or select). So the
> server would have to periodically poll the client even when it's not
> expecting any data. The inefficiency is annoying enough and it still
> won't detect the eof immediately.

Do we know how inefficient it is, compared to the baseline work done by CHECK_FOR_INTERRUPTS() and its affiliated machinery?

...

> I'm surprised this is the first time we're hearing people complain
> about this. I know I've seen similar behaviour from Mysql and thought
> to myself that represented pretty poor behaviour and assumed Postgres
> did better.

I've seen other complaints about it (and made at least one myself).

Cheers,

Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > On Sat, Aug 10, 2013 at 4:26 AM, Greg Stark <stark@mit.edu> wrote: >> The problem is that I don't know of any way to detect eof on a socket >> other than trying to read from it (or calling poll or select). > Do we know how inefficient it is, compared to the baseline work done > by CHECK_FOR_INTERRUPTS() and its affiliated machinery? CHECK_FOR_INTERRUPTS() is about two instructions (test a global variable and branch) in the normal case with nothing to do. Don't even think of putting a kernel call into it. regards, tom lane
On Mon, Aug 12, 2013 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> On Sat, Aug 10, 2013 at 4:26 AM, Greg Stark <stark@mit.edu> wrote:
>>> The problem is that I don't know of any way to detect eof on a socket
>>> other than trying to read from it (or calling poll or select).
>
>> Do we know how inefficient it is, compared to the baseline work done
>> by CHECK_FOR_INTERRUPTS() and its affiliated machinery?
>
> CHECK_FOR_INTERRUPTS() is about two instructions (test a global variable
> and branch) in the normal case with nothing to do.  Don't even think of
> putting a kernel call into it.

So I poked around a bit. It looks like Linux does send a SIGIO when a tcp connection is closed (with POLL_HUP if it's closed and POLL_IN if it's half-closed). So it should be possible to arrange to get a signal which CHECK_FOR_INTERRUPTS could handle the normal way.

However, this would mean getting a signal every time there's data available from the client. I don't know how inefficient that would be, or how convenient it would be to turn it off and on all the time so we aren't constantly receiving useless signals. I'm not sure how portable this behaviour is either; there may well be platforms where having the socket closed doesn't generate a SIGIO.

I'm not sure this is the end of the story either. Ok, so the tcp stream is closed; does that mean it's safe to end the currently executing command? There may be a commit buffered up in the stream that hasn't been processed yet. If you connect, send "vacuum", and then close the connection, do you expect the vacuum to just cancel immediately? It does seem obvious that a select shouldn't keep running, since it will die as soon as it produces any output.
It may well be that Postgres should just document it as part of the protocol that if the tcp connection is closed, whatever command was running may be terminated at any time. That's effectively true already, since any WARNING or INFO message sent on the dead connection would have the same effect, and we don't have any policy of discouraging those for fear of causing spurious failures.

--
greg
Greg Stark <stark@mit.edu> writes: > So I poked around a bit. It looks like Linux does send a SIGIO when a > tcp connection is closed (with POLL_HUP if it's closed and POLL_IN if > it's half-closed). So it should be possible to arrange to get a signal > which CHECK_FOR_INTERRUPTS could handle the normal way. > However this would mean getting a signal every time there's data > available from the client. I don't know how inefficient that would be > or how convenient it would be to turn it off and on all the time so we > aren't constantly receiving useless signals. That sounds like a mess --- race conditions all over the place, even aside from efficiency worries. > I'm not sure how portal this behaviour is either. There may well be > platforms where having the socket closed doesn't generate a SIGIO. AFAICS, the POSIX spec doesn't define SIGIO at all, so this worry is probably very real. What I *do* see standardized in POSIX is SIGURG (out-of-band data is available). If that's delivered upon socket close, which unfortunately POSIX doesn't say, then it'd avoid the race condition issue. We don't use out-of-band data in the protocol and could easily say that we'll never do so in future. Of course the elephant in the room is Windows --- does it support any of this stuff? regards, tom lane
On Mon, Aug 12, 2013 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That sounds like a mess --- race conditions all over the place, even
> aside from efficiency worries.

This I don't understand. All I'm envisioning is setting a flag in the signal handler. If that flag is set, then the next CHECK_FOR_INTERRUPTS would check for eof on the client input anyway (by reading some additional data into the buffer), so any spurious signals due to races would just be ignored.

It occurs to me it can be kind of tricky to arrange for the communication layer to actually try to read, however. It may have some data buffered up and choose not to read anything. It's possibly even going through openssl, so we may not even know whether the read actually happened. Still, at least trying is better than not.

> AFAICS, the POSIX spec doesn't define SIGIO at all, so this worry is
> probably very real.
>
> What I *do* see standardized in POSIX is SIGURG (out-of-band data is
> available).  If that's delivered upon socket close

It's not. You're not going to get SIGURG unless data is sent with MSG_OOB. That's not helpful, since if the client was actually aware it was about to exit it could have happily done the existing query cancel dance.

(We could use MSG_OOB and SIGURG instead of our existing query cancel tricks, which might be simpler, but given that we already have the existing code and it works, I doubt anyone's going to get excited about experimenting with replacing it with something that's rarely used and nobody's familiar with any more.)

I do think it's worth making it easy for clients to send a normal cancel whenever they exit normally. That would probably cover 90% of the actual problem cases.

> Of course the elephant in the room is Windows --- does it support
> any of this stuff?

I suspect there are three different competing APIs for doing this on Windows, none of which is spelled the same as Unix but all of which are better in various subtly different ways.

--
greg