Thread: BUG #5459: Unable to cancel query while in send()
The following bug has been logged online:

Bug reference: 5459
Logged by: Mason Hale
Email address: mason@onespot.com
PostgreSQL version: 8.3.8
Operating system: Redhat EL 5.1-64 bit
Description: Unable to cancel query while in send()
Details:

ISSUE: unable to cancel queries using pg_cancel_backend() that are in a send() function call, waiting on client receipt of data.

EXPECTED RESULT: expect to be able to cancel most/all queries using pg_cancel_backend() as superuser, perhaps with some wait time, but not an hour or more.

= SYMPTOM =

A SELECT query had been running for over 18 hours on our PostgreSQL 8.3.8 server. Verified via pg_stat_activity that it was not waiting on any locks.

Attempted to cancel the query using pg_cancel_backend(), which returned 't'. However, more than an hour later the process was still active, using about 6% of CPU and 5% of RAM.

Terminating the client process that was running the query (from another server) did not cause the query process on the pgsql server to stop. In this case the client was connecting via an ssh tunnel through an intermediate 'gateway' server. Connection path was:

CLIENT --> SSH GATEWAY --> DB SERVER

= DIAGNOSIS =

Diagnosed this issue with help from 'andres' in #postgresql IRC. Per his request, attached to the 'stuck' process using gdb, generating the following outputs:

- Initial backtrace: http://pgsql.privatepaste.com/6f15c7e363
- ('c', then ctrl+c, then 'bt full') x 4: http://pgsql.privatepaste.com/3d3261659a
- Stepping several times with 'n': http://pgsql.privatepaste.com/0f302125a8

'andres' reported that interrupts were not checked in send() and probably should be, and suggested opening this bug report.

Additional investigation of the ssh tunnel connection revealed that the connection on the intermediate gateway server was stuck in a FIN_WAIT2 state (as reported by netstat). The other end of the connection, on the pgsql server, was reported as CLOSE_WAIT by netstat.

Killing the ssh tunnel process on the gateway server cleared the connection, and the long-running query process on the db server terminated very soon after.
"Mason Hale" <mason@onespot.com> writes: > ISSUE: unable to cancel queries using pg_cancel_backend(), that are in > send() function call, waiting on client receipt of data. I think what you are describing is a kernel bug. There's not a lot we can do about it if the send() call hangs. Considering the kernel already knows the connection is closed (per the CLOSE_WAIT state shown by netstat) the send() should return failure immediately, and it's not doing so. There might be some TCP-level incompatibility involved between the database and gateway server TCP stacks, since the combination of the FIN_WAIT2 and CLOSE_WAIT states really ought not persist very long; but I'm not a network hacker so I'm a bit out of my depth in diagnosing that aspect of it. regards, tom lane
Hi,

On Wednesday 12 May 2010 03:44:16 Tom Lane wrote:
> "Mason Hale" <mason@onespot.com> writes:
> > ISSUE: unable to cancel queries using pg_cancel_backend(), that are in
> > send() function call, waiting on client receipt of data.
> I think what you are describing is a kernel bug. There's not a lot
> we can do about it if the send() call hangs. Considering the kernel
> already knows the connection is closed (per the CLOSE_WAIT state shown
> by netstat) the send() should return failure immediately, and it's not
> doing so.

I can reproduce the issue, though, when the connection is just very, very slow (high packet loss). Upon receiving a signal, send() returns with EINTR, at which point I think a check for interrupts might be placed.

> There might be some TCP-level incompatibility involved between the
> database and gateway server TCP stacks, since the combination of the
> FIN_WAIT2 and CLOSE_WAIT states really ought not persist very long;
> but I'm not a network hacker so I'm a bit out of my depth in diagnosing
> that aspect of it.

There is a userland implementation (ssh) involved, so that does sound likely.

Andres
Andres Freund <andres@anarazel.de> writes:
> I can reproduce the issue, though, when the connection is just very, very slow
> (high packet loss). Upon receiving a signal, send() returns with EINTR, at
> which point I think a check for interrupts might be placed.

The gdb trace you showed before gave no indication that the send() was returning, which is why I thought it was a kernel bug (or possibly a glibc bug, not sure exactly where that behavior is determined).

However, even if it did return, we can't just throw a CHECK_FOR_INTERRUPTS in there. Abandoning the send() would mean that we lose message boundary synchronization in the FE/BE protocol, because there's no way to know how many bytes of the current message got sent. The only way to get out of it would be to abort the transaction and shut down the backend without any further attempt to communicate with the client ... which is a code path that doesn't exist, and even if it did exist is surely not something that should be invoked by a simple query cancel.

In general we expect the kernel to tell us when the client connection has been lost. It appears to me that in this case the kernel failed to do that in a reasonable fashion.

regards, tom lane
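[Illustrative note] The message-boundary point above can be sketched in a few lines of Python. This is not PostgreSQL code. It only assumes the protocol 3.0 framing (one type byte, then a 4-byte big-endian length that counts itself plus the payload); the helper names `frame` and `parse_stream` and the payloads are hypothetical and simplified. It shows what a client-side parser would see if a data message were abandoned after 40 bytes and an error message were written right after it.

```python
import struct


def frame(msg_type: bytes, payload: bytes) -> bytes:
    """Build one message: type byte, 4-byte big-endian length, payload.

    The length field counts itself (4 bytes) plus the payload,
    but not the type byte.
    """
    return msg_type + struct.pack("!I", len(payload) + 4) + payload


def parse_stream(stream: bytes):
    """Split a byte stream into complete (type, payload) messages.

    Returns the parsed messages plus any trailing bytes that do not
    yet form a complete message.
    """
    messages = []
    pos = 0
    while pos + 5 <= len(stream):
        msg_type = stream[pos:pos + 1]
        (length,) = struct.unpack("!I", stream[pos + 1:pos + 5])
        end = pos + 1 + length
        if end > len(stream):
            break  # the receiver must wait for the rest of this message
        messages.append((msg_type, stream[pos + 5:end]))
        pos = end
    return messages, stream[pos:]


# Hypothetical, simplified payloads (real messages carry structured fields).
row = frame(b"D", b"x" * 100)
error = frame(b"E", b"canceling statement due to user request")
ready = frame(b"Z", b"I")

# Normal case: every message is sent in full.
intact_msgs, intact_rest = parse_stream(row + error + ready)

# Abandoned send: only the first 40 bytes of the row went out, then the
# sender started a new message anyway.
broken = row[:40] + error + ready
broken_msgs, broken_rest = parse_stream(broken)

print("intact:", [t for t, _ in intact_msgs], "leftover bytes:", len(intact_rest))
print("broken:", [t for t, _ in broken_msgs], "leftover bytes:", len(broken_rest))
```

In the broken case the receiver still believes it is inside the 100-byte row, so the error and ready messages are swallowed as row data and nothing is ever delivered as a complete message. That is the sense in which synchronization is lost once a partially written message is abandoned.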
On Wed, May 12, 2010 at 2:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think what you are describing is a kernel bug. There's not a lot
> we can do about it if the send() call hangs. Considering the kernel
> already knows the connection is closed (per the CLOSE_WAIT state shown
> by netstat) the send() should return failure immediately, and it's not
> doing so.

For what it's worth, CLOSE_WAIT means the remote end has sent a FIN but the local end hasn't closed the connection. TCP connections can live in this half-open state (or its dual) for a while, with one direction closed but the other direction still open. So send() isn't necessarily going to return an error or anything; it will expect the remote end to keep receiving data, or to send an RST if it has actually gone away.

I'm not sure I have a clear idea of the exact scenario from the description provided. It seems there should be two connections in psql -> ssh -> postgres and two endpoints for each connection, so I'm not sure which connections were in CLOSE_WAIT and FIN_WAIT2 and which two we're still missing.

I'm not sure how ssh behaves when one side closes a connection. It might not reproduce the half-open connection on either side, preventing psql/postgres from responding appropriately. I'm not even sure it's possible for it to do so reliably.

-- 
greg
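[Illustrative note] The half-closed behaviour described above can be observed on a loopback connection. The sketch below is an assumption-laden toy, not a reproduction of the reported setup: there is no ssh tunnel, both ends run in one process, and the function name `half_close_demo` is hypothetical. The "client" closes only its sending direction, which puts the "server" socket in CLOSE_WAIT, and the server can still send successfully.

```python
import socket


def half_close_demo(timeout: float = 5.0):
    """Show that a socket whose peer has sent FIN can still send data."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(timeout)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname(), timeout=timeout)
    server, _ = listener.accept()
    server.settimeout(timeout)
    try:
        # The client closes its sending direction only: a FIN goes out, and
        # the server side of the connection enters CLOSE_WAIT.
        client.shutdown(socket.SHUT_WR)

        # The server sees end-of-file on its receiving direction.
        eof = server.recv(16)

        # The server-to-client direction is still open, so this send
        # succeeds rather than failing.
        payload = b"still open"
        server.sendall(payload)

        # The client can still read, because it only shut down writing.
        received = b""
        while len(received) < len(payload):
            chunk = client.recv(64)
            if not chunk:
                break
            received += chunk
        return eof, received
    finally:
        client.close()
        server.close()
        listener.close()


if __name__ == "__main__":
    eof, received = half_close_demo()
    print("server read after peer FIN:", eof)
    print("client received after its own FIN:", received)
```

If the peer stops reading instead of replying, the sender's socket buffer eventually fills and a blocking send() waits, which is consistent with the point that CLOSE_WAIT by itself does not make send() fail.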