Thread: BUG #5459: Unable to cancel query while in send()

BUG #5459: Unable to cancel query while in send()

From

"Mason Hale"

Date:

12 May 2010, 00:11:05

The following bug has been logged online:

Bug reference:      5459
Logged by:          Mason Hale
Email address:      mason@onespot.com
PostgreSQL version: 8.3.8
Operating system:   Redhat EL 5.1-64 bit
Description:        Unable to cancel query while in send()
Details:

ISSUE: unable to cancel queries using pg_cancel_backend(), that are in
send() function call, waiting on client receipt of data.

EXPECTED RESULT: expect to be able to cancel most/all queries using
pg_cancel_backend() as superuser, perhaps with some wait time, but not an
hour or more.

= SYMPTOM =

A SELECT query was running over 18 hours on our PostgreSQL 8.3.8 server.
Verified that it was not waiting on any locks via pg_stat_activity.
Attempted to cancel the query using pg_cancel_backend(), which returned 't'.
However more than an hour later the process was still active, using about 6%
of CPU and 5% of RAM.

Terminating the client process that was running the query (from another
server) did not cause the query process on the pgsql server to stop. In this
case the client was connecting via an ssh tunnel through an intermediate
'gateway' server.

Connection path was:

   CLIENT -->  SSH GATEWAY --> DB SERVER

= DIAGNOSIS =

Diagnosed this issue with help from 'andres' in #postgresql IRC. Per his
request, attached to 'stuck' process using gdb, generating the following
outputs:

  - Initial backtrace: http://pgsql.privatepaste.com/6f15c7e363
  - ('c', then ctrl+c, then 'bt full') x 4:
http://pgsql.privatepaste.com/3d3261659a
  - Stepping several times with 'n':
http://pgsql.privatepaste.com/0f302125a8

'andres' reported that interrupts were not checked in send() and probably
should be, and suggested opening this bug report.

Additional investigation of the ssh tunnel connection revealed the
connection on the intermediate gateway server was stuck in a FIN_WAIT2 state
(as reported by netstat). The other end of the connection on the pgsql
server was reported as CLOSE_WAIT by netstat.

Killing the ssh tunnel process on the gateway server cleared the connection
and the long-running query process on the db server terminated very soon after.

Re: BUG #5459: Unable to cancel query while in send()

From

Tom Lane

Date:

12 May 2010, 01:44:26

"Mason Hale" <mason@onespot.com> writes:
> ISSUE: unable to cancel queries using pg_cancel_backend(), that are in
> send() function call, waiting on client receipt of data.

I think what you are describing is a kernel bug.  There's not a lot
we can do about it if the send() call hangs.  Considering the kernel
already knows the connection is closed (per the CLOSE_WAIT state shown
by netstat) the send() should return failure immediately, and it's not
doing so.

There might be some TCP-level incompatibility involved between the
database and gateway server TCP stacks, since the combination of the
FIN_WAIT2 and CLOSE_WAIT states really ought not persist very long;
but I'm not a network hacker so I'm a bit out of my depth in diagnosing
that aspect of it.

            regards, tom lane

Re: BUG #5459: Unable to cancel query while in send()

From

Andres Freund

Date:

12 May 2010, 08:31:05

Hi,

On Wednesday 12 May 2010 03:44:16 Tom Lane wrote:
> "Mason Hale" <mason@onespot.com> writes:
> > ISSUE: unable to cancel queries using pg_cancel_backend(), that are in
> > send() function call, waiting on client receipt of data.
> I think what you are describing is a kernel bug.  There's not a lot
> we can do about it if the send() call hangs.  Considering the kernel
> already knows the connection is closed (per the CLOSE_WAIT state shown
> by netstat) the send() should return failure immediately, and it's not
> doing so.
I can reproduce the issue, though, when the connection is just very, very slow
(high packet loss). Upon receiving a signal the send returns with EINTR, at
which point I think a check for interrupts might be placed.

> There might be some TCP-level incompatibility involved between the
> database and gateway server TCP stacks, since the combination of the
> FIN_WAIT2 and CLOSE_WAIT states really ought not persist very long;
> but I'm not a network hacker so I'm a bit out of my depth in diagnosing
> that aspect of it.
There is a userland implementation (ssh) involved, so that does sound likely.

Andres
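
As a rough illustration of the EINTR behaviour described above, here is a
Python sketch on a Unix socketpair. It is not PostgreSQL backend code, and the
names CancelRequested and blocked_send_demo are made up for this example.
Python retries an interrupted send() on its own unless the signal handler
raises, so the handler raises here to make the interruption visible. The
sketch assumes a Unix system and that it runs in the main thread.

```python
import signal
import socket


class CancelRequested(Exception):
    """Hypothetical stand-in for a cancel request delivered by a signal."""


def _on_signal(signum, frame):
    # Raising from the handler stops Python from silently retrying send().
    raise CancelRequested()


def blocked_send_demo(delay=0.2):
    """Block in send() on a peer that never reads, then interrupt it."""
    writer, reader = socket.socketpair()
    old_handler = signal.signal(signal.SIGALRM, _on_signal)
    sent = 0
    try:
        # One-shot timer: SIGALRM arrives while send() is blocked.
        signal.setitimer(signal.ITIMER_REAL, delay)
        chunk = b"x" * 65536
        try:
            while True:
                # Blocks once the socket buffer is full; reader never drains it.
                sent += writer.send(chunk)
        except CancelRequested:
            pass  # the blocked send() was interrupted by the signal
        return sent
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)
        writer.close()
        reader.close()


if __name__ == "__main__":
    print("bytes accepted before the interrupt:", blocked_send_demo())
```

The function returns instead of hanging, which is the only point of the
sketch: a signal can get a process out of a blocked send().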

Re: BUG #5459: Unable to cancel query while in send()

From

Tom Lane

Date:

12 May 2010, 15:27:05

Andres Freund <andres@anarazel.de> writes:
> I can reproduce the issue, though, when the connection is just very, very slow
> (high packet loss). Upon receiving a signal the send returns with EINTR, at
> which point I think a check for interrupts might be placed.

The gdb trace you showed before gave no indication that the send() was
returning, which is why I thought it was a kernel bug (or possibly a
glibc bug, not sure exactly where that behavior is determined).

However, even if it did return, we can't just throw a
CHECK_FOR_INTERRUPTS in there.  Abandoning the send() would mean that we
lose message boundary synchronization in the FE/BE protocol, because
there's no way to know how many bytes of the current message got sent.
The only way to get out of it would be to abort the transaction and shut
down the backend without any further attempt to communicate with the
client ... which is a code path that doesn't exist, and even if it did
exist is surely not something that should be invoked by a simple query
cancel.

In general we expect the kernel to tell us when the client connection
has been lost.  It appears to me that in this case the kernel failed to
do that in a reasonable fashion.

            regards, tom lane
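
The loss of message boundary synchronization can be sketched in a few lines of
Python. This is a simplified model, not backend code: it assumes the v3
protocol framing of one type byte followed by a 4-byte big-endian length that
counts itself plus the payload. The helper names encode and decode are
hypothetical, and the payloads are placeholders rather than real wire contents.

```python
import struct


def encode(msg_type, payload):
    # Type byte, then int32 length (includes its own 4 bytes), then payload.
    return msg_type + struct.pack("!I", len(payload) + 4) + payload


def decode(stream):
    """Return (complete messages, unparsed leftover bytes)."""
    messages = []
    pos = 0
    while len(stream) - pos >= 5:
        msg_type = stream[pos:pos + 1]
        (length,) = struct.unpack("!I", stream[pos + 1:pos + 5])
        end = pos + 1 + length
        if end > len(stream):
            break  # receiver is still waiting for the rest of this message
        messages.append((msg_type, stream[pos + 5:end]))
        pos = end
    return messages, stream[pos:]


row = encode(b"D", b"x" * 100)  # placeholder data row
error = encode(b"E", b"canceling statement due to user request")

# Normal case: both messages arrive whole and parse cleanly.
intact = decode(row + error)

# Abandoned send: only 40 bytes of the row went out before the error
# message was written. The receiver still expects the rest of the row,
# so it treats the error message bytes as row data.
broken = decode(row[:40] + error)

if __name__ == "__main__":
    print("intact types:", [t for t, _ in intact[0]])
    print("broken types:", [t for t, _ in broken[0]])
    print("bytes stuck in the receiver:", len(broken[1]))
```

In the broken case the receiver parses no message at all and has no way to
find where the error message starts, since the sender does not know how many
bytes of the row were delivered.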

Re: BUG #5459: Unable to cancel query while in send()

From

Greg Stark

Date:

31 May 2010, 01:59:24

On Wed, May 12, 2010 at 2:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think what you are describing is a kernel bug.  There's not a lot
> we can do about it if the send() call hangs.  Considering the kernel
> already knows the connection is closed (per the CLOSE_WAIT state shown
> by netstat) the send() should return failure immediately, and it's not
> doing so.
>

For what it's worth CLOSE_WAIT means the remote end has sent a FIN but
the local end hasn't closed the connection. TCP connections can live
in this half-open state (or its dual) for a while with one direction
closed but the other direction still open. So send() isn't necessarily
going to return an error or anything, it will expect the remote end to
keep receiving data or send an RST if it's actually gone away.

I'm not sure I have a clear idea of the exact scenario from the
description provided. It seems there should be two connections in psql
-> ssh -> postgres and two endpoints for each connection, so I'm not
sure which connections were in CLOSE_WAIT and FIN_WAIT2 and which two
we're still missing.

I'm not sure how ssh behaves when one side closes a connection. It
might not reproduce the half-open connection on either side,
preventing psql/postgres from responding appropriately. I'm not even
sure it's possible for it to do so reliably.

-- 
greg
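
The point about CLOSE_WAIT can be checked with a small loopback experiment in
Python. It is only a sketch: the function name half_close_demo is made up, it
assumes local TCP sockets are available, and it says nothing about how ssh
handled the connection in the reported case. One side sends a FIN with
shutdown(SHUT_WR), which puts the other side into CLOSE_WAIT, and that other
side can still send data successfully.

```python
import socket


def half_close_demo():
    """Peer half-closes; the local end (now in CLOSE_WAIT) can still send."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free loopback port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=5)
    server, _ = listener.accept()
    server.settimeout(5)
    try:
        # Client sends FIN but keeps its receive direction open.
        client.shutdown(socket.SHUT_WR)

        # Server sees end-of-stream; its socket is now in CLOSE_WAIT.
        eof = server.recv(1024)

        # send() on the CLOSE_WAIT socket does not fail.
        message = b"still open"
        server.sendall(message)

        # Client can still read what the server sent.
        received = b""
        while len(received) < len(message):
            data = client.recv(1024)
            if not data:
                break
            received += data
        return eof, received
    finally:
        client.close()
        server.close()
        listener.close()


if __name__ == "__main__":
    print(half_close_demo())
```

So a CLOSE_WAIT state on the database server does not by itself mean that
send() should fail. An error would only be expected once the remote end
answers with an RST or the connection times out.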