On Tue, Mar 29, 2011 at 2:54 PM, Derrick Rice
<derrick.rice@gmail.com> wrote:
Try trussing the backend process. You may find it in a network IO wait
trying to send data to a client that is hung or over a socket that was
timed out by a firewall or network equipment.
Such a condition will cause the backend to be unable to hear the
cancel. The statement will still show as running in pg_stat_activity.
SIGTERM on such a backend will probably also fall on deaf ears.
I'm aware of that condition, which is exactly what the keepalive settings are supposed to detect.
So I spent some time reading the Linux 2.6 TCP code, and my previous statement is downright wrong. Keepalive is only in use when there is no unacknowledged data and no data waiting to be sent. When data is unacknowledged or still queued, retransmission timeouts are in use instead.
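A rough way to picture that rule is the sketch below. It is a simplified model of the behaviour described above, not the kernel's actual logic; the function name and its parameters are made up for illustration.

```python
def active_timer(unacked_bytes, unsent_bytes, peer_window, keepalive_enabled):
    """Simplified model of which TCP timer guards an established connection.

    Hypothetical helper for illustration only; the real kernel code
    tracks considerably more state than these four values.
    """
    if unacked_bytes > 0:
        # Data is in flight and not yet acknowledged.
        return "retransmit"
    if unsent_bytes > 0 and peer_window == 0:
        # Data is queued but the peer advertises a zero window.
        return "persist"
    if unsent_bytes == 0 and keepalive_enabled:
        # Connection is idle: nothing in flight, nothing queued.
        return "keepalive"
    return "none"


print(active_timer(0, 0, 65535, True))     # idle connection
print(active_timer(1000, 0, 65535, True))  # data in flight
print(active_timer(0, 500, 0, True))       # zero window, data queued
```

The point of the model is only that keepalive never fires while data is in flight or queued, so the keepalive settings cannot detect a peer that vanished mid-transfer.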
In any case, I would have expected a retransmission timeout. My new hypothesis, based on output from `ss', is that a firewall, NAT, or VPN on my users' side is putting the connection into persist mode (advertising a window size of 0) when the end point of the connection is unresponsive. Furthermore, I think that firewall continues to respond to my machine's persist probes until it finally decides that the end point is gone, at which point it may start ignoring further probes, which starts the retransmission timeouts on my machine.
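For anyone wanting to watch for the same thing, here is a small sketch that pulls the timer field out of a line of `ss -o' output. It assumes the timer is printed as timer:(name,expiry,count), with the names on, keepalive, persist and timewait; the sample line and addresses are invented, not taken from my machine.

```python
import re

# Assumed format of the timer field printed by `ss -o`.
TIMER_RE = re.compile(
    r"timer:\((?P<name>[a-z]+),(?P<expires>[^,]+),(?P<count>\d+)\)"
)

# "on" is how ss labels the retransmission timer.
LABELS = {
    "on": "retransmit",
    "keepalive": "keepalive",
    "persist": "persist (zero-window probe)",
    "timewait": "timewait",
}


def classify_timer(ss_line):
    """Return (timer label, time to expiry, count), or None if no timer is shown."""
    match = TIMER_RE.search(ss_line)
    if match is None:
        return None
    name = match.group("name")
    return LABELS.get(name, name), match.group("expires"), int(match.group("count"))


# Hypothetical sample line for illustration.
sample = "ESTAB 0 14480 10.0.0.5:5432 192.0.2.10:51234 timer:(persist,1min52sec,0)"
print(classify_timer(sample))
```

A backend stuck with a non-empty send queue and a persist timer would be consistent with the zero-window hypothesis above, while a timer reported as "on" would point at plain retransmission.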
So I'm not looking for any further help here, since this isn't a PostgreSQL issue. If I resolve the problem I'll let you all know just for entertainment purposes :)
Thanks
Derrick