Thread: "Stuck" query

"Stuck" query

From

Tommy Gildseth

Date:

10 September 2008, 07:10:01

I have a query on a database that has been running for nearly 24 hours
at the moment. The query itself doesn't seem like it should take very
long to run, so it seems like there's something else going on here.

The output from pg_stat_activity looks like this:
SELECT procpid, waiting, current_query, query_start, backend_start FROM
pg_stat_activity WHERE current_query <> '<IDLE>' AND usename <> 'postgres';
  procpid | waiting |                  current_query                  |
          query_start          |         backend_start

---------+---------+-------------------------------------------------+-------------------------------+-------------------------------
    17504 | f       | SELECT cam.netboxid,                            |
2008-09-09 13:44:01.035885+02 | 2008-09-09 13:43:58.613948+02
                    :        ifindex,
                    :        arp.ip,
                    :        REPLACE(mac::text, ':', '') AS portname,
                    :        cam.start_time,
                    :        cam.end_time,
                    :        vlan
                    : FROM cam
                    : JOIN netbox USING (netboxid)
                    : JOIN arp USING (mac)
                    : JOIN prefix ON (arp.prefixid = prefix.prefixid)
                    : JOIN vlan USING (vlanid)
                    : WHERE cam.end_time='infinity'
                    :   AND arp.end_time='infinity'
                    :   AND vlan IS NOT NULL
                    :

SELECT pg_cancel_backend(17504) has no effect, neither does kill 17504
from the shell.

I tried "strace -p17504", and this gave me just the following output:
sendto(7, "\7\0\0\0\003771\0\0\0\00224\0\0\0\017127.120.213.18"..., 968,
0, NULL, 0 <unfinished ...>


Does anyone have any further troubleshooting suggestions that I can do,
to figure out why this query have "crashed"?

The pg version is 8.2.9 on RHEL4


--
Tommy Gildseth

Re: "Stuck" query

From

Tommy Gildseth

Date:

10 September 2008, 07:30:03

Richard Huxton wrote:
> Tommy Gildseth wrote:
>> SELECT pg_cancel_backend(17504) has no effect, neither does kill 17504
>> from the shell.
>
> Strange.
>
>> I tried "strace -p17504", and this gave me just the following output:
>> sendto(7, "\7\0\0\0\003771\0\0\0\00224\0\0\0\017127.120.213.18"..., 968,
>> 0, NULL, 0 <unfinished ...>
>
> Looks like part of your query results being sent. Is it hung in that one
> system-call?


Yes, I left it there for about ~1 hour, and that was all that ever came.


>
>> Does anyone have any further troubleshooting suggestions that I can do,
>> to figure out why this query have "crashed"?
>
> Is the client locked/crashed too?
> If it's connected over a network, is the connection still there?


We stopped the client application to see if that would make any
difference, but the connection and the query on the server side is still
there.

A strace of the client application before it was shut down simply showed:

strace -p6721
Process 6721 attached - interrupt to quit
accept(18,  <unfinished ...>

so, not very helpfull.


>
>> The pg version is 8.2.9 on RHEL4
>
> Fairly standard setup. I've seen various problems reported by selinux
> oddities, but nothing quite like this.

We don't use selinux.


--
Tommy Gildseth

Re: "Stuck" query

From

Tom Lane

Date:

10 September 2008, 09:26:56

Tommy Gildseth <tommy.gildseth@usit.uio.no> writes:
> Richard Huxton wrote:
>> Looks like part of your query results being sent. Is it hung in that one
>> system-call?

> Yes, I left it there for about ~1 hour, and that was all that ever came.

Seems like you have got a network issue.  What does netstat show for the
status of that connection?

I don't think that a query cancel will blow PG off the send; you'd
probably have to resort to kill -9 on that process (with a consequent
restart of other sessions).  It's odd that the kernel hasn't given up
on the connection yet ...

            regards, tom lane

Re: "Stuck" query

From

Tommy Gildseth

Date:

10 September 2008, 09:57:51

Tom Lane wrote:
> Tommy Gildseth <tommy.gildseth@usit.uio.no> writes:
>> Richard Huxton wrote:
>>> Looks like part of your query results being sent. Is it hung in that one
>>> system-call?
>
>> Yes, I left it there for about ~1 hour, and that was all that ever came.
>
> Seems like you have got a network issue.  What does netstat show for the
> status of that connection?
>
> I don't think that a query cancel will blow PG off the send; you'd
> probably have to resort to kill -9 on that process (with a consequent
> restart of other sessions).  It's odd that the kernel hasn't given up
> on the connection yet ...


Netstat showed:

netstat -a --tcp -p | grep 49004
tcp  0  44660 dbserver:postgres clientserver:49004  ESTABLISHED
17504/postgres: nav

I went back to the server the client was running on to double check, and
it seems the client process hadn't been killed off when the application
was restarted.

We've got some scheduled downtime tomorrow, so I think I'll just leave
it till then, since it's not causing any problems as far as I can tell.


--
Tommy Gildseth

Re: "Stuck" query

From

Erik Jones

Date:

10 September 2008, 12:53:57

On Sep 10, 2008, at 5:57 AM, Tommy Gildseth wrote:

> Tom Lane wrote:
>> Tommy Gildseth <tommy.gildseth@usit.uio.no> writes:
>>> Richard Huxton wrote:
>>>> Looks like part of your query results being sent. Is it hung in
>>>> that one
>>>> system-call?
>>> Yes, I left it there for about ~1 hour, and that was all that ever
>>> came.
>> Seems like you have got a network issue.  What does netstat show
>> for the
>> status of that connection?
>> I don't think that a query cancel will blow PG off the send; you'd
>> probably have to resort to kill -9 on that process (with a consequent
>> restart of other sessions).  It's odd that the kernel hasn't given up
>> on the connection yet ...
>
>
> Netstat showed:
>
> netstat -a --tcp -p | grep 49004
> tcp  0  44660 dbserver:postgres clientserver:49004  ESTABLISHED
> 17504/postgres: nav
>
> I went back to the server the client was running on to double check,
> and it seems the client process hadn't been killed off when the
> application was restarted.
>
> We've got some scheduled downtime tomorrow, so I think I'll just
> leave it till then, since it's not causing any problems as far as I
> can tell.

For what it's worth, I've run into a situation similar to this with a
client a couple time in the last week or two (I can't say identical as
I don't know all of the details about the client end of your
connection).  Using the client port # you can use lsof in addition to
netstat (lsof -i tcp:49004) to track down the client process.  In our
case, the client process was a connection made via an ssh tunnel and
was sitting in FIN_WAIT2 status.  Killing the client process
individually made everything go away nicely without any kind of extra
downtime necessary.

Erik Jones>, Database Administrator
Engine Yard
Support, Scalability, Reliability
(415) 963-4410 x 260
Location: US/Pacific
IRC: mage2k

Re: "Stuck" query

From

Tommy Gildseth

Date:

11 September 2008, 03:49:30

Erik Jones wrote:
>
> On Sep 10, 2008, at 5:57 AM, Tommy Gildseth wrote:
>
>> Tom Lane wrote:
>>> Tommy Gildseth <tommy.gildseth@usit.uio.no> writes:
>>>> Richard Huxton wrote:
>
> For what it's worth, I've run into a situation similar to this with a
> client a couple time in the last week or two (I can't say identical as I
> don't know all of the details about the client end of your connection).
> Using the client port # you can use lsof in addition to netstat (lsof -i
> tcp:49004) to track down the client process.  In our case, the client
> process was a connection made via an ssh tunnel and was sitting in
> FIN_WAIT2 status.  Killing the client process individually made
> everything go away nicely without any kind of extra downtime necessary.


Thanks, I'll keep that in mind if/when it happens again.


--
Tommy Gildseth