Thread: "Stuck" query
I have a query on a database that has been running for nearly 24 hours at the moment. The query itself doesn't seem like it should take very long to run, so it seems like there's something else going on here. The output from pg_stat_activity looks like this: SELECT procpid, waiting, current_query, query_start, backend_start FROM pg_stat_activity WHERE current_query <> '<IDLE>' AND usename <> 'postgres'; procpid | waiting | current_query | query_start | backend_start ---------+---------+-------------------------------------------------+-------------------------------+------------------------------- 17504 | f | SELECT cam.netboxid, | 2008-09-09 13:44:01.035885+02 | 2008-09-09 13:43:58.613948+02 : ifindex, : arp.ip, : REPLACE(mac::text, ':', '') AS portname, : cam.start_time, : cam.end_time, : vlan : FROM cam : JOIN netbox USING (netboxid) : JOIN arp USING (mac) : JOIN prefix ON (arp.prefixid = prefix.prefixid) : JOIN vlan USING (vlanid) : WHERE cam.end_time='infinity' : AND arp.end_time='infinity' : AND vlan IS NOT NULL : SELECT pg_cancel_backend(17504) has no effect, neither does kill 17504 from the shell. I tried "strace -p17504", and this gave me just the following output: sendto(7, "\7\0\0\0\003771\0\0\0\00224\0\0\0\017127.120.213.18"..., 968, 0, NULL, 0 <unfinished ...> Does anyone have any further troubleshooting suggestions that I can do, to figure out why this query have "crashed"? The pg version is 8.2.9 on RHEL4 -- Tommy Gildseth
Richard Huxton wrote: > Tommy Gildseth wrote: >> SELECT pg_cancel_backend(17504) has no effect, neither does kill 17504 >> from the shell. > > Strange. > >> I tried "strace -p17504", and this gave me just the following output: >> sendto(7, "\7\0\0\0\003771\0\0\0\00224\0\0\0\017127.120.213.18"..., 968, >> 0, NULL, 0 <unfinished ...> > > Looks like part of your query results being sent. Is it hung in that one > system-call? Yes, I left it there for about ~1 hour, and that was all that ever came. > >> Does anyone have any further troubleshooting suggestions that I can do, >> to figure out why this query have "crashed"? > > Is the client locked/crashed too? > If it's connected over a network, is the connection still there? We stopped the client application to see if that would make any difference, but the connection and the query on the server side is still there. A strace of the client application before it was shut down simply showed: strace -p6721 Process 6721 attached - interrupt to quit accept(18, <unfinished ...> so, not very helpfull. > >> The pg version is 8.2.9 on RHEL4 > > Fairly standard setup. I've seen various problems reported by selinux > oddities, but nothing quite like this. We don't use selinux. -- Tommy Gildseth
Tommy Gildseth <tommy.gildseth@usit.uio.no> writes: > Richard Huxton wrote: >> Looks like part of your query results being sent. Is it hung in that one >> system-call? > Yes, I left it there for about ~1 hour, and that was all that ever came. Seems like you have got a network issue. What does netstat show for the status of that connection? I don't think that a query cancel will blow PG off the send; you'd probably have to resort to kill -9 on that process (with a consequent restart of other sessions). It's odd that the kernel hasn't given up on the connection yet ... regards, tom lane
Tom Lane wrote: > Tommy Gildseth <tommy.gildseth@usit.uio.no> writes: >> Richard Huxton wrote: >>> Looks like part of your query results being sent. Is it hung in that one >>> system-call? > >> Yes, I left it there for about ~1 hour, and that was all that ever came. > > Seems like you have got a network issue. What does netstat show for the > status of that connection? > > I don't think that a query cancel will blow PG off the send; you'd > probably have to resort to kill -9 on that process (with a consequent > restart of other sessions). It's odd that the kernel hasn't given up > on the connection yet ... Netstat showed: netstat -a --tcp -p | grep 49004 tcp 0 44660 dbserver:postgres clientserver:49004 ESTABLISHED 17504/postgres: nav I went back to the server the client was running on to double check, and it seems the client process hadn't been killed off when the application was restarted. We've got some scheduled downtime tomorrow, so I think I'll just leave it till then, since it's not causing any problems as far as I can tell. -- Tommy Gildseth
On Sep 10, 2008, at 5:57 AM, Tommy Gildseth wrote: > Tom Lane wrote: >> Tommy Gildseth <tommy.gildseth@usit.uio.no> writes: >>> Richard Huxton wrote: >>>> Looks like part of your query results being sent. Is it hung in >>>> that one >>>> system-call? >>> Yes, I left it there for about ~1 hour, and that was all that ever >>> came. >> Seems like you have got a network issue. What does netstat show >> for the >> status of that connection? >> I don't think that a query cancel will blow PG off the send; you'd >> probably have to resort to kill -9 on that process (with a consequent >> restart of other sessions). It's odd that the kernel hasn't given up >> on the connection yet ... > > > Netstat showed: > > netstat -a --tcp -p | grep 49004 > tcp 0 44660 dbserver:postgres clientserver:49004 ESTABLISHED > 17504/postgres: nav > > I went back to the server the client was running on to double check, > and it seems the client process hadn't been killed off when the > application was restarted. > > We've got some scheduled downtime tomorrow, so I think I'll just > leave it till then, since it's not causing any problems as far as I > can tell. For what it's worth, I've run into a situation similar to this with a client a couple time in the last week or two (I can't say identical as I don't know all of the details about the client end of your connection). Using the client port # you can use lsof in addition to netstat (lsof -i tcp:49004) to track down the client process. In our case, the client process was a connection made via an ssh tunnel and was sitting in FIN_WAIT2 status. Killing the client process individually made everything go away nicely without any kind of extra downtime necessary. Erik Jones>, Database Administrator Engine Yard Support, Scalability, Reliability (415) 963-4410 x 260 Location: US/Pacific IRC: mage2k
Erik Jones wrote: > > On Sep 10, 2008, at 5:57 AM, Tommy Gildseth wrote: > >> Tom Lane wrote: >>> Tommy Gildseth <tommy.gildseth@usit.uio.no> writes: >>>> Richard Huxton wrote: > > For what it's worth, I've run into a situation similar to this with a > client a couple time in the last week or two (I can't say identical as I > don't know all of the details about the client end of your connection). > Using the client port # you can use lsof in addition to netstat (lsof -i > tcp:49004) to track down the client process. In our case, the client > process was a connection made via an ssh tunnel and was sitting in > FIN_WAIT2 status. Killing the client process individually made > everything go away nicely without any kind of extra downtime necessary. Thanks, I'll keep that in mind if/when it happens again. -- Tommy Gildseth