Thread: rogue process maxing CPU and unresponsive to signals
I've got a simple select query that runs every 10 minutes in order to update data in some external rrds (it lets us make pretty graphs and so forth). This has been working fine for months on end, when suddenly yesterday the badness happened. For some reason, this same query that normally takes a couple of seconds has now been stuck running for over 24 hours, maxing the CPU and generally slowing other queries down.

The external script that initiates the query has been restarted, and netstat no longer shows that connection. All subsequent calls of the same query are quick as usual, but the renegade process lingers on, unresponsive to signals. Some of the things I've tried so far (unsuccessfully):

1. Killing the process using kill from the command line (INT, TERM and HUP), as well as using pg_cancel_backend() via psql.
2. Attaching gdb to the renegade process to see what it's doing, but that hangs, forcing me to kill gdb (no problems attaching to other postgres processes, however).

Any other ideas? I'd like to avoid doing a kill -9 if at all possible. The machine is debian (sarge) running postgres 8.1.

Jon
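For anyone following along, the steps described above would look roughly like the sketch below. The database name (mydb) and backend PID (12345) are placeholders, and the pg_stat_activity column names (procpid, current_query) are the 8.1-era ones; this is just an outline of the approach, not what was literally run.

    # Find the backend that has been running its query the longest.
    psql -d mydb -c "SELECT procpid, query_start, current_query
                       FROM pg_stat_activity
                      ORDER BY query_start LIMIT 5;"

    # Ask that backend to cancel its current query (sends SIGINT internally).
    psql -d mydb -c "SELECT pg_cancel_backend(12345);"

    # Equivalent signals from the shell, as the postgres user:
    kill -INT  12345   # cancel the current query
    kill -TERM 12345   # ask the backend to exit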
On Aug 15, 2007, at 9:27 PM, Jon Jensen wrote:

> I've got a simple select query that runs every 10 minutes in order
> to update data in some external rrds (it lets us make pretty graphs
> and so forth). This has been working fine for months on end, when
> suddenly yesterday the badness happened. For some reason, this same
> query that normally takes a couple of seconds has now been stuck
> running for over 24 hours, maxing the CPU and generally slowing
> other queries down.
>
> The external script that initiates the query has been restarted,
> and netstat no longer shows that connection. All subsequent calls
> of the same query are quick as usual, but the renegade process
> lingers on, unresponsive to signals. Some of the things I've tried
> so far (unsuccessfully):
>
> 1. Killing the process using kill from the command line
> (INT, TERM and HUP), as well as using pg_cancel_backend() via psql.
> 2. Attaching gdb to the renegade process to see what
> it's doing, but that hangs, forcing me to kill gdb (no problems
> attaching to other postgres processes, however).
>
> Any other ideas? I'd like to avoid doing a kill -9 if at all
> possible. The machine is debian (sarge) running postgres 8.1.

There are a lot of parts of the code that don't check for signals, because normally they don't run for any real length of time... until they do. :) The factorial calculation is an example that was recently fixed, so it's possible that something in your query is in that same condition. You may be stuck with a kill -9, but it would be good to identify what part of the code is hung up so we can determine whether it makes sense to add signal handling.

--
Decibel!, aka Jim Nasby    decibel@decibel.org
EnterpriseDB    http://enterprisedb.com    512.569.9461 (cell)
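If gdb could attach, the usual way to see which part of the code the backend is spinning in would be a one-shot backtrace, roughly as below. The PID (4242) is a placeholder, debug symbols for the postgres binary make the output far more useful, and in this particular case gdb itself hung, so the sketch only helps against a cooperative backend.

    # Attach, print a backtrace, and detach in one shot.
    gdb -p 4242 -batch -ex 'bt'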
Jon Jensen <jon@jenseng.com> writes:
> 1. Killing the process using kill from the command line (INT,
> TERM and HUP), as well as using pg_cancel_backend() via psql.
> 2. Attaching gdb to the renegade process to see what it's
> doing, but that hangs, forcing me to kill gdb (no problems attaching to
> other postgres processes, however).
> Any other ideas? I'd like to avoid doing a kill -9 if at all possible.
> The machine is debian (sarge) running postgres 8.1.

I think you'll find that "kill -9" doesn't do anything either, and the only recourse is a system reboot. What this sounds like to me is that the kernel has gotten wedged trying to perform some operation or other on behalf of that process. Problems like a stuck disk I/O request are often found to result in unkillable, un-attachable processes.

How up-to-date is your kernel? Seen any signs of hardware problems lately?

			regards, tom lane
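One way to check this theory before rebooting is to look at the process state and wait channel: a process wedged in the kernel on a stuck I/O request typically shows state D (uninterruptible sleep). A rough sketch, with 4242 again a placeholder PID and a 2.6-era /proc layout assumed:

    # STAT column: R = running, S = sleeping, D = uninterruptible sleep.
    ps -o pid,stat,wchan:30,cmd -p 4242

    # The same wait-channel and state information straight from /proc:
    cat /proc/4242/wchan; echo
    grep '^State:' /proc/4242/status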
> I think you'll find that "kill -9" doesn't do anything either, and
> the only recourse is a system reboot. What this sounds like to me
> is that the kernel has gotten wedged trying to perform some operation
> or other on behalf of that process. Problems like a stuck disk I/O
> request are often found to result in unkillable, un-attachable processes.
>
> How up-to-date is your kernel? Seen any signs of hardware problems
> lately?
>
> 			regards, tom lane

The kernel is 2.6.15-1-686-smp, and we haven't seen any other problems on this machine till now. I was hoping there was some other solution, but you are probably right... I'll investigate some more to see if I can find any other workarounds, but I'll probably just have to bite the bullet here pretty soon.

Jon
> There are a lot of parts of the code that don't check for signals,
> because normally they don't run for any real length of time... until
> they do. :) The factorial calculation is an example that was recently
> fixed, so it's possible that something in your query is in that same
> condition. You may be stuck with a kill -9, but it would be good to
> identify what part of the code is hung up so we can determine whether
> it makes sense to add signal handling.

Yeah, if there's any information I can furnish that would help determine where it's getting stuck, let me know. I'm not really sure how to gain any visibility into what the process is doing at this point.

Jon
Jon Jensen <jon@jenseng.com> writes:
> Yeah, if there's any information I can furnish that would help determine
> where it's getting stuck, let me know. I'm not really sure how to gain
> any visibility into what the process is doing at this point.

If my theory is right that it's stuck in a kernel call, then *maybe* "strace -p pid" would show the suspended call. Or maybe not. Probably worth trying if you haven't rebooted yet.

			regards, tom lane
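Spelled out, that suggestion would look roughly like the following; 4242 is a placeholder PID, and the -f/-tt flags are optional extras rather than part of Tom's suggestion. If the backend really is suspended inside a kernel call, strace should print the unfinished syscall on attach; if strace itself hangs, that points at the same kernel wedge that blocked gdb.

    # Attach to the running backend and show the system call it is in.
    strace -p 4242

    # Optionally follow children (-f) and timestamp each line (-tt).
    strace -f -tt -p 4242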
> If my theory is right that it's stuck in a kernel call, then *maybe*
> "strace -p pid" would show the suspended call. Or maybe not. Probably
> worth trying if you haven't rebooted yet.

Tom and Jim,

Yeah, no luck on the strace either... ended up having to kill -9 it, so we may never know the root cause. Thanks for the help/suggestions though.

Jon