Thread: rogue process maxing cpu and unresponsive to signals

rogue process maxing cpu and unresponsive to signals

From
Jon Jensen
Date:
I've got a simple select query that runs every 10 minutes in order to
update data in some external rrds (it lets us make pretty graphs and so
forth). This has been working fine for months on end, when suddenly
yesterday the badness happened. For some reason, this same query that
normally takes a couple seconds has now been stuck running for over 24
hours, maxing the CPU and generally slowing other queries down.

The external script that initiates the query has been restarted, and
netstat no longer shows that connection. All subsequent calls of the
same query are quick as usual, but the renegade process lingers on,
unresponsive to signals. Some of the things I've tried so far
(unsuccessfully):

1. I've tried killing the process using kill from the command-line (INT,
TERM and HUP), as well as using pg_cancel_backend() via psql.
2. I've tried attaching gdb to the renegade process to see what it's
doing, but that hangs, forcing me to kill gdb (no problems attaching to
other postgres processes however).

Any other ideas? I'd like to avoid doing a kill -9 if at all possible.
The machine is debian (sarge) running postgres 8.1.

Jon
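
The signal attempts listed in the message above can be sketched as a small script. This is only an illustrative sketch in Python, not something posted in the thread: `pid_exists` and `try_signals` are hypothetical helper names, and the 5-second wait is an arbitrary assumption. For a PostgreSQL backend, SIGINT is the signal that pg_cancel_backend() sends.

```python
import os
import signal
import time


def pid_exists(pid):
    """Return True if a process with this pid exists.

    Signal 0 delivers nothing; it only performs the existence and
    permission checks. Note that a zombie still counts as existing.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def try_signals(pid, signals=(signal.SIGINT, signal.SIGTERM),
                wait_s=5.0, exists=pid_exists):
    """Send each signal in turn, waiting up to wait_s seconds after each.

    Returns the signal after which the process was gone, or None if it
    survived all of them. SIGKILL is deliberately not in the default list.
    """
    for sig in signals:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            return sig  # already gone before this signal was delivered
        deadline = time.monotonic() + wait_s
        while time.monotonic() < deadline:
            if not exists(pid):
                return sig
            time.sleep(0.1)
    return None
```

A return value of None corresponds to the situation described above: the process ignored every signal that was tried.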

Re: rogue process maxing cpu and unresponsive to signals

From
Decibel!
Date:
On Aug 15, 2007, at 9:27 PM, Jon Jensen wrote:
> I've got a simple select query that runs every 10 minutes in order
> to update data in some external rrds (it lets us make pretty graphs
> and so forth). This has been working fine for months on end, when
> suddenly yesterday the badness happened. For some reason, this same
> query that normally takes a couple seconds has now been stuck
> running for over 24 hours, maxing the CPU and generally slowing
> other queries down.
>
> The external script that initiates the query has been restarted,
> and netstat no longer shows that connection. All subsequent calls
> of the same query are quick as usual, but the renegade process
> lingers on, unresponsive to signals. Some of the things I've tried
> so far (unsuccessfully):
>
> 1. I've tried killing the process using kill from the command-line
> (INT, TERM and HUP), as well as using pg_cancel_backend() via psql.
> 2. I've tried attaching gdb to the renegade process to see what
> it's doing, but that hangs, forcing me to kill gdb (no problems
> attaching to other postgres processes however).
>
> Any other ideas? I'd like to avoid doing a kill -9 if at all
> possible. The machine is debian (sarge) running postgres 8.1.

There are a lot of parts of the code that don't check for signals,
because normally they don't run for any real length of time... until
they do. :) The factorial calculation is an example that was recently
fixed. So it's possible that something in your query is in that same
condition. You may be stuck with a kill -9, but it would be good to
identify what part of the code is hung up so we can determine if it
makes sense to add signal handling.
--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)



Re: rogue process maxing cpu and unresponsive to signals

From
Tom Lane
Date:
Jon Jensen <jon@jenseng.com> writes:
> 1. I've tried killing the process using kill from the command-line (INT,
> TERM and HUP), as well as using pg_cancel_backend() via psql.
> 2. I've tried attaching gdb to the renegade process to see what it's
> doing, but that hangs, forcing me to kill gdb (no problems attaching to
> other postgres processes however).

> Any other ideas? I'd like to avoid doing a kill -9 if at all possible.
> The machine is debian (sarge) running postgres 8.1.

I think you'll find that "kill -9" doesn't do anything either, and
the only recourse is a system reboot.  What this sounds like to me
is that the kernel has gotten wedged trying to perform some operation
or other on behalf of that process.  Problems like a stuck disk I/O
request are often found to result in unkillable, un-attachable processes.

How up-to-date is your kernel?  Seen any signs of hardware problems
lately?

            regards, tom lane
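
One way to tell the two theories in this thread apart (a user-space loop that never checks for signals versus a process wedged inside the kernel) is to look at the process state and CPU counters in /proc/<pid>/stat. The following is a minimal sketch in Python, assuming a Linux /proc filesystem; `parse_stat` and `read_stat` are hypothetical helper names, not part of PostgreSQL or any tool mentioned here.

```python
def parse_stat(stat_text):
    """Extract state, utime and stime from the text of /proc/<pid>/stat.

    Field 2 (the command name) is wrapped in parentheses and may itself
    contain spaces or parentheses, so split on the last ')' first.
    """
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    return {
        "state": rest[0],        # field 3: R, S, D (uninterruptible), Z, ...
        "utime": int(rest[11]),  # field 14: clock ticks spent in user mode
        "stime": int(rest[12]),  # field 15: clock ticks spent in kernel mode
    }


def read_stat(pid):
    """Read and parse /proc/<pid>/stat for a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_stat(f.read())
```

Sampling `read_stat(pid)` twice a few seconds apart gives a rough hint: a climbing utime suggests a loop in user space, while a state of D or a climbing stime points toward the kernel.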

Re: rogue process maxing cpu and unresponsive to signals

From
Jon Jensen
Date:
> I think you'll find that "kill -9" doesn't do anything either, and
> the only recourse is a system reboot.  What this sounds like to me
> is that the kernel has gotten wedged trying to perform some operation
> or other on behalf of that process.  Problems like a stuck disk I/O
> request are often found to result in unkillable, un-attachable processes.
>
> How up-to-date is your kernel?  Seen any signs of hardware problems
> lately?
>
>             regards, tom lane
>

The kernel is 2.6.15-1-686-smp, and we haven't seen any other problems
on this machine till now.

I was hoping there was some other solution, but you are probably
right... I'll investigate some more to see if I can find any other
workarounds, but I'll probably just have to bite the bullet here pretty
soon.

Jon

Re: rogue process maxing cpu and unresponsive to signals

From
Jon Jensen
Date:
> There are a lot of parts of the code that don't check for signals,
> because normally they don't run for any real length of time... until
> they do. :) The factorial calculation is an example that was recently
> fixed. So it's possible that something in your query is in that same
> condition. You may be stuck with a kill -9, but it would be good to
> identify what part of the code is hung up so we can determine if it
> makes sense to add signal handling.

Yeah, if there's any information I can furnish that would help determine
where it's getting stuck, let me know. I'm not really sure how to gain
any visibility into what the process is doing at this point.

Jon

Re: rogue process maxing cpu and unresponsive to signals

From
Tom Lane
Date:
Jon Jensen <jon@jenseng.com> writes:
> Yeah, if there's any information I can furnish that would help determine
> where it's getting stuck, let me know. I'm not really sure how to gain
> any visibility into what the process is doing at this point.

If my theory is right that it's stuck in a kernel call, then *maybe*
"strace -p pid" would show the suspended call.  Or maybe not.  Probably
worth trying if you haven't rebooted yet.

            regards, tom lane
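
If strace cannot attach, the kernel may still expose where the process is waiting through /proc/<pid>/wchan. The sketch below is an assumption-laden illustration in Python, not something suggested in the thread: `read_wchan` is a hypothetical helper, the `proc_root` parameter exists only to make it testable, and the content of the file depends on the kernel version and configuration.

```python
import os


def read_wchan(pid, proc_root="/proc"):
    """Return the contents of <proc_root>/<pid>/wchan, or None if unreadable.

    On Linux this file typically holds the name of the kernel function the
    process is sleeping in, or "0" when it is not waiting in the kernel.
    """
    path = os.path.join(proc_root, str(pid), "wchan")
    try:
        with open(path) as f:
            name = f.read().strip()
    except OSError:
        return None  # no such process, no permission, or no wchan support
    return name or None
```

A function name related to disk I/O here would be consistent with the stuck I/O request theory, though it would not prove it.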

Re: rogue process maxing cpu and unresponsive to signals

From
Jon Jensen
Date:
> If my theory is right that it's stuck in a kernel call, then *maybe*
> "strace -p pid" would show the suspended call.  Or maybe not.  Probably
> worth trying if you haven't rebooted yet.

Tom and Jim,

Yeah, no luck on the strace either... ended up having to kill -9 it, so
we may never know the root cause. Thanks for the help/suggestions though.

Jon