Thread: killing processes

killing processes

From

David Kerr

Date:

20 July 2009, 19:44:15

What's the generally accepted method for killing processes that went 'all wacky' in postgres?

I think i've seen in this group that kill -INT would be the way to go.

I'm playing around with different options for a median function. this one got out of hand
and was taking too long, so i wanted to kill it:

test=# select array_median(array(select t1 from test2 order by 1));
^CCancel request sent

It just sits there, it's been trying to die for 1/2 an hour.

At the OS it's taking up 100% of the CPU.

I tried kill -INT <pid> but that didn't help.

It's not updating anything, and i'm the only one in the database.

Fortunatly it's not production, so I don't really care. But if it was production, what would
be the method to kill it?  (I know about kill -9, i'm assuming that == bad)

If this were production, I'd need to end the process, force a rollback (if necessary) and get
my CPU back so "just waiting for it to die" really isn't an option...

(PostgreSQL 8.3.5, linux/SLES11)

Thanks

Dave

Re: killing processes

From

Scott Marlowe

Date:

20 July 2009, 19:50:52

On Mon, Jul 20, 2009 at 4:44 PM, David Kerr<dmk@mr-paradox.net> wrote:
> What's the generally accepted method for killing processes that went 'all wacky' in postgres?
>
> I think i've seen in this group that kill -INT would be the way to go.
>
> I'm playing around with different options for a median function. this one got out of hand
> and was taking too long, so i wanted to kill it:
>
> test=# select array_median(array(select t1 from test2 order by 1));
> ^CCancel request sent
>
> It just sits there, it's been trying to die for 1/2 an hour.
>
> At the OS it's taking up 100% of the CPU.
>
> I tried kill -INT <pid> but that didn't help.
>
> It's not updating anything, and i'm the only one in the database.
>
> Fortunatly it's not production, so I don't really care. But if it was production, what would
> be the method to kill it?  (I know about kill -9, i'm assuming that == bad)
>
> If this were production, I'd need to end the process, force a rollback (if necessary) and get
> my CPU back so "just waiting for it to die" really isn't an option...
>
> (PostgreSQL 8.3.5, linux/SLES11)

What's most likely happening is that it's stuck in a tight loop that
doesn't check for interrupts, so it just keeps on going.

You can kill -9 a process.  It'll cause the postmaster to kill all
backends and flush the buffers if I remember correctly.  Yeah, not the
ideal solution in production but on a non-prod machine it's an ok way
to get out of these issues.  And even in production, it's often much
faster to kill -9 a single process than to wait for it to finish.  I
think there might be choices other than -9 here, but I can't recall
them off the top of my head.

Re: killing processes

From

Merlin Moncure

Date:

20 July 2009, 20:18:19

On Mon, Jul 20, 2009 at 6:48 PM, Scott Marlowe<scott.marlowe@gmail.com> wrote:
> On Mon, Jul 20, 2009 at 4:44 PM, David Kerr<dmk@mr-paradox.net> wrote:
>> What's the generally accepted method for killing processes that went 'all wacky' in postgres?
>>
>> I think i've seen in this group that kill -INT would be the way to go.
>>
>> I'm playing around with different options for a median function. this one got out of hand
>> and was taking too long, so i wanted to kill it:
>>
>> test=# select array_median(array(select t1 from test2 order by 1));
>> ^CCancel request sent
>>
>> It just sits there, it's been trying to die for 1/2 an hour.
>>
>> At the OS it's taking up 100% of the CPU.
>>
>> I tried kill -INT <pid> but that didn't help.
>>
>> It's not updating anything, and i'm the only one in the database.
>>
>> Fortunatly it's not production, so I don't really care. But if it was production, what would
>> be the method to kill it?  (I know about kill -9, i'm assuming that == bad)
>>
>> If this were production, I'd need to end the process, force a rollback (if necessary) and get
>> my CPU back so "just waiting for it to die" really isn't an option...
>>
>> (PostgreSQL 8.3.5, linux/SLES11)
>
> What's most likely happening is that it's stuck in a tight loop that
> doesn't check for interrupts, so it just keeps on going.
>
> You can kill -9 a process.  It'll cause the postmaster to kill all
> backends and flush the buffers if I remember correctly.

That's right -- and if there is any case where built in functions,
sql, or pgsql functions get stuck in such a way that a kill -9 is
required, it should be reported so that it can be fixed.  Trust me,
you should avoid kill -9 if at all possible.  Backend C functions you
write should check interrupts at appropriate places so you can respond
to cancels appropriately with CHECK_FOR_INTERRUPTS();

A good example if how this is done, take a look at ITAGAKI's proposed
enhancement of dblink and the attached patch here:

http://www.nabble.com/query-cancel-issues-in-contrib-dblink-td24214133.html

merlin

Re: killing processes

From

Tom Lane

Date:

20 July 2009, 20:26:36

Scott Marlowe <scott.marlowe@gmail.com> writes:
> On Mon, Jul 20, 2009 at 4:44 PM, David Kerr<dmk@mr-paradox.net> wrote:
>> I'm playing around with different options for a median function. this one got out of hand
>> and was taking too long, so i wanted to kill it:
>>
>> test=# select array_median(array(select t1 from test2 order by 1));
>> ^CCancel request sent
>>
>> It just sits there, it's been trying to die for 1/2 an hour.

> What's most likely happening is that it's stuck in a tight loop that
> doesn't check for interrupts, so it just keeps on going.

Yeah, that was my first thought.  What is array_median()?  If it's
custom C code, you probably forgot to put CHECK_FOR_INTERRUPTS in
suitable place(s).

> You can kill -9 a process.  It'll cause the postmaster to kill all
> backends and flush the buffers if I remember correctly.  Yeah, not the
> ideal solution in production but on a non-prod machine it's an ok way
> to get out of these issues.  And even in production, it's often much
> faster to kill -9 a single process than to wait for it to finish.  I
> think there might be choices other than -9 here, but I can't recall
> them off the top of my head.

Kill -9 on the postmaster is disrecommended for a number of reasons,
but kill -9 on an individual backend is quite predictable.  What it
will do is cause *all* of your open sessions to get aborted.  If you
can tolerate that then do it.

If you can't tolerate that, you can try a SIGTERM, but I suspect
that if the thing is too hosed to notice a cancel (a/k/a SIGINT)
then it won't notice SIGTERM either.  SIGTERM on a single backend
isn't as safe as the other options anyhow.  There used to be bugs
in that code path (still ARE known bugs, if you're not running a
pretty recent PG release) and I wouldn't trust it totally.

One thing you should definitely not try is SIGQUIT.  That will
kill the process all right (unless it's wedged a lot worse than
I think) but it will not do anything about cleaning up its shared
memory state, and that almost certainly will cause problems later.
Better to find a time when it's okay to use SIGKILL.

            regards, tom lane

Re: killing processes

From

David Kerr

Date:

20 July 2009, 20:33:37

On Mon, Jul 20, 2009 at 07:18:07PM -0400, Merlin Moncure wrote:
- On Mon, Jul 20, 2009 at 6:48 PM, Scott Marlowe<scott.marlowe@gmail.com> wrote:
- > On Mon, Jul 20, 2009 at 4:44 PM, David Kerr<dmk@mr-paradox.net> wrote:
- > What's most likely happening is that it's stuck in a tight loop that
- > doesn't check for interrupts, so it just keeps on going.
- >
- > You can kill -9 a process.  It'll cause the postmaster to kill all
- > backends and flush the buffers if I remember correctly.
-
- That's right -- and if there is any case where built in functions,
- sql, or pgsql functions get stuck in such a way that a kill -9 is
- required, it should be reported so that it can be fixed.  Trust me,
- you should avoid kill -9 if at all possible.  Backend C functions you
- write should check interrupts at appropriate places so you can respond
- to cancels appropriately with CHECK_FOR_INTERRUPTS();
-
- A good example if how this is done, take a look at ITAGAKI's proposed
- enhancement of dblink and the attached patch here:
-
- http://www.nabble.com/query-cancel-issues-in-contrib-dblink-td24214133.html

Thanks guys.

I wonder if this is a case for a bug, of course, it's probably not reproduceable.
But, i don't see any coded loop or way for me to insert a signal check. (I'm not much of a
programmer) the function was just:

CREATE OR REPLACE FUNCTION array_median(anyarray)
RETURNS anyelement AS
$$
SELECT CASE
WHEN array_upper($1,1) = 0 THEN null
WHEN mod(array_upper($1,1),2) = 1 THEN
asorted[ceiling(array_upper(asorted,1)/2.0)]
ELSE
((asorted[ceiling(array_upper(asorted,1)/2.0)] + asorted[ceiling(array_upper(asorted,1)/2.0)+1])/2.0) END
FROM (SELECT ARRAY(SELECT ($1)[n] FROM
generate_series(1, array_upper($1, 1)) AS n
WHERE ($1)[n] IS NOT NULL
ORDER BY ($1)[n]
) As asorted) As foo ;
$$
LANGUAGE 'sql' IMMUTABLE;

(got it from here - thanks to the author =) )
http://www.postgresonline.com/journal/index.php?/archives/67-Build-Median-Aggregate-Function-in-SQL.html

Thanks

Dave

Re: killing processes

From

Tom Lane

Date:

21 July 2009, 00:14:36

David Kerr <dmk@mr-paradox.net> writes:
> But, i don't see any coded loop or way for me to insert a signal check. (I'm not much of a
> programmer) the function was just:

> CREATE OR REPLACE FUNCTION array_median(anyarray)
> RETURNS anyelement AS
> $$
> SELECT CASE
> WHEN array_upper($1,1) = 0 THEN null
> WHEN mod(array_upper($1,1),2) = 1 THEN
> asorted[ceiling(array_upper(asorted,1)/2.0)]
> ELSE
> ((asorted[ceiling(array_upper(asorted,1)/2.0)] + asorted[ceiling(array_upper(asorted,1)/2.0)+1])/2.0) END
> FROM (SELECT ARRAY(SELECT ($1)[n] FROM
> generate_series(1, array_upper($1, 1)) AS n
> WHERE ($1)[n] IS NOT NULL
> ORDER BY ($1)[n]
> ) As asorted) As foo ;
> $$
> LANGUAGE 'sql' IMMUTABLE;

Huh.  How big an array were you trying to invoke it on?

            regards, tom lane

Re: killing processes

From

David Kerr

Date:

21 July 2009, 12:47:43

On Mon, Jul 20, 2009 at 11:14:22PM -0400, Tom Lane wrote:
- David Kerr <dmk@mr-paradox.net> writes:
- > But, i don't see any coded loop or way for me to insert a signal check. (I'm not much of a
- > programmer) the function was just:
-
- > CREATE OR REPLACE FUNCTION array_median(anyarray)
- > RETURNS anyelement AS
- > $$
- > SELECT CASE
- > WHEN array_upper($1,1) = 0 THEN null
- > WHEN mod(array_upper($1,1),2) = 1 THEN
- > asorted[ceiling(array_upper(asorted,1)/2.0)]
- > ELSE
- > ((asorted[ceiling(array_upper(asorted,1)/2.0)] + asorted[ceiling(array_upper(asorted,1)/2.0)+1])/2.0) END
- > FROM (SELECT ARRAY(SELECT ($1)[n] FROM
- > generate_series(1, array_upper($1, 1)) AS n
- > WHERE ($1)[n] IS NOT NULL
- > ORDER BY ($1)[n]
- > ) As asorted) As foo ;
- > $$
- > LANGUAGE 'sql' IMMUTABLE;
-
- Huh.  How big an array were you trying to invoke it on?
-
-             regards, tom lane
-

there are 899991 records in the table it's just

test=# \d test2
       Table "public.test2"
 Column |     Type     | Modifiers
--------+--------------+-----------
 t1     | numeric(9,4) |


Thanks

Dave

Re: killing processes

From

Tom Lane

Date:

21 July 2009, 14:16:42

David Kerr <dmk@mr-paradox.net> writes:
> On Mon, Jul 20, 2009 at 11:14:22PM -0400, Tom Lane wrote:
> - Huh.  How big an array were you trying to invoke it on?

> there are 899991 records in the table it's just

> test=# \d test2
>        Table "public.test2"
>  Column |     Type     | Modifiers
> --------+--------------+-----------
>  t1     | numeric(9,4) |

I tried it on a table with 899991 random values.  It took frickin'
forever, but seemed to be willing to respond to cancels anywhere
along the line.  I'm not sure why you're seeing differently.

(The reason it takes forever is that numeric is a variable-width
type, and access into a varwidth array is O(n), so the sorting
step you've got here is O(n^2).  It might help to use unnest()
instead of this handmade version of it ...)

            regards, tom lane

Re: killing processes

From

David Kerr

Date:

21 July 2009, 14:47:16

On Tue, Jul 21, 2009 at 01:13:18PM -0400, Tom Lane wrote:
- David Kerr <dmk@mr-paradox.net> writes:
- I tried it on a table with 899991 random values.  It took frickin'
- forever, but seemed to be willing to respond to cancels anywhere
- along the line.  I'm not sure why you're seeing differently.

Hehe, yeah. For me I let it run 10 min and hit ^C so maybe i just
hit it at a bad place.

- (The reason it takes forever is that numeric is a variable-width
- type, and access into a varwidth array is O(n), so the sorting
- step you've got here is O(n^2).  It might help to use unnest()
- instead of this handmade version of it ...)

unnest() is 8.4 only, right?

I'm actually probably just going to install R and use the median
function from that.  (I was hoping to avoid installing all of R)

Or maybe i'll try my hand at a perl one and see if that gives
ok performance.

Thanks!

Dave

Re: killing processes

From

"Chris Spotts"

Date:

21 July 2009, 15:02:00

> - type, and access into a varwidth array is O(n), so the sorting
> - step you've got here is O(n^2).  It might help to use unnest()
> - instead of this handmade version of it ...)
>
> unnest() is 8.4 only, right?
>
> I'm actually probably just going to install R and use the median
> function from that.  (I was hoping to avoid installing all of R)
>
[Spotts, Christopher]
We have large data numeric data sets that we do medians over and due to
performance reasons, we eventually found it useful to take a smaller random
sample and median that...it was close enough for us.  Just a thought.

Chris

Re: killing processes

From

Merlin Moncure

Date:

21 July 2009, 15:24:03

On Tue, Jul 21, 2009 at 1:47 PM, David Kerr<dmk@mr-paradox.net> wrote:
> On Tue, Jul 21, 2009 at 01:13:18PM -0400, Tom Lane wrote:
> - David Kerr <dmk@mr-paradox.net> writes:
> - I tried it on a table with 899991 random values.  It took frickin'
> - forever, but seemed to be willing to respond to cancels anywhere
> - along the line.  I'm not sure why you're seeing differently.
>
> Hehe, yeah. For me I let it run 10 min and hit ^C so maybe i just
> hit it at a bad place.
>
> - (The reason it takes forever is that numeric is a variable-width
> - type, and access into a varwidth array is O(n), so the sorting
> - step you've got here is O(n^2).  It might help to use unnest()
> - instead of this handmade version of it ...)
>
> unnest() is 8.4 only, right?

There is an undocumented version in earlier versions that's quite similar:
information_schema._pg_expandarray (it's an sql function, not C, so
you should be able examine the source in order to roll your own if you
need to).

merlin