Thread: killing processes
What's the generally accepted method for killing processes that went 'all wacky' in postgres? I think i've seen in this group that kill -INT would be the way to go. I'm playing around with different options for a median function. this one got out of hand and was taking too long, so i wanted to kill it: test=# select array_median(array(select t1 from test2 order by 1)); ^CCancel request sent It just sits there, it's been trying to die for 1/2 an hour. At the OS it's taking up 100% of the CPU. I tried kill -INT <pid> but that didn't help. It's not updating anything, and i'm the only one in the database. Fortunatly it's not production, so I don't really care. But if it was production, what would be the method to kill it? (I know about kill -9, i'm assuming that == bad) If this were production, I'd need to end the process, force a rollback (if necessary) and get my CPU back so "just waiting for it to die" really isn't an option... (PostgreSQL 8.3.5, linux/SLES11) Thanks Dave
On Mon, Jul 20, 2009 at 4:44 PM, David Kerr<dmk@mr-paradox.net> wrote: > What's the generally accepted method for killing processes that went 'all wacky' in postgres? > > I think i've seen in this group that kill -INT would be the way to go. > > I'm playing around with different options for a median function. this one got out of hand > and was taking too long, so i wanted to kill it: > > test=# select array_median(array(select t1 from test2 order by 1)); > ^CCancel request sent > > It just sits there, it's been trying to die for 1/2 an hour. > > At the OS it's taking up 100% of the CPU. > > I tried kill -INT <pid> but that didn't help. > > It's not updating anything, and i'm the only one in the database. > > Fortunatly it's not production, so I don't really care. But if it was production, what would > be the method to kill it? (I know about kill -9, i'm assuming that == bad) > > If this were production, I'd need to end the process, force a rollback (if necessary) and get > my CPU back so "just waiting for it to die" really isn't an option... > > (PostgreSQL 8.3.5, linux/SLES11) What's most likely happening is that it's stuck in a tight loop that doesn't check for interrupts, so it just keeps on going. You can kill -9 a process. It'll cause the postmaster to kill all backends and flush the buffers if I remember correctly. Yeah, not the ideal solution in production but on a non-prod machine it's an ok way to get out of these issues. And even in production, it's often much faster to kill -9 a single process than to wait for it to finish. I think there might be choices other than -9 here, but I can't recall them off the top of my head.
On Mon, Jul 20, 2009 at 6:48 PM, Scott Marlowe<scott.marlowe@gmail.com> wrote: > On Mon, Jul 20, 2009 at 4:44 PM, David Kerr<dmk@mr-paradox.net> wrote: >> What's the generally accepted method for killing processes that went 'all wacky' in postgres? >> >> I think i've seen in this group that kill -INT would be the way to go. >> >> I'm playing around with different options for a median function. this one got out of hand >> and was taking too long, so i wanted to kill it: >> >> test=# select array_median(array(select t1 from test2 order by 1)); >> ^CCancel request sent >> >> It just sits there, it's been trying to die for 1/2 an hour. >> >> At the OS it's taking up 100% of the CPU. >> >> I tried kill -INT <pid> but that didn't help. >> >> It's not updating anything, and i'm the only one in the database. >> >> Fortunatly it's not production, so I don't really care. But if it was production, what would >> be the method to kill it? (I know about kill -9, i'm assuming that == bad) >> >> If this were production, I'd need to end the process, force a rollback (if necessary) and get >> my CPU back so "just waiting for it to die" really isn't an option... >> >> (PostgreSQL 8.3.5, linux/SLES11) > > What's most likely happening is that it's stuck in a tight loop that > doesn't check for interrupts, so it just keeps on going. > > You can kill -9 a process. It'll cause the postmaster to kill all > backends and flush the buffers if I remember correctly. That's right -- and if there is any case where built in functions, sql, or pgsql functions get stuck in such a way that a kill -9 is required, it should be reported so that it can be fixed. Trust me, you should avoid kill -9 if at all possible. Backend C functions you write should check interrupts at appropriate places so you can respond to cancels appropriately with CHECK_FOR_INTERRUPTS(); A good example if how this is done, take a look at ITAGAKI's proposed enhancement of dblink and the attached patch here: http://www.nabble.com/query-cancel-issues-in-contrib-dblink-td24214133.html merlin
Scott Marlowe <scott.marlowe@gmail.com> writes: > On Mon, Jul 20, 2009 at 4:44 PM, David Kerr<dmk@mr-paradox.net> wrote: >> I'm playing around with different options for a median function. this one got out of hand >> and was taking too long, so i wanted to kill it: >> >> test=# select array_median(array(select t1 from test2 order by 1)); >> ^CCancel request sent >> >> It just sits there, it's been trying to die for 1/2 an hour. > What's most likely happening is that it's stuck in a tight loop that > doesn't check for interrupts, so it just keeps on going. Yeah, that was my first thought. What is array_median()? If it's custom C code, you probably forgot to put CHECK_FOR_INTERRUPTS in suitable place(s). > You can kill -9 a process. It'll cause the postmaster to kill all > backends and flush the buffers if I remember correctly. Yeah, not the > ideal solution in production but on a non-prod machine it's an ok way > to get out of these issues. And even in production, it's often much > faster to kill -9 a single process than to wait for it to finish. I > think there might be choices other than -9 here, but I can't recall > them off the top of my head. Kill -9 on the postmaster is disrecommended for a number of reasons, but kill -9 on an individual backend is quite predictable. What it will do is cause *all* of your open sessions to get aborted. If you can tolerate that then do it. If you can't tolerate that, you can try a SIGTERM, but I suspect that if the thing is too hosed to notice a cancel (a/k/a SIGINT) then it won't notice SIGTERM either. SIGTERM on a single backend isn't as safe as the other options anyhow. There used to be bugs in that code path (still ARE known bugs, if you're not running a pretty recent PG release) and I wouldn't trust it totally. One thing you should definitely not try is SIGQUIT. That will kill the process all right (unless it's wedged a lot worse than I think) but it will not do anything about cleaning up its shared memory state, and that almost certainly will cause problems later. Better to find a time when it's okay to use SIGKILL. regards, tom lane
On Mon, Jul 20, 2009 at 07:18:07PM -0400, Merlin Moncure wrote: - On Mon, Jul 20, 2009 at 6:48 PM, Scott Marlowe<scott.marlowe@gmail.com> wrote: - > On Mon, Jul 20, 2009 at 4:44 PM, David Kerr<dmk@mr-paradox.net> wrote: - > What's most likely happening is that it's stuck in a tight loop that - > doesn't check for interrupts, so it just keeps on going. - > - > You can kill -9 a process. It'll cause the postmaster to kill all - > backends and flush the buffers if I remember correctly. - - That's right -- and if there is any case where built in functions, - sql, or pgsql functions get stuck in such a way that a kill -9 is - required, it should be reported so that it can be fixed. Trust me, - you should avoid kill -9 if at all possible. Backend C functions you - write should check interrupts at appropriate places so you can respond - to cancels appropriately with CHECK_FOR_INTERRUPTS(); - - A good example if how this is done, take a look at ITAGAKI's proposed - enhancement of dblink and the attached patch here: - - http://www.nabble.com/query-cancel-issues-in-contrib-dblink-td24214133.html Thanks guys. I wonder if this is a case for a bug, of course, it's probably not reproduceable. But, i don't see any coded loop or way for me to insert a signal check. (I'm not much of a programmer) the function was just: CREATE OR REPLACE FUNCTION array_median(anyarray) RETURNS anyelement AS $$ SELECT CASE WHEN array_upper($1,1) = 0 THEN null WHEN mod(array_upper($1,1),2) = 1 THEN asorted[ceiling(array_upper(asorted,1)/2.0)] ELSE ((asorted[ceiling(array_upper(asorted,1)/2.0)] + asorted[ceiling(array_upper(asorted,1)/2.0)+1])/2.0) END FROM (SELECT ARRAY(SELECT ($1)[n] FROM generate_series(1, array_upper($1, 1)) AS n WHERE ($1)[n] IS NOT NULL ORDER BY ($1)[n] ) As asorted) As foo ; $$ LANGUAGE 'sql' IMMUTABLE; (got it from here - thanks to the author =) ) http://www.postgresonline.com/journal/index.php?/archives/67-Build-Median-Aggregate-Function-in-SQL.html Thanks Dave
David Kerr <dmk@mr-paradox.net> writes: > But, i don't see any coded loop or way for me to insert a signal check. (I'm not much of a > programmer) the function was just: > CREATE OR REPLACE FUNCTION array_median(anyarray) > RETURNS anyelement AS > $$ > SELECT CASE > WHEN array_upper($1,1) = 0 THEN null > WHEN mod(array_upper($1,1),2) = 1 THEN > asorted[ceiling(array_upper(asorted,1)/2.0)] > ELSE > ((asorted[ceiling(array_upper(asorted,1)/2.0)] + asorted[ceiling(array_upper(asorted,1)/2.0)+1])/2.0) END > FROM (SELECT ARRAY(SELECT ($1)[n] FROM > generate_series(1, array_upper($1, 1)) AS n > WHERE ($1)[n] IS NOT NULL > ORDER BY ($1)[n] > ) As asorted) As foo ; > $$ > LANGUAGE 'sql' IMMUTABLE; Huh. How big an array were you trying to invoke it on? regards, tom lane
On Mon, Jul 20, 2009 at 11:14:22PM -0400, Tom Lane wrote: - David Kerr <dmk@mr-paradox.net> writes: - > But, i don't see any coded loop or way for me to insert a signal check. (I'm not much of a - > programmer) the function was just: - - > CREATE OR REPLACE FUNCTION array_median(anyarray) - > RETURNS anyelement AS - > $$ - > SELECT CASE - > WHEN array_upper($1,1) = 0 THEN null - > WHEN mod(array_upper($1,1),2) = 1 THEN - > asorted[ceiling(array_upper(asorted,1)/2.0)] - > ELSE - > ((asorted[ceiling(array_upper(asorted,1)/2.0)] + asorted[ceiling(array_upper(asorted,1)/2.0)+1])/2.0) END - > FROM (SELECT ARRAY(SELECT ($1)[n] FROM - > generate_series(1, array_upper($1, 1)) AS n - > WHERE ($1)[n] IS NOT NULL - > ORDER BY ($1)[n] - > ) As asorted) As foo ; - > $$ - > LANGUAGE 'sql' IMMUTABLE; - - Huh. How big an array were you trying to invoke it on? - - regards, tom lane - there are 899991 records in the table it's just test=# \d test2 Table "public.test2" Column | Type | Modifiers --------+--------------+----------- t1 | numeric(9,4) | Thanks Dave
David Kerr <dmk@mr-paradox.net> writes: > On Mon, Jul 20, 2009 at 11:14:22PM -0400, Tom Lane wrote: > - Huh. How big an array were you trying to invoke it on? > there are 899991 records in the table it's just > test=# \d test2 > Table "public.test2" > Column | Type | Modifiers > --------+--------------+----------- > t1 | numeric(9,4) | I tried it on a table with 899991 random values. It took frickin' forever, but seemed to be willing to respond to cancels anywhere along the line. I'm not sure why you're seeing differently. (The reason it takes forever is that numeric is a variable-width type, and access into a varwidth array is O(n), so the sorting step you've got here is O(n^2). It might help to use unnest() instead of this handmade version of it ...) regards, tom lane
On Tue, Jul 21, 2009 at 01:13:18PM -0400, Tom Lane wrote: - David Kerr <dmk@mr-paradox.net> writes: - I tried it on a table with 899991 random values. It took frickin' - forever, but seemed to be willing to respond to cancels anywhere - along the line. I'm not sure why you're seeing differently. Hehe, yeah. For me I let it run 10 min and hit ^C so maybe i just hit it at a bad place. - (The reason it takes forever is that numeric is a variable-width - type, and access into a varwidth array is O(n), so the sorting - step you've got here is O(n^2). It might help to use unnest() - instead of this handmade version of it ...) unnest() is 8.4 only, right? I'm actually probably just going to install R and use the median function from that. (I was hoping to avoid installing all of R) Or maybe i'll try my hand at a perl one and see if that gives ok performance. Thanks! Dave
> - type, and access into a varwidth array is O(n), so the sorting > - step you've got here is O(n^2). It might help to use unnest() > - instead of this handmade version of it ...) > > unnest() is 8.4 only, right? > > I'm actually probably just going to install R and use the median > function from that. (I was hoping to avoid installing all of R) > [Spotts, Christopher] We have large data numeric data sets that we do medians over and due to performance reasons, we eventually found it useful to take a smaller random sample and median that...it was close enough for us. Just a thought. Chris
On Tue, Jul 21, 2009 at 1:47 PM, David Kerr<dmk@mr-paradox.net> wrote: > On Tue, Jul 21, 2009 at 01:13:18PM -0400, Tom Lane wrote: > - David Kerr <dmk@mr-paradox.net> writes: > - I tried it on a table with 899991 random values. It took frickin' > - forever, but seemed to be willing to respond to cancels anywhere > - along the line. I'm not sure why you're seeing differently. > > Hehe, yeah. For me I let it run 10 min and hit ^C so maybe i just > hit it at a bad place. > > - (The reason it takes forever is that numeric is a variable-width > - type, and access into a varwidth array is O(n), so the sorting > - step you've got here is O(n^2). It might help to use unnest() > - instead of this handmade version of it ...) > > unnest() is 8.4 only, right? There is an undocumented version in earlier versions that's quite similar: information_schema._pg_expandarray (it's an sql function, not C, so you should be able examine the source in order to roll your own if you need to). merlin