Thread: query looping?
The query shown below [select count(distinct...] seems to be looping (99-101% CPU as shown by top for 11+ hours). This using postgres 8.3.5 on a dual quad core machine (Intel(R) Xeon(R) CPU X5460 @ 3.16GHz) with 32G RAM. Can I provide any other info to help investigate this issue? Or any thoughts on how to prevent/avoid it? Thanks, Brian top - 11:03:39 up 91 days, 22:39, 2 users, load average: 3.73, 2.14, 1.42 Tasks: 135 total, 3 running, 132 sleeping, 0 stopped, 0 zombie Cpu(s): 27.3% us, 7.7% sy, 0.0% ni, 54.0% id, 11.0% wa, 0.0% hi, 0.0% si Mem: 33264272k total, 33247780k used, 16492k free, 17232k buffers Swap: 4088532k total, 334264k used, 3754268k free, 26760304k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 24585 postgres 17 0 572m 494m 484m R 99 1.5 646:13.63 postmaster cemdb=# select procpid,query_start,current_query from pg_stat_activity; procpid | query_start | current_query ---------+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 13210 | 2010-01-04 10:54:04.490107-08 | <IDLE> 24496 | 2010-01-04 10:56:14.982997-08 | <IDLE> 30636 | 2010-01-04 10:54:04.488569-08 | <IDLE> 5309 | 2010-01-04 10:56:22.850159-08 | select procpid,query_start,current_query from pg_stat_activity; 30637 | 2010-01-04 10:54:04.490152-08 | <IDLE> 24500 | 2010-01-04 10:56:14.98354-08 | <IDLE> 13213 | 2010-01-04 10:54:04.491743-08 | <IDLE> 13214 | 2010-01-04 10:56:04.197887-08 | <IDLE> 24499 | 2010-01-04 10:56:14.981709-08 | <IDLE> 24502 | 2010-01-04 10:56:14.985027-08 | <IDLE> 13217 | 2010-01-04 10:54:04.487039-08 | <IDLE> 24504 | 2010-01-04 10:56:14.984631-08 | <IDLE> 24505 | 2010-01-04 10:56:14.985248-08 | <IDLE> 27938 | 2010-01-04 10:54:04.485782-08 | <IDLE> 1104 | 2010-01-04 10:54:04.489896-08 | <IDLE> 27941 | 2010-01-04 10:54:04.491925-08 | <IDLE> 24585 | 2010-01-04 00:16:52.764899-08 | select count(distinct b.ts_id) from ts_stats_transetgroup_user_weekly b, ts_stats_transet_user_interval c, ts_transetgroup_transets_map m where b.ts_transet_group_id = m.ts_transet_group_id and m.ts_transet_incarnation_id = c.ts_transet_incarnation_id and c.ts_user_incarnation_id = b.ts_user_incarnation_id and c.ts_interval_start_time >= $1 and c.ts_interval_start_time < $2 and b.ts_interval_start_time >= $3 and b.ts_interval_start_time < $4 24586 | 2010-01-04 10:56:14.986201-08 | <IDLE> 13224 | 2010-01-04 10:54:04.487762-08 | <IDLE> 24591 | 2010-01-04 10:56:14.98333-08 | <IDLE> 24592 | 2010-01-04 10:56:14.98157-08 | <IDLE> (21 rows) cemdb=# select locktype,database,relation,virtualtransaction,mode,granted from pg_locks where pid=24585; locktype | database | relation | virtualtransaction | mode | granted ------------+----------+----------+--------------------+-----------------+--------- relation | 74755 | 76064 | 23/1037332 | AccessShareLock | t virtualxid | | | 23/1037332 | ExclusiveLock | t relation | 74755 | 75245 | 23/1037332 | AccessShareLock | t relation | 74755 | 76062 | 23/1037332 | AccessShareLock | t relation | 74755 | 76188 | 23/1037332 | AccessShareLock | t relation | 74755 | 74822 | 23/1037332 | AccessShareLock | t relation | 74755 | 76187 | 23/1037332 | AccessShareLock | t relation | 74755 | 76186 | 23/1037332 | AccessShareLock | t relation | 74755 | 76189 | 23/1037332 | AccessShareLock | t relation | 74755 | 76057 | 23/1037332 | AccessShareLock | t relation | 74755 | 75846 | 23/1037332 | AccessShareLock | t relation | 74755 | 76063 | 23/1037332 | AccessShareLock | t relation | 74755 | 76058 | 23/1037332 | AccessShareLock | t relation | 74755 | 76185 | 23/1037332 | AccessShareLock | t relation | 74755 | 75874 | 23/1037332 | AccessShareLock | t relation | 74755 | 76061 | 23/1037332 | AccessShareLock | t relation | 74755 | 76191 | 23/1037332 | AccessShareLock | t relation | 74755 | 76059 | 23/1037332 | AccessShareLock | t relation | 74755 | 76363 | 23/1037332 | AccessShareLock | t relation | 74755 | 76364 | 23/1037332 | AccessShareLock | t relation | 74755 | 76192 | 23/1037332 | AccessShareLock | t relation | 74755 | 76362 | 23/1037332 | AccessShareLock | t relation | 74755 | 76190 | 23/1037332 | AccessShareLock | t relation | 74755 | 75920 | 23/1037332 | AccessShareLock | t relation | 74755 | 75343 | 23/1037332 | AccessShareLock | t relation | 74755 | 76193 | 23/1037332 | AccessShareLock | t relation | 74755 | 76060 | 23/1037332 | AccessShareLock | t relation | 74755 | 76065 | 23/1037332 | AccessShareLock | t (28 rows) cemdb=> explain analyze select count(distinct b.ts_id) from ts_stats_transetgroup_user_weekly b, ts_stats_transet_user_interval c, ts_transetgroup_transets_map m where b.ts_transet_group_id = m.ts_transet_group_id and m.ts_transet_incarnation_id = c.ts_transet_incarnation_id and c.ts_user_incarnation_id = b.ts_user_incarnation_id and c.ts_interval_start_time >= '2010-01-03 00:00' and c.ts_interval_start_time < '2010-01-03 08:00' and b.ts_interval_start_time >= '2009-12-28 00:00' and b.ts_interval_start_time < '2010-01-04 00:00'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1726281.28..1726281.29 rows=1 width=8) (actual time=16690.541..16690.542 rows=1 loops=1) -> Hash Join (cost=43363.87..1725837.91 rows=354697 width=8) (actual time=1343.960..14467.012 rows=1600000 loops=1) Hash Cond: ((b.ts_transet_group_id = m.ts_transet_group_id) AND (c.ts_transet_incarnation_id = m.ts_transet_incarnation_id)) -> Hash Join (cost=43362.03..1717697.04 rows=1697479 width=24) (actual time=1343.778..11432.270 rows=1600000 loops=1) Hash Cond: (c.ts_user_incarnation_id = b.ts_user_incarnation_id) -> Bitmap Heap Scan on ts_stats_transet_user_interval c (cost=19014.73..1666918.08 rows=844436 width=16) (actual time=950.097..8125.102 rows=800000 loops=1) Recheck Cond: ((ts_interval_start_time >= '2010-01-03 00:00:00-08'::timestamp with time zone) AND (ts_interval_start_time < '2010-01-03 08:00:00-08'::timestamp with time zone)) -> Bitmap Index Scan on ts_stats_transet_user_interval_starttime (cost=0.00..18909.17 rows=844436 width=0) (actual time=930.036..930.036 rows=800000 loops=1) Index Cond: ((ts_interval_start_time >= '2010-01-03 00:00:00-08'::timestamp with time zone) AND (ts_interval_start_time < '2010-01-03 08:00:00-08'::timestamp with time zone)) -> Hash (cost=23787.37..23787.37 rows=89590 width=24) (actual time=393.570..393.570 rows=89758 loops=1) -> Seq Scan on ts_stats_transetgroup_user_weekly b (cost=0.00..23787.37 rows=89590 width=24) (actual time=0.040..295.414 rows=89758 loops=1) Filter: ((ts_interval_start_time >= '2009-12-28 00:00:00-08'::timestamp with time zone) AND (ts_interval_start_time < '2010-01-04 00:00:00-08'::timestamp with time zone)) -> Hash (cost=1.33..1.33 rows=67 width=16) (actual time=0.156..0.156 rows=67 loops=1) -> Seq Scan on ts_transetgroup_transets_map m (cost=0.00..1.33 rows=67 width=16) (actual time=0.022..0.080 rows=67 loops=1) Total runtime: 16691.964 ms (15 rows)
On Mon, Jan 4, 2010 at 2:41 PM, Brian Cox <brian.cox@ca.com> wrote: > The query shown below [select count(distinct...] seems to be looping > (99-101% CPU as shown by top for 11+ hours). This using postgres 8.3.5 on a > dual quad core machine (Intel(R) Xeon(R) CPU X5460 @ 3.16GHz) with 32G RAM. > Can I provide any other info to help investigate this issue? Or any thoughts > on how to prevent/avoid it? You posted an EXPLAIN ANALYZE showing almost the same query taking 17 seconds, but a crucial difference is that the EXPLAIN ANALYZE shows the query with the parameters actually in the query, whereas the query that's actually running for a long time uses bound parameters. PostgreSQL won't take the particular values into account in planning that version, which can sometimes lead to a markedly inferior plan. What do you get if you do this? PREPARE foo AS <the query, with the $x entries still in there> EXPLAIN EXECUTE foo(<the values>); ...Robert
On 01/04/2010 04:53 PM, Robert Haas [robertmhaas@gmail.com] wrote: > PREPARE foo AS <the query, with the $x entries still in there> > EXPLAIN EXECUTE foo(<the values>); Thanks for the response. Results below. Brian cemdb=> prepare foo as select count(distinct b.ts_id) from ts_stats_transetgroup_user_weekly b, ts_stats_transet_user_interval c, ts_transetgroup_transets_map m where b.ts_transet_group_id = m.ts_transet_group_id and m.ts_transet_incarnation_id = c.ts_transet_incarnation_id and c.ts_user_incarnation_id = b.ts_user_incarnation_id and c.ts_interval_start_time >= $1 and c.ts_interval_start_time < $2 and b.ts_interval_start_time >= $3 and b.ts_interval_start_time < $4; PREPARE cemdb=> explain execute foo('2010-01-03 00:00','2010-01-03 08:00','2009-12-28 00:00','2010-01-04 00:00'); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=325382.51..325382.51 rows=1 width=8) -> Hash Join (cost=3486.00..325382.00 rows=406 width=8) Hash Cond: ((b.ts_transet_group_id = m.ts_transet_group_id) AND (c.ts_transet_incarnation_id = m.ts_transet_incarnation_id)) -> Hash Join (cost=3484.17..325370.84 rows=1944 width=24) Hash Cond: (c.ts_user_incarnation_id = b.ts_user_incarnation_id) -> Bitmap Heap Scan on ts_stats_transet_user_interval c (cost=2177.34..322486.61 rows=96473 width=16) Recheck Cond: ((ts_interval_start_time >= $1) AND (ts_interval_start_time < $2)) -> Bitmap Index Scan on ts_stats_transet_user_interval_starttime (cost=0.00..2165.28 rows=96473 width=0) Index Cond: ((ts_interval_start_time >= $1) AND (ts_interval_start_time < $2)) -> Hash (cost=1301.21..1301.21 rows=898 width=24) -> Index Scan using ts_stats_transetgroup_user_weekly_starttimeindex on ts_stats_transetgroup_user_weekly b (cost=0.00..1301.21 rows=898 width=24) Index Cond: ((ts_interval_start_time >= $3) AND (ts_interval_start_time < $4)) -> Hash (cost=1.33..1.33 rows=67 width=16) -> Seq Scan on ts_transetgroup_transets_map m (cost=0.00..1.33 rows=67 width=16) (14 rows)
On Mon, Jan 4, 2010 at 5:24 PM, Brian Cox <brian.cox@ca.com> wrote: > On 01/04/2010 04:53 PM, Robert Haas [robertmhaas@gmail.com] wrote: >> >> PREPARE foo AS <the query, with the $x entries still in there> >> EXPLAIN EXECUTE foo(<the values>); > > Thanks for the response. Results below. Brian > > cemdb=> prepare foo as select count(distinct b.ts_id) from > ts_stats_transetgroup_user_weekly b, ts_stats_transet_user_interval c, > ts_transetgroup_transets_map m where b.ts_transet_group_id = > m.ts_transet_group_id and m.ts_transet_incarnation_id = > c.ts_transet_incarnation_id and c.ts_user_incarnation_id = > b.ts_user_incarnation_id and c.ts_interval_start_time >= $1 and > c.ts_interval_start_time < $2 and b.ts_interval_start_time >= $3 and > b.ts_interval_start_time < $4; > PREPARE > > cemdb=> explain execute foo('2010-01-03 00:00','2010-01-03 > 08:00','2009-12-28 00:00','2010-01-04 00:00'); > > QUERY PLAN > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=325382.51..325382.51 rows=1 width=8) > -> Hash Join (cost=3486.00..325382.00 rows=406 width=8) > Hash Cond: ((b.ts_transet_group_id = m.ts_transet_group_id) AND > (c.ts_transet_incarnation_id = m.ts_transet_incarnation_id)) > -> Hash Join (cost=3484.17..325370.84 rows=1944 width=24) > Hash Cond: (c.ts_user_incarnation_id = > b.ts_user_incarnation_id) > -> Bitmap Heap Scan on ts_stats_transet_user_interval c > (cost=2177.34..322486.61 rows=96473 width=16) > Recheck Cond: ((ts_interval_start_time >= $1) AND > (ts_interval_start_time < $2)) > -> Bitmap Index Scan on > ts_stats_transet_user_interval_starttime (cost=0.00..2165.28 rows=96473 > width=0) > Index Cond: ((ts_interval_start_time >= $1) AND > (ts_interval_start_time < $2)) > -> Hash (cost=1301.21..1301.21 rows=898 width=24) > -> Index Scan using > ts_stats_transetgroup_user_weekly_starttimeindex on > ts_stats_transetgroup_user_weekly b (cost=0.00..1301.21 rows=898 width=24) > Index Cond: ((ts_interval_start_time >= $3) AND > (ts_interval_start_time < $4)) > -> Hash (cost=1.33..1.33 rows=67 width=16) > -> Seq Scan on ts_transetgroup_transets_map m > (cost=0.00..1.33 rows=67 width=16) > (14 rows) Hmm. Looks like the same plan. It's not obvious to me what is wrong. Maybe it would make sense to start by checking the row count estimates for the different rows in this plan. For example: SELECT SUM(1) FROM ts_stats_transetgroup_user_weekly b WHERE ts_interval_start_time > [value] AND ts_interval_start_time < [value]; ...and similarly for the bitmap index scan. ...Robert
> SELECT SUM(1) FROM ts_stats_transetgroup_user_weekly b WHERE > ts_interval_start_time > [value] AND ts_interval_start_time < [value]; > > ...and similarly for the bitmap index scan. cemdb=> SELECT SUM(1) FROM ts_stats_transetgroup_user_weekly b WHERE ts_interval_start_time >= '2009-12-28' AND ts_interval_start_time < '2010-01-04'; sum ------- 89758 (1 row) cemdb=> select sum(1) from ts_stats_transet_user_interval where ts_interval_start_time >= '2009-01-03' and ts_interval_start_time < '2009-01-03 08:00'; sum ----- (1 row) cemdb=> select sum(1) from ts_stats_transet_user_interval where ts_interval_start_time >= '2010-01-03' and ts_interval_start_time < '2010-01-03 08:00'; sum -------- 800000 (1 row) the estimates in the 1st query plan are OK (since they are the "same"). The 2nd, however, look to be too low. FYI, this query finally completed, so it wasn't looping but the query plan is very poor: [24585-cemdb-admin-2010-01-05 10:54:49.511 PST]LOG: duration: 124676746.863 ms execute <unnamed>: select count(distinct b.ts_id) from ts_stats_transetgroup_user_weekly b, ts_stats_transet_user_interval c, ts_transetgroup_transets_map m where b.ts_transet_group_id = m.ts_transet_group_id and m.ts_transet_incarnation_id = c.ts_transet_incarnation_id and c.ts_user_incarnation_id = b.ts_user_incarnation_id and c.ts_interval_start_time >= $1 and c.ts_interval_start_time < $2 and b.ts_interval_start_time >= $3 and b.ts_interval_start_time < $4 [24585-cemdb-admin-2010-01-05 10:54:49.511 PST]DETAIL: parameters: $1 = '2010-01-03 00:00:00-08', $2 = '2010-01-03 08:00:00-08', $3 = '2010-01-01 00:00:00-08', $4 = '2010-01-04 00:00:00-08' compare to: [root@rdl64xeoserv01 log]# time PGPASSWORD=**** psql -U admin -d cemdb -c "select count(distinct b.ts_id) from ts_stats_transetgroup_user_weekly b, ts_stats_transet_user_interval c, ts_transetgroup_transets_map m where b.ts_transet_group_id = m.ts_transet_group_id and m.ts_transet_incarnation_id = c.ts_transet_incarnation_id and c.ts_user_incarnation_id = b.ts_user_incarnation_id and c.ts_interval_start_time >= '2010-01-03 00:00' and c.ts_interval_start_time < '2010-01-03 08:00' and b.ts_interval_start_time >= '2009-12-28 00:00' and b.ts_interval_start_time < '2010-01-04 00:00'" count ------- 89758 (1 row) real 0m3.804s user 0m0.001s sys 0m0.003s so why the former ~40,000 times slower? Thanks, Brian
also compare: [4258-cemdb-admin-2010-01-05 13:11:42.913 PST]LOG: duration: 6401.314 ms statement: execute foo('2010-01-03 00:00','2010-01-03 08:00','2009-12-28 00:00','2010-01-04 00:00'); [4258-cemdb-admin-2010-01-05 13:11:42.913 PST]DETAIL: prepare: prepare foo as select count(distinct b.ts_id) from ts_stats_transetgroup_user_weekly b, ts_stats_transet_user_interval c, ts_transetgroup_transets_map m where b.ts_transet_group_id = m.ts_transet_group_id and m.ts_transet_incarnation_id = c.ts_transet_incarnation_id and c.ts_user_incarnation_id = b.ts_user_incarnation_id and c.ts_interval_start_time >= $1 and c.ts_interval_start_time < $2 and b.ts_interval_start_time >= $3 and b.ts_interval_start_time < $4; still the original query is ~20,000 times slower. Here's the explain foo output for the execute above: cemdb=> explain execute foo('2010-01-03 00:00','2010-01-03 08:00','2009-12-28 00:00','2010-01-04 00:00'); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=347318.10..347318.11 rows=1 width=8) -> Hash Join (cost=3836.14..347317.41 rows=549 width=8) Hash Cond: ((b.ts_transet_group_id = m.ts_transet_group_id) AND (c.ts_transet_incarnation_id = m.ts_transet_incarnation_id)) -> Hash Join (cost=3834.30..347302.98 rows=2628 width=24) Hash Cond: (c.ts_user_incarnation_id = b.ts_user_incarnation_id) -> Bitmap Heap Scan on ts_stats_transet_user_interval c (cost=2199.30..343132.02 rows=103500 width=16) Recheck Cond: ((ts_interval_start_time >= $1) AND (ts_interval_start_time < $2)) -> Bitmap Index Scan on ts_stats_transet_user_interval_starttime (cost=0.00..2186.36 rows=103500 width=0) Index Cond: ((ts_interval_start_time >= $1) AND (ts_interval_start_time < $2)) -> Hash (cost=1627.99..1627.99 rows=1122 width=24) -> Index Scan using ts_stats_transetgroup_user_weekly_starttimeindex on ts_stats_transetgroup_user_weekly b (cost=0.00..1627.99 rows=1122 width=24) Index Cond: ((ts_interval_start_time >= $3) AND (ts_interval_start_time < $4)) -> Hash (cost=1.33..1.33 rows=67 width=16) -> Seq Scan on ts_transetgroup_transets_map m (cost=0.00..1.33 rows=67 width=16) (14 rows) comparing this to the 1st explain foo output shows some minor differences in row estimates -- but nothing, I assume, that could explain the huge time difference. Of course, the 1st plan may not (and probably? wasn't) the plan that was used to take 124M ms. Any thoughts on how to avoid this? Thanks, Brian
On Tue, Jan 5, 2010 at 4:33 PM, Brian Cox <brian.cox@ca.com> wrote: > comparing this to the 1st explain foo output shows some minor differences in > row estimates -- but nothing, I assume, that could explain the huge time > difference. Of course, the 1st plan may not (and probably? wasn't) the plan > that was used to take 124M ms. > > Any thoughts on how to avoid this? The incorrect row estimates can only foul up the plan; they can't directly make anything slow. Comparing the two plans line by line, the only difference I see is the fast plan has: -> Seq Scan on ts_stats_transetgroup_user_weekly b (cost=0.00..23787.37 rows=89590 width=24) (actual time=0.040..295.414 rows=89758 loops=1) Filter: ((ts_interval_start_time >= '2009-12-28 00:00:00-08'::timestamp with time zone) AND (ts_interval_start_time < '2010-01-04 00:00:00-08'::timestamp with time zone)) ...while the slow one has: Index Scan using ts_stats_transetgroup_user_weekly_starttimeindex on ts_stats_transetgroup_user_weekly b (cost=0.00..1301.21 rows=898 width=24) Index Cond: ((ts_interval_start_time >= $3) AND (ts_interval_start_time < $4)) So it looks like using that index to fetch the data is a LOT slower than just scanning the whole table. In terms of fixing this problem, I have two ideas: - If you don't have any queries where this index makes things faster, then you can just drop the index. - If you have other queries where this index helps (even though it is hurting this one), then you're going to have to find a way to execute the query without using bound parameters - i.e. with the actual values in there instead of $1 through $4. That will allow the planner to see that the index scan is a loser because it will see that there are a lot of rows in the specified range of ts_interval_start_times. ...Robert
On 01/05/2010 08:34 PM, Robert Haas [robertmhaas@gmail.com] wrote: > - If you have other queries where this index helps (even though it is > hurting this one), then you're going to have to find a way to execute > the query without using bound parameters - i.e. with the actual values > in there instead of $1 through $4. That will allow the planner to see > that the index scan is a loser because it will see that there are a > lot of rows in the specified range of ts_interval_start_times. I think that this is possible without too much work. FYI - this test is still running and the same query has been executed at least 2 more times (it gets done 1-24 times per day) since it took 124M ms with acceptable response times (several secs). I don't see how either of the 2 query plans posted could've taken that long (and the actually execution times I posted confirm this), so I'm assuming that some other plan was used for the 124M ms execution. Seems like it must have been some NxM plan. Do you think that autovacuuming more frequently would prevent the query planner from making this poor choice? Brian
On Thu, Jan 7, 2010 at 1:43 PM, Brian Cox <brian.cox@ca.com> wrote: > On 01/05/2010 08:34 PM, Robert Haas [robertmhaas@gmail.com] wrote: >> >> - If you have other queries where this index helps (even though it is >> hurting this one), then you're going to have to find a way to execute >> the query without using bound parameters - i.e. with the actual values >> in there instead of $1 through $4. That will allow the planner to see >> that the index scan is a loser because it will see that there are a >> lot of rows in the specified range of ts_interval_start_times. > > I think that this is possible without too much work. Oh, good. > FYI - this test is still running and the same query has been executed at > least 2 more times (it gets done 1-24 times per day) since it took 124M ms > with acceptable response times (several secs). I don't see how either of the > 2 query plans posted could've taken that long (and the actually execution > times I posted confirm this), so I'm assuming that some other plan was used > for the 124M ms execution. Seems like it must have been some NxM plan. Do > you think that autovacuuming more frequently would prevent the query planner > from making this poor choice? That seems pretty speculative... I'm not really sure. ...Robert