Thread: Two slightly different queries produce same results, one takes 1/4 the time.. bug in planner?
From: Andrew Schmidt
Summary: Two queries that are nearly identical AND use the exact same indices take very different times to execute. Both queries produce exactly the same results, but one takes 1072 ms to execute, the other 262 ms. Although I have a fix, it seems more of a hack, and I'd also like to know why it's happening.

Details: I run a hockey pool website that uses MySQL 4.x with InnoDB. This week I've been trying out PostgreSQL with the site to check out the performance difference. Standard SELECT statements seem to be about twice as fast with Postgres as opposed to InnoDB. I really wasn't expecting this, but was impressed nonetheless. I run on FreeBSD 5.4.

However, I've run into a problem where one query took about twice as long as InnoDB. After some investigation and playing around with the query, I ended up with a slightly different query that was about 3 times as fast as InnoDB (and 5 times faster than the original query). I didn't add any indices, and EXPLAIN ANALYZE doesn't show any new indices being used with the new query. The order in which the tables are read was changed by the query planner, and it also uses a HashAggregate instead of a GroupAggregate; I'm not entirely sure what that means.

Table structure:

  team           - list of teams for each pool
  team_players   - list of player_ids for each team
  player         - list of NHL players
  player_updates - for each NHL player, the stats on each day they played
                   (e.g., Crosby on Oct 10 got 1 goal and 2 assists)

There are other tables, but these are the ones pertinent to my problem. So a query to find the total goals per team for an entire pool is fairly straightforward. Because different points are awarded if the NHL player is a goalie, I group by team_id / is_goalie.
SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS total_goals
FROM team t
JOIN team_players tp ON tp.team_id = t.team_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id, is_goalie;

Explain analyze of this query:

                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate  (cost=8742.52..9853.85 rows=44453 width=13) (actual time=1186.973..1432.548 rows=626 loops=1)
  ->  Sort  (cost=8742.52..8853.66 rows=44453 width=13) (actual time=1186.237..1309.562 rows=40928 loops=1)
        Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)
        ->  Hash Join  (cost=174.52..4812.28 rows=44453 width=13) (actual time=51.099..400.119 rows=40928 loops=1)
              Hash Cond: ("outer".player_id = "inner".player_id)
              ->  Hash Join  (cost=30.85..3840.72 rows=5731 width=17) (actual time=8.303..115.834 rows=5005 loops=1)
                    Hash Cond: ("outer".player_id = "inner".player_id)
                    ->  Nested Loop  (cost=0.00..3723.90 rows=5731 width=8) (actual time=0.108..71.378 rows=5005 loops=1)
                          ->  Index Scan using team_pool_id on team t  (cost=0.00..397.90 rows=391 width=4) (actual time=0.061..2.129 rows=313 loops=1)
                                Index Cond: (pool_id = 21699)
                          ->  Index Scan using team_id_asdas on team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual time=0.029..0.132 rows=16 loops=313)
                                Index Cond: (tp.team_id = "outer".team_id)
                    ->  Hash  (cost=28.48..28.48 rows=948 width=9) (actual time=8.140..8.140 rows=948 loops=1)
                          ->  Seq Scan on player pl  (cost=0.00..28.48 rows=948 width=9) (actual time=0.043..4.470 rows=948 loops=1)
              ->  Hash  (cost=130.14..130.14 rows=5414 width=8) (actual time=42.690..42.690 rows=5414 loops=1)
                    ->  Seq Scan on player_updates pu  (cost=0.00..130.14 rows=5414 width=8) (actual time=0.013..21.410 rows=5414 loops=1)
Total runtime: 1436.617 ms
(17 rows)

Time: 1443.127 ms

I tried forcing the planner to use
the indices on player_updates and player, but the query took longer.

Now the slight modification, where I put the pool_id in the JOIN of team and team_players (but also keep the check in the WHERE clause):

SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS total_goals
FROM team t
JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id = 21699  <--- ** added pool_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id, is_goalie;

The explain analyze:

                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=271.72..280.21 rows=566 width=13) (actual time=478.783..481.127 rows=626 loops=1)
  ->  Hash Join  (cost=89.66..267.48 rows=566 width=13) (actual time=124.015..328.387 rows=40928 loops=1)
        Hash Cond: ("outer".player_id = "inner".player_id)
        ->  Seq Scan on player_updates pu  (cost=0.00..130.14 rows=5414 width=8) (actual time=0.021..16.934 rows=5414 loops=1)
        ->  Hash  (cost=89.48..89.48 rows=73 width=17) (actual time=123.872..123.872 rows=5005 loops=1)
              ->  Hash Join  (cost=53.16..89.48 rows=73 width=17) (actual time=79.843..104.031 rows=5005 loops=1)
                    Hash Cond: ("outer".player_id = "inner".player_id)
                    ->  Seq Scan on player pl  (cost=0.00..28.48 rows=948 width=9) (actual time=0.040..3.103 rows=948 loops=1)
                    ->  Hash  (cost=52.97..52.97 rows=73 width=8) (actual time=79.740..79.740 rows=5005 loops=1)
                          ->  Nested Loop  (cost=0.00..52.97 rows=73 width=8) (actual time=0.098..58.145 rows=5005 loops=1)
                                ->  Index Scan using team_pool_id on team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556 rows=313 loops=1)
                                      Index Cond: ((pool_id = 21699) AND (pool_id = 21699))
                                ->  Index Scan using team_id_asdas on team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual time=0.021..0.099 rows=16 loops=313)
                                      Index Cond: (tp.team_id = "outer".team_id)
Total runtime: 482.791 ms
(15 rows)

Time: 489.270 ms

So there are no new indices being used, but the order in which the tables are read makes it much faster. Also, if the pool_id check is removed from the WHERE clause (so it's only in the join clause), it goes back to the slower plan. Originally I thought the slowdown was because it wasn't using the player_id index on player and player_updates, but that doesn't seem to be the case. Both queries return exactly the same result (626 rows, same stats).

Conclusion: Is this a bug or my misunderstanding? If it's a bug, I can submit a more detailed step-by-step email. If it's just my misunderstanding, can someone explain why? I've always been under the impression that a join should only contain the fields pertinent to the join. Also, does it need the pool_id again in the WHERE clause when it's already in the join clause?

thanks a bunch,

-- Andrew

I apologize if the formatting gets messed up.
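As a sanity check that the two forms really are the same query (only the plan differs), here is a toy reproduction against a made-up miniature schema. The table and column names mirror the ones above, but the data is invented, and Python's sqlite3 stands in for PostgreSQL purely for convenience:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE team (team_id INTEGER, pool_id INTEGER);
CREATE TABLE team_players (team_id INTEGER, player_id INTEGER);
CREATE TABLE player (player_id INTEGER, position TEXT);
CREATE TABLE player_updates (player_id INTEGER, goals INTEGER);
INSERT INTO team VALUES (1, 21699), (2, 21699), (3, 99);       -- team 3 is in another pool
INSERT INTO team_players VALUES (1, 10), (1, 11), (2, 10);
INSERT INTO player VALUES (10, 'G'), (11, 'D');
INSERT INTO player_updates VALUES (10, 0), (10, 1), (11, 2);   -- two game days for player 10
""")

base = """
SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS total_goals
FROM team t JOIN team_players tp ON tp.team_id = t.team_id {extra}
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id, is_goalie
ORDER BY tp.team_id, is_goalie
"""

# Original form vs. the form with the pool_id predicate duplicated into the JOIN.
slow = conn.execute(base.format(extra="")).fetchall()
fast = conn.execute(base.format(extra="AND t.pool_id = 21699")).fetchall()
assert slow == fast  # identical result sets; only the PostgreSQL plan differed
```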
Andrew Schmidt wrote:
> However, I've run into a problem where one query took about twice as
> long as innodb. Some investigation and playing around with the query, I
> ended up with a slightly different query but was about 3 times as fast
> as innodb (and 5 times faster than the original query). I didn't add
> any indices and the EXPLAIN ANALYZE doesn't show any new indices being
> used with the new query. The order of which tables are read first was
> changed in the query planner and also it uses HashAggregate vs
> GroupAggregate.. not entirely sure what that means.

It's a different method of aggregating data (calculating your sum()). The HashAggregate puts values into buckets and works best when you have a fairly small number of different values (so each value can get its own bucket). I think - bear in mind I'm just another user, so I'm not familiar with the internals. The different plans are because PG thinks it will get different numbers of rows...

> SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS total_goals
> FROM team t JOIN team_players tp ON tp.team_id = t.team_id
> JOIN player_updates pu ON pu.player_id = tp.player_id
> JOIN player pl ON pl.player_id = pu.player_id
> WHERE t.pool_id = 21699
> GROUP BY tp.team_id, is_goalie;
>
> Explain analyze of this query:
>   ->  Index Scan using team_pool_id on team t  (cost=0.00..397.90 rows=391 width=4) (actual time=0.061..2.129 rows=313 loops=1)
>         Index Cond: (pool_id = 21699)

With this index scan you have one condition and PG expects 391 matching rows (it gets 313, so that's a pretty good estimate).
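To illustrate the two aggregation strategies in miniature (a toy Python sketch of the idea, not PostgreSQL's actual implementation): HashAggregate does one pass keeping a bucket per distinct key, while GroupAggregate sorts first and then sums each run of equal keys, which is why the slow plan has a Sort node feeding the aggregate.

```python
from collections import defaultdict
from itertools import groupby

rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

def hash_aggregate(rows):
    # HashAggregate-style: one bucket per distinct key, single pass, no sort.
    buckets = defaultdict(int)
    for key, val in rows:
        buckets[key] += val
    return dict(buckets)

def group_aggregate(rows):
    # GroupAggregate-style: sort the input, then sum each run of equal keys.
    out = {}
    for key, grp in groupby(sorted(rows), key=lambda r: r[0]):
        out[key] = sum(val for _, val in grp)
    return out

assert hash_aggregate(rows) == group_aggregate(rows) == {"a": 9, "b": 6}
```

Both give the same answer; the trade-off is memory for the buckets versus the cost of the sort, and the planner picks between them based on its row estimates.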
> SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS total_goals
> FROM team t JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id = 21699  <--- ** added pool_id
> JOIN player_updates pu ON pu.player_id = tp.player_id
> JOIN player pl ON pl.player_id = pu.player_id
> WHERE t.pool_id = 21699
> GROUP BY tp.team_id, is_goalie;
>
> The explain analyze:
>   ->  Index Scan using team_pool_id on team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556 rows=313 loops=1)
>         Index Cond: ((pool_id = 21699) AND (pool_id = 21699))

Here it sees two conditions, and PG is expecting that will restrict things further. It's clearly not expecting the same condition twice, but is presumably treating it something like "a > -4 AND a < 4". This actually makes the row estimate wrong.

So - if PG is getting this bit wrong, why is query 2 faster? Looking at the row estimates for case 1 we can see that the final steps are badly wrong.

> GroupAggregate  (cost=8742.52..9853.85 rows=44453 width=13) (actual time=1186.973..1432.548 rows=626 loops=1)

Here it's got the number of rows badly wrong - if you actually ended up with 44,000 rows then maybe this would be the best plan. This is the last step though, so what happens before?

>   ->  Sort  (cost=8742.52..8853.66 rows=44453 width=13) (actual time=1186.237..1309.562 rows=40928 loops=1)
>         Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)

We're about right here. The problem must be in the last step, and looking at the row estimates it seems to think that no summarising is actually going to take place.

I think the "position = 'G'" bit is confusing it. Well, we can test that:

1. Just group by "pl.position" - remove the test. Do the estimated and actual rows match now?

2. Rewrite the position = 'G' as a CASE structure.

3. Do the position = 'G' part in a sub-query and join to that.

The results of those should show whether this is actually the problem.

-- 
  Richard Huxton
  Archonet Ltd
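Richard's "treating it something like a > -4 AND a < 4" reading fits the numbers in the two plans. If the planner multiplies the selectivities of the two (identical) conditions as though they were independent, the "rows=5" estimate falls out exactly. Back-of-envelope arithmetic, using the team row count from later in the thread and assuming simple independent-selectivity estimation:

```python
team_rows = 31720                # rows in team (Andrew's count, later in thread)
est_one = 391                    # planner's estimate for one "pool_id = 21699"
sel = est_one / team_rows        # per-condition selectivity, about 0.0123

# Apply the same condition twice as if independent: multiply selectivities.
est_two = team_rows * sel * sel
assert round(est_two) == 5       # matches the "rows=5" in the second plan
```

So the duplicated predicate makes the planner think the team scan yields ~5 rows instead of ~391, which (by luck) pushes it toward the join order and HashAggregate that happen to be faster here.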
Hi Richard,

While in the process of responding to your email I found out what was doing it: it was grouping by the team_players team_id instead of the team team_id.

Here are some table numbers:
team => 31720 rows, team_players => 464896 rows, player => 948 rows, player_updates => 5414 rows

The query:

SELECT tp.team_id, SUM(pl.player_id), SUM(goals) AS total_goals
FROM team t
JOIN team_players tp ON tp.team_id = t.team_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id;

It's grouping by the team_players tp.team_id; change the group to t.team_id and bam! 5x faster.

thanks for the help,

-- Andrew

******* This was my response to you Richard before I found out what it was. *******

I removed position from the query, but still got the same query times. I summed up the pl.player_id just to make sure it was reading the player table.

EXPLAIN ANALYSE
SELECT tp.team_id, SUM(pl.player_id), SUM(goals) AS total_goals
FROM team t
JOIN team_players tp ON tp.team_id = t.team_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id;

                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate  (cost=8584.45..9364.83 rows=22390 width=12) (actual time=894.364..1140.884 rows=313 loops=1)
  ->  Sort  (cost=8584.45..8695.59 rows=44453 width=12) (actual time=893.373..1013.989 rows=40928 loops=1)
        Sort Key: tp.team_id
        ->  Hash Join  (cost=174.52..4701.14 rows=44453 width=12) (actual time=50.433..328.788 rows=40928 loops=1)
              Hash Cond: ("outer".player_id = "inner".player_id)
              ->  Hash Join  (cost=30.85..3840.72 rows=5731 width=12) (actual time=7.279..110.578 rows=5005 loops=1)
                    Hash Cond: ("outer".player_id = "inner".player_id)
                    ->  Nested Loop  (cost=0.00..3723.90 rows=5731 width=8) (actual time=0.106..69.528 rows=5005 loops=1)
                          ->  Index Scan using team_pool_id on team t  (cost=0.00..397.90 rows=391 width=4) (actual time=0.058..2.313 rows=313 loops=1)
                                Index Cond: (pool_id = 21699)
                          ->  Index Scan using team_id_asdas on team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual time=0.027..0.122 rows=16 loops=313)
                                Index Cond: (tp.team_id = "outer".team_id)
                    ->  Hash  (cost=28.48..28.48 rows=948 width=4) (actual time=7.121..7.121 rows=948 loops=1)
                          ->  Seq Scan on player pl  (cost=0.00..28.48 rows=948 width=4) (actual time=0.042..3.930 rows=948 loops=1)
              ->  Hash  (cost=130.14..130.14 rows=5414 width=8) (actual time=43.070..43.070 rows=5414 loops=1)
                    ->  Seq Scan on player_updates pu  (cost=0.00..130.14 rows=5414 width=8) (actual time=0.013..22.410 rows=5414 loops=1)
Total runtime: 1144.059 ms
(17 rows)

--- VS fast query

EXPLAIN ANALYSE
SELECT tp.team_id, SUM(pl.player_id), SUM(goals) AS total_goals
FROM team t
JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id = 21699
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id;

                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=270.31..278.80 rows=566 width=12) (actual time=409.428..410.551 rows=313 loops=1)
  ->  Hash Join  (cost=89.66..266.06 rows=566 width=12) (actual time=120.341..278.310 rows=40928 loops=1)
        Hash Cond: ("outer".player_id = "inner".player_id)
        ->  Seq Scan on player_updates pu  (cost=0.00..130.14 rows=5414 width=8) (actual time=0.019..16.833 rows=5414 loops=1)
        ->  Hash  (cost=89.48..89.48 rows=73 width=12) (actual time=120.209..120.209 rows=5005 loops=1)
              ->  Hash Join  (cost=53.16..89.48 rows=73 width=12) (actual time=78.672..102.434 rows=5005 loops=1)
                    Hash Cond: ("outer".player_id = "inner".player_id)
                    ->  Seq Scan on player pl  (cost=0.00..28.48 rows=948 width=4) (actual time=0.040..3.092 rows=948 loops=1)
                    ->  Hash  (cost=52.97..52.97 rows=73 width=8) (actual time=78.577..78.577 rows=5005 loops=1)
                          ->  Nested Loop  (cost=0.00..52.97 rows=73 width=8) (actual time=0.100..57.729 rows=5005 loops=1)
                                ->  Index Scan using team_pool_id on team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.050..1.591 rows=313 loops=1)
                                      Index Cond: ((pool_id = 21699) AND (pool_id = 21699))
                                ->  Index Scan using team_id_asdas on team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual time=0.021..0.099 rows=16 loops=313)
                                      Index Cond: (tp.team_id = "outer".team_id)
Total runtime: 411.629 ms
(15 rows)

Time: 417.726 ms

I ran ANALYZE on each of the tables, just to be sure, but there were no changes. Also, here's some table info:
team => 31720 rows, team_players => 464896 rows, player => 948 rows, player_updates => 5414 rows

thanks,

-- Andrew

>> The explain analyze:
>>   ->  Index Scan using team_pool_id on team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556 rows=313 loops=1)
>>         Index Cond: ((pool_id = 21699) AND (pool_id = 21699))
>
> Here it sees two conditions, and PG is expecting that will restrict
> things further. It's clearly not expecting the same condition twice,
> but is presumably treating it something like "a > -4 AND a < 4". This
> actually makes the row estimate wrong.
>
> So - if PG is getting this bit wrong why is query 2 faster?
> Looking at the row estimates for case 1 we can see that the final
> cases are badly wrong.
>
> > GroupAggregate  (cost=8742.52..9853.85 rows=44453 width=13) (actual time=1186.973..1432.548 rows=626 loops=1)
>
> Here it's got the number of rows badly wrong - if you actually ended
> up with 44,000 rows then maybe this would be the best plan. This is
> the last step though, so what happens before?
>
> >   ->  Sort  (cost=8742.52..8853.66 rows=44453 width=13) (actual time=1186.237..1309.562 rows=40928 loops=1)
> >         Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)
>
> We're about right here.
> The problem must be in the last step, and looking at the row estimates
> it seems to think that no summarising is actually going to take place.
>
> I think the "position = 'G'" bit is confusing it. Well, we can test that:
>
> 1. Just group by "pl.position" - remove the test. Do the estimated and
>    actual rows match now?
>
> 2. Rewrite the position = 'G' as a CASE structure.
>
> 3. Do the position = 'G' part in a sub-query and join to that.
>
> The results of those should show whether this is actually the problem.
>
> --
>   Richard Huxton
>   Archonet Ltd

-- 
_____________________________________________
 G l o b a l   D i a g n o s t i c s   I n c.
 Andrew Schmidt          t.416-304-0049 x206
 aschmidt@lifescale.com        f.866-697-8726
_____________________________________________
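The fix works because the join condition tp.team_id = t.team_id makes the two grouping columns interchangeable for the result (though evidently not for the planner's estimates). A toy check of that equivalence, again with an invented miniature schema and sqlite3 standing in for convenience:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE team (team_id INTEGER, pool_id INTEGER);
CREATE TABLE team_players (team_id INTEGER, player_id INTEGER);
INSERT INTO team VALUES (1, 21699), (2, 21699), (3, 99);
INSERT INTO team_players VALUES (1, 10), (1, 11), (2, 10);
""")

q = """
SELECT {col}, COUNT(*) FROM team t
JOIN team_players tp ON tp.team_id = t.team_id
WHERE t.pool_id = 21699
GROUP BY {col} ORDER BY 1
"""

# Group by the team_players column vs. the team column: the join equates
# the two, so the groups (and the aggregates) are identical.
by_tp = conn.execute(q.format(col="tp.team_id")).fetchall()
by_t = conn.execute(q.format(col="t.team_id")).fetchall()
assert by_tp == by_t
```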