Thread: Two slightly different queries produce same results, one takes 1/4 the time.. bug in planner?

Summary:

Two queries that are nearly identical AND use the exact same indices
take very different times to execute.  Both queries produce the exact
same results.  One takes 1072ms to execute, the other 262ms.  Although I
have a fix, it seems more of a hack, and I'd also like to know why it's
happening.

Details:

I run a hockey pool website that uses MySQL 4.x with InnoDB.  This week
I've been trying out PostgreSQL with the site to check out the
performance difference.  Standard select statements seem to be about
twice as fast with PostgreSQL as opposed to InnoDB.  I really wasn't
expecting this, but was impressed nonetheless.  I run on FreeBSD 5.4.

However, I've run into a problem where one query took about twice as
long as on InnoDB.  After some investigation and playing around with the
query, I ended up with a slightly different query that was about 3 times
as fast as InnoDB (and 5 times faster than the original query).  I didn't
add any indices, and the EXPLAIN ANALYZE doesn't show any new indices
being used with the new query.  The order in which the tables are read
was changed by the query planner, and it also uses HashAggregate vs
GroupAggregate.. not entirely sure what that means.

Table structure:

team - list of teams for each pool,
team_players - list of player_id's for each team
player - list of nhl players
player_updates - for each nhl player the stats on each day they played
(ie, crosby on oct 10, got 1 goal and 2 assists)

There are other tables but these are the ones pertinent to my problem.
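For reference, the relevant columns look roughly like this (a simplified
sketch, not my actual DDL -- I've guessed at the exact types):

```sql
-- Simplified sketch of the four tables (column types are my best guess)
CREATE TABLE team (
    team_id   integer PRIMARY KEY,
    pool_id   integer NOT NULL     -- indexed by team_pool_id
);

CREATE TABLE team_players (
    team_id   integer NOT NULL,    -- indexed by team_id_asdas
    player_id integer NOT NULL
);

CREATE TABLE player (
    player_id integer PRIMARY KEY,
    position  char(1)              -- 'G' means goalie
);

CREATE TABLE player_updates (
    player_id integer NOT NULL,
    goals     integer
    -- plus the date and the rest of the per-game stats
);
```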

So a query to find the total goals per team for an entire pool is fairly
straightforward.  Because different points are awarded if the NHL player
is a goalie, I group by team_id / is_goalie.


SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS total_goals
 FROM
 team t JOIN team_players tp ON tp.team_id = t.team_id
 JOIN player_updates pu ON pu.player_id = tp.player_id
 JOIN player pl ON pl.player_id = pu.player_id
 WHERE t.pool_id = 21699
 GROUP BY tp.team_id, is_goalie;

Explain analyze of this query:


                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=8742.52..9853.85 rows=44453 width=13) (actual
time=1186.973..1432.548 rows=626 loops=1)
   ->  Sort  (cost=8742.52..8853.66 rows=44453 width=13) (actual
time=1186.237..1309.562 rows=40928 loops=1)
         Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)
         ->  Hash Join  (cost=174.52..4812.28 rows=44453 width=13)
(actual time=51.099..400.119 rows=40928 loops=1)
               Hash Cond: ("outer".player_id = "inner".player_id)
               ->  Hash Join  (cost=30.85..3840.72 rows=5731 width=17)
(actual time=8.303..115.834 rows=5005 loops=1)
                     Hash Cond: ("outer".player_id = "inner".player_id)
                     ->  Nested Loop  (cost=0.00..3723.90 rows=5731
width=8) (actual time=0.108..71.378 rows=5005 loops=1)
                           ->  Index Scan using team_pool_id on team t
(cost=0.00..397.90 rows=391 width=4) (actual time=0.061..2.129 rows=313
loops=1)
                                 Index Cond: (pool_id = 21699)
                           ->  Index Scan using team_id_asdas on
team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual
time=0.029..0.132 rows=16 loops=313)
                                 Index Cond: (tp.team_id = "outer".team_id)
                     ->  Hash  (cost=28.48..28.48 rows=948 width=9)
(actual time=8.140..8.140 rows=948 loops=1)
                           ->  Seq Scan on player pl  (cost=0.00..28.48
rows=948 width=9) (actual time=0.043..4.470 rows=948 loops=1)
               ->  Hash  (cost=130.14..130.14 rows=5414 width=8) (actual
time=42.690..42.690 rows=5414 loops=1)
                     ->  Seq Scan on player_updates pu
(cost=0.00..130.14 rows=5414 width=8) (actual time=0.013..21.410
rows=5414 loops=1)
 Total runtime: 1436.617 ms
(17 rows)

Time: 1443.127 ms



I tried forcing the planner to use the indices on player_updates
and player, but the query took longer.
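In case anyone wants to reproduce that test: I did it with session-level
planner settings, something like:

```sql
-- Temporarily discourage sequential scans so the planner picks the
-- player/player_updates indexes (testing only, session-local):
SET enable_seqscan = off;
-- ... re-run the EXPLAIN ANALYZE above ...
RESET enable_seqscan;
```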

Now for the slight modification, where I put the pool_id check in the JOIN
of team and team_players (but also keep it in the WHERE clause):

SELECT tp.team_id,pl.position = 'G' AS is_goalie,SUM(goals) AS total_goals
 FROM
 team t JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id = 21699   <--- ** added pool_id
 JOIN player_updates pu ON pu.player_id = tp.player_id
 JOIN player pl ON pl.player_id = pu.player_id
 WHERE t.pool_id = 21699
 GROUP BY tp.team_id,is_goalie;

The explain analyze:


                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=271.72..280.21 rows=566 width=13) (actual
time=478.783..481.127 rows=626 loops=1)
   ->  Hash Join  (cost=89.66..267.48 rows=566 width=13) (actual
time=124.015..328.387 rows=40928 loops=1)
         Hash Cond: ("outer".player_id = "inner".player_id)
         ->  Seq Scan on player_updates pu  (cost=0.00..130.14 rows=5414
width=8) (actual time=0.021..16.934 rows=5414 loops=1)
         ->  Hash  (cost=89.48..89.48 rows=73 width=17) (actual
time=123.872..123.872 rows=5005 loops=1)
               ->  Hash Join  (cost=53.16..89.48 rows=73 width=17)
(actual time=79.843..104.031 rows=5005 loops=1)
                     Hash Cond: ("outer".player_id = "inner".player_id)
                     ->  Seq Scan on player pl  (cost=0.00..28.48
rows=948 width=9) (actual time=0.040..3.103 rows=948 loops=1)
                     ->  Hash  (cost=52.97..52.97 rows=73 width=8)
(actual time=79.740..79.740 rows=5005 loops=1)
                           ->  Nested Loop  (cost=0.00..52.97 rows=73
width=8) (actual time=0.098..58.145 rows=5005 loops=1)
                                 ->  Index Scan using team_pool_id on
team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556
rows=313 loops=1)
                                       Index Cond: ((pool_id = 21699)
AND (pool_id = 21699))
                                 ->  Index Scan using team_id_asdas on
team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual
time=0.021..0.099 rows=16 loops=313)
                                       Index Cond: (tp.team_id =
"outer".team_id)
 Total runtime: 482.791 ms
(15 rows)

Time: 489.270 ms


So there are no new indices being used, but the order in which the
tables are read makes it much faster.  Also, if the pool_id check is
removed from the WHERE clause (so it's only in the join clause), it
goes back to the slower plan.
Originally I thought the slowdown was because it wasn't using the
player_id index on player and player_updates, but that doesn't seem to
be the case.

Both queries return the exact same result (626 rows, same stats).

Conclusion:

Is this a bug or my misunderstanding?  If it's a bug, I can submit a
more detailed step-by-step email.  If it's just my misunderstanding,
can someone explain why?  I've always been under the impression that a
join should only contain the fields pertinent to the join.  Also, does it
need the pool_id again in the WHERE clause when it's in the join clause?

thanks a bunch,

-- Andrew

I apologize if the formatting gets messed up.



Re: Two slightly different queries produce same results, one takes 1/4 the time

From: Richard Huxton
Andrew Schmidt wrote:
> However, I've run into a problem where one query took about twice as
> long as innodb.  Some investigation and playing around with the query, I
> ended up with a slightly different query but was about 3 times as fast
> as innodb (and 5 times faster than the original query).  I didn't add
> any indices and the EXPLAIN ANALYZE doesn't show any new indices being
> used with the new query.  The order of which tables are read first was
> changed in the query planner and also it uses HashAggregate vs
> GroupAggregate.. not entirely sure what that means.

It's a different method of aggregating data (calculating your sum()).
The HashAggregate puts values into buckets and works best when you have
a fairly small number of different values (so each value can get its own
bucket). I think - bear in mind I'm just another user, so I'm not
familiar with the internals.
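
One way to see the two strategies side by side is to disable hash
aggregation for the session (enable_hashagg is a standard planner
setting) and re-run the EXPLAIN ANALYZE:

```sql
-- Force the sort-based GroupAggregate plan for comparison:
SET enable_hashagg = off;
EXPLAIN ANALYZE
 SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS total_goals
  FROM team t JOIN team_players tp ON tp.team_id = t.team_id
  JOIN player_updates pu ON pu.player_id = tp.player_id
  JOIN player pl ON pl.player_id = pu.player_id
  WHERE t.pool_id = 21699
  GROUP BY tp.team_id, is_goalie;
RESET enable_hashagg;   -- back to letting the planner choose
```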

The different plans are because PG thinks it will get different numbers
of rows...

> SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS
> total_goals
> FROM
> team t JOIN team_players tp ON tp.team_id = t.team_id
> JOIN player_updates pu ON pu.player_id = tp.player_id
> JOIN player pl ON pl.player_id = pu.player_id
> WHERE t.pool_id = 21699
> GROUP BY tp.team_id,is_goalie;

> Explain analyze of this query:

>                           ->  Index Scan using team_pool_id on team t
> (cost=0.00..397.90 rows=391 width=4) (actual time=0.061..2.129 rows=313
> loops=1)
>                                 Index Cond: (pool_id = 21699)

With this index scan you have one condition and PG expects 391 matching
rows (it gets 313, so that's a pretty good estimate).

> SELECT tp.team_id,pl.position = 'G' AS is_goalie,SUM(goals) AS total_goals
> FROM
> team t JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id =
> 21699  <--- ** added pool_id
> JOIN player_updates pu ON pu.player_id = tp.player_id
> JOIN player pl ON pl.player_id = pu.player_id
> WHERE t.pool_id = 21699
> GROUP BY tp.team_id,is_goalie;
>
> The explain analyze:
>                                 ->  Index Scan using team_pool_id on
> team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556
> rows=313 loops=1)
>                                       Index Cond: ((pool_id = 21699) AND
> (pool_id = 21699))

Here it sees two conditions, and PG is expecting that will restrict
things further. It's clearly not expecting the same condition twice, but
is presumably treating it something like "a>-4 AND a<4". This actually
makes the row estimate wrong.

So - if PG is getting this bit wrong why is query 2 faster?
Looking at the row estimates for case 1 we can see that the final cases
are badly wrong.

 > GroupAggregate  (cost=8742.52..9853.85 rows=44453 width=13) (actual
 > time=1186.973..1432.548 rows=626 loops=1)

Here it's got the number of rows badly wrong - if you actually ended up
with 44,000 rows then maybe this would be the best plan. This is the
last step though, so what happens before?

 >   ->  Sort  (cost=8742.52..8853.66 rows=44453 width=13) (actual
 > time=1186.237..1309.562 rows=40928 loops=1)
 >         Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)

We're about right here. The problem must be in the last step, and
looking at the row estimates it seems to think that no summarising is
actually going to take place.

I think the "position = 'G'" bit is confusing it. Well, we can test that:

1. Just group by "pl.position" - remove the test. Do the estimated and
actual rows match now?

2. Rewrite the position='G' as a CASE structure

3. Do the position='G' part in a sub-query and join to that.

The results of those should show whether this is actually the problem.
--
   Richard Huxton
   Archonet Ltd

Re: Two slightly different queries produce same results, one takes 1/4 the time

From: Andrew Schmidt
Hi Richard,

While in the process of responding to your email, I found out what was
causing it.  It was the grouping by the team_players team_id instead of
the team team_id.

Here's some table numbers:

team => 31720 rows,
team_players => 464896 rows,
player => 948 rows
player_updates => 5414 rows

The query:
 SELECT tp.team_id,SUM(pl.player_id),SUM(goals) as total_goals
 FROM
 team t JOIN team_players tp ON tp.team_id = t.team_id
 JOIN player_updates pu ON pu.player_id = tp.player_id
 JOIN player pl ON pl.player_id = pu.player_id
 WHERE t.pool_id = 21699
 GROUP BY tp.team_id;


It was grouping by the team_players tp.team_id; change the GROUP BY to
t.team_id and bam! 5x faster.
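
For the record, the fast version is just the same query with the grouping
column taken from team instead of team_players:

```sql
-- Grouping by t.team_id instead of tp.team_id (the two are equal
-- through the join, but the planner treats them very differently):
SELECT t.team_id, SUM(pl.player_id), SUM(goals) AS total_goals
 FROM
 team t JOIN team_players tp ON tp.team_id = t.team_id
 JOIN player_updates pu ON pu.player_id = tp.player_id
 JOIN player pl ON pl.player_id = pu.player_id
 WHERE t.pool_id = 21699
 GROUP BY t.team_id;
```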

thanks for the help,

-- Andrew


*******
This was my response to you Richard before I found out what it was.
*******
I removed position from the query, but still got the same query times.
I summed up the pl.player_id just to make sure it was reading the player
table.


EXPLAIN ANALYSE
 SELECT tp.team_id,SUM(pl.player_id),SUM(goals) as total_goals
 FROM
 team t JOIN team_players tp ON tp.team_id = t.team_id
 JOIN player_updates pu ON pu.player_id = tp.player_id
 JOIN player pl ON pl.player_id = pu.player_id
 WHERE t.pool_id = 21699
 GROUP BY tp.team_id;

                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=8584.45..9364.83 rows=22390 width=12) (actual
time=894.364..1140.884 rows=313 loops=1)
   ->  Sort  (cost=8584.45..8695.59 rows=44453 width=12) (actual
time=893.373..1013.989 rows=40928 loops=1)
         Sort Key: tp.team_id
         ->  Hash Join  (cost=174.52..4701.14 rows=44453 width=12)
(actual time=50.433..328.788 rows=40928 loops=1)
               Hash Cond: ("outer".player_id = "inner".player_id)
               ->  Hash Join  (cost=30.85..3840.72 rows=5731 width=12)
(actual time=7.279..110.578 rows=5005 loops=1)
                     Hash Cond: ("outer".player_id = "inner".player_id)
                     ->  Nested Loop  (cost=0.00..3723.90 rows=5731
width=8) (actual time=0.106..69.528 rows=5005 loops=1)
                           ->  Index Scan using team_pool_id on team t
(cost=0.00..397.90 rows=391 width=4) (actual time=0.058..2.313 rows=313
loops=1)
                                 Index Cond: (pool_id = 21699)
                           ->  Index Scan using team_id_asdas on
team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual
time=0.027..0.122 rows=16 loops=313)
                                 Index Cond: (tp.team_id = "outer".team_id)
                     ->  Hash  (cost=28.48..28.48 rows=948 width=4)
(actual time=7.121..7.121 rows=948 loops=1)
                           ->  Seq Scan on player pl  (cost=0.00..28.48
rows=948 width=4) (actual time=0.042..3.930 rows=948 loops=1)
               ->  Hash  (cost=130.14..130.14 rows=5414 width=8) (actual
time=43.070..43.070 rows=5414 loops=1)
                     ->  Seq Scan on player_updates pu
(cost=0.00..130.14 rows=5414 width=8) (actual time=0.013..22.410
rows=5414 loops=1)
 Total runtime: 1144.059 ms
(17 rows)

---
VS fast query

EXPLAIN ANALYSE
 SELECT tp.team_id,SUM(pl.player_id),SUM(goals) as total_goals
 FROM
 team t JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id = 21699
 JOIN player_updates pu ON pu.player_id = tp.player_id
 JOIN player pl ON pl.player_id = pu.player_id
 WHERE t.pool_id = 21699
 GROUP BY tp.team_id;


                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=270.31..278.80 rows=566 width=12) (actual
time=409.428..410.551 rows=313 loops=1)
   ->  Hash Join  (cost=89.66..266.06 rows=566 width=12) (actual
time=120.341..278.310 rows=40928 loops=1)
         Hash Cond: ("outer".player_id = "inner".player_id)
         ->  Seq Scan on player_updates pu  (cost=0.00..130.14 rows=5414
width=8) (actual time=0.019..16.833 rows=5414 loops=1)
         ->  Hash  (cost=89.48..89.48 rows=73 width=12) (actual
time=120.209..120.209 rows=5005 loops=1)
               ->  Hash Join  (cost=53.16..89.48 rows=73 width=12)
(actual time=78.672..102.434 rows=5005 loops=1)
                     Hash Cond: ("outer".player_id = "inner".player_id)
                     ->  Seq Scan on player pl  (cost=0.00..28.48
rows=948 width=4) (actual time=0.040..3.092 rows=948 loops=1)
                     ->  Hash  (cost=52.97..52.97 rows=73 width=8)
(actual time=78.577..78.577 rows=5005 loops=1)
                           ->  Nested Loop  (cost=0.00..52.97 rows=73
width=8) (actual time=0.100..57.729 rows=5005 loops=1)
                                 ->  Index Scan using team_pool_id on
team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.050..1.591
rows=313 loops=1)
                                       Index Cond: ((pool_id = 21699)
AND (pool_id = 21699))
                                 ->  Index Scan using team_id_asdas on
team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual
time=0.021..0.099 rows=16 loops=313)
                                       Index Cond: (tp.team_id =
"outer".team_id)
 Total runtime: 411.629 ms
(15 rows)

Time: 417.726 ms

I ran ANALYZE on each of the tables, just to be sure, but there were no
changes.



thanks,

-- Andrew

>> The explain analyze:
>>                                 ->  Index Scan using team_pool_id on
>> team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556
>> rows=313 loops=1)
>>                                       Index Cond: ((pool_id = 21699)
>> AND (pool_id = 21699))
>
>
> Here it sees two conditions, and PG is expecting that will restrict
> things further. It's clearly not expecting the same condition twice,
> but is presumably treating it something like "a>-4 AND a<4". This
> actually makes the row estimate wrong.
>
> So - if PG is getting this bit wrong why is query 2 faster?
> Looking at the row estimates for case 1 we can see that the final
> cases are badly wrong.
>
> > GroupAggregate  (cost=8742.52..9853.85 rows=44453 width=13) (actual
> > time=1186.973..1432.548 rows=626 loops=1)
>
> Here it's got the number of rows badly wrong - if you actually ended
> up with 44,000 rows then maybe this would be the best plan. This is
> the last step though, so what happens before?
>
> >   ->  Sort  (cost=8742.52..8853.66 rows=44453 width=13) (actual
> > time=1186.237..1309.562 rows=40928 loops=1)
> >         Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)
>
> We're about right here. The problem must be in the last step, and
> looking at the row estimates it seems to think that no summarising is
> actually going to take place.
>
> I think the "position = 'G'" bit is confusing it. Well, we can test that:
>
> 1. Just group by "pl.position" - remove the test. Do the estimated and
> actual rows match now?
>
> 2. Rewrite the position='G' as a CASE structure
>
> 3. Do the position='G' part in a sub-query and join to that.
>
> The results of those should show whether this is actually the problem.
> --
>   Richard Huxton
>   Archonet Ltd

--
_____________________________________________
G l o b a l   D i a g n o s t i c s    I n c.
Andrew Schmidt            t.416-304-0049 x206
aschmidt@lifescale.com         f.866-697-8726
_____________________________________________