Two slightly different queries produce same results, one takes 1/4 the time.. bug in planner? - Mailing list pgsql-general

From Andrew Schmidt
Subject Two slightly different queries produce same results, one takes 1/4 the time.. bug in planner?
Date
Msg-id 4373C4F4.8060909@lifescale.com
Whole thread Raw
Responses Re: Two slightly different queries produce same results,  (Richard Huxton <dev@archonet.com>)
List pgsql-general
Summary:

Two queries that are nearly identical AND use the exact same indices
take much different times in execution.   Both queries produce the exact
same results.  One takes 1072ms to execute, the other 262ms.  Although I
have a fix, it seems more of a hack, and I'd also like to know why it's
doing it.

Details:

I run a hockey pool website that uses mysql 4x with innodb.  This week
I've been trying out postgresql with the site to check out the
performance difference.   Standard select statements seem to be about
twice as fast with postgres as opposed to innodb.   I really wasn't
expecting this, but was impressed none-the-less.  I run on FreeBSD 5.4.

However, I've run into a problem where one query took about twice as
long as innodb.  Some investigation and playing around with the query, I
ended up with a slightly different query but was about 3 times as fast
as innodb (and 5 times faster than the original query).  I didn't add
any indices and the EXPLAIN ANALYZE doesn't show any new indices being
used with the new query.  The order of which tables are read first was
changed in the query planner and also it uses HashAggregate vs
GroupAggregate.. not entirely sure what that means.

Table structure:

team - list of teams for each pool,
team_players - list of player_id's for each team
player - list of nhl players
player_updates - for each nhl player the stats on each day they played
(ie, crosby on oct 10, got 1 goal and 2 assists)

There are other tables but these are the ones pertinent to my problem.

So a query to find the total goals per team for an entire pool is fairly
straightforward.  Due to the fact different points are awarded if the
nhl player is a goalie, I group by team_id / is_goalie.


SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS
total_goals
 FROM
 team t JOIN team_players tp ON tp.team_id = t.team_id
 JOIN player_updates pu ON pu.player_id = tp.player_id
 JOIN player pl ON pl.player_id = pu.player_id
 WHERE t.pool_id = 21699
 GROUP BY tp.team_id,is_goalie;

Explain analyze of this query:


QUERY
PLAN


-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=8742.52..9853.85 rows=44453 width=13) (actual
time=1186.973..1432.548 rows=626 loops=1)
   ->  Sort  (cost=8742.52..8853.66 rows=44453 width=13) (actual
time=1186.237..1309.562 rows=40928 loops=1)
         Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)
         ->  Hash Join  (cost=174.52..4812.28 rows=44453 width=13)
(actual time=51.099..400.119 rows=40928 loops=1)
               Hash Cond: ("outer".player_id = "inner".player_id)
               ->  Hash Join  (cost=30.85..3840.72 rows=5731 width=17)
(actual time=8.303..115.834 rows=5005 loops=1)
                     Hash Cond: ("outer".player_id = "inner".player_id)
                     ->  Nested Loop  (cost=0.00..3723.90 rows=5731
width=8) (actual time=0.108..71.378 rows=5005 loops=1)
                           ->  Index Scan using team_pool_id on team t
(cost=0.00..397.90 rows=391 width=4) (actual time=0.061..2.129 rows=313
loops=1)
                                 Index Cond: (pool_id = 21699)
                           ->  Index Scan using team_id_asdas on
team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual
time=0.029..0.132 rows=16 loops=313)
                                 Index Cond: (tp.team_id = "outer".team_id)
                     ->  Hash  (cost=28.48..28.48 rows=948 width=9)
(actual time=8.140..8.140 rows=948 loops=1)
                           ->  Seq Scan on player pl  (cost=0.00..28.48
rows=948 width=9) (actual time=0.043..4.470 rows=948 loops=1)
               ->  Hash  (cost=130.14..130.14 rows=5414 width=8) (actual
time=42.690..42.690 rows=5414 loops=1)
                     ->  Seq Scan on player_updates pu
(cost=0.00..130.14 rows=5414 width=8) (actual time=0.013..21.410
rows=5414 loops=1)
 Total runtime: 1436.617 ms
(17 rows)

Time: 1443.127 ms



I tried out enforcing the planner to use the indices on player_updates
and player, but the query took longer.

Now the slight modification where I put the pool_id in the JOIN of team
and team_players (but also keep the check in the where clause):

SELECT tp.team_id,pl.position = 'G' AS is_goalie,SUM(goals) AS total_goals
 FROM
 team t JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id =
21699  <--- ** added pool_id
 JOIN player_updates pu ON pu.player_id = tp.player_id
 JOIN player pl ON pl.player_id = pu.player_id
 WHERE t.pool_id = 21699
 GROUP BY tp.team_id,is_goalie;

The explain analyze:


QUERY
PLAN


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=271.72..280.21 rows=566 width=13) (actual
time=478.783..481.127 rows=626 loops=1)
   ->  Hash Join  (cost=89.66..267.48 rows=566 width=13) (actual
time=124.015..328.387 rows=40928 loops=1)
         Hash Cond: ("outer".player_id = "inner".player_id)
         ->  Seq Scan on player_updates pu  (cost=0.00..130.14 rows=5414
width=8) (actual time=0.021..16.934 rows=5414 loops=1)
         ->  Hash  (cost=89.48..89.48 rows=73 width=17) (actual
time=123.872..123.872 rows=5005 loops=1)
               ->  Hash Join  (cost=53.16..89.48 rows=73 width=17)
(actual time=79.843..104.031 rows=5005 loops=1)
                     Hash Cond: ("outer".player_id = "inner".player_id)
                     ->  Seq Scan on player pl  (cost=0.00..28.48
rows=948 width=9) (actual time=0.040..3.103 rows=948 loops=1)
                     ->  Hash  (cost=52.97..52.97 rows=73 width=8)
(actual time=79.740..79.740 rows=5005 loops=1)
                           ->  Nested Loop  (cost=0.00..52.97 rows=73
width=8) (actual time=0.098..58.145 rows=5005 loops=1)
                                 ->  Index Scan using team_pool_id on
team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556
rows=313 loops=1)
                                       Index Cond: ((pool_id = 21699)
AND (pool_id = 21699))
                                 ->  Index Scan using team_id_asdas on
team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual
time=0.021..0.099 rows=16 loops=313)
                                       Index Cond: (tp.team_id =
"outer".team_id)
 Total runtime: 482.791 ms
(15 rows)

Time: 489.270 ms


So there are no new indices being used, but the order in which the
tables are read makes it much faster.  Also if the pool_id check is
removed from the WHERE clause (so it'd only be in the join clause),  it
goes back to the slower query planner.
Originally I thought the slowdown was because it wasn't using the
player_id index on player and player_updates, but that doesn't seem to
be the case.

Both queries return the exact same result (626 rows, same stats)

Conclusion:

Is this a bug or my misunderstanding?   If it's a bug I can submit a
more detailed step by step email.  If it's just my misunderstanding,
then can someone explain why?  I've always been under the impression the
join should only contain the fields pertinent to the join.  Also does it
need the pool_id again in the where clause when it's in the join clause.

thanks a bunch,

-- Andrew

I apologize if the formatting gets messed up.



pgsql-general by date:

Previous
From: "Uwe C. Schroeder"
Date:
Subject: Re: Where
Next
From: Bob Pawley
Date:
Subject: Re: Where