Re: Two slightly different queries produce same results, - Mailing list pgsql-general

From Andrew Schmidt
Subject Re: Two slightly different queries produce same results,
Date
Msg-id 4374B54A.4090407@lifescale.com
Whole thread Raw
In response to Re: Two slightly different queries produce same results,  (Richard Huxton <dev@archonet.com>)
List pgsql-general
Hi Richard,

While in the process of responding to your email I found out what was
doing it.   It was the grouping of the team_players team_id instead of
the team team_id.

Here's some table numbers:

team => 31720 rows,
team_players => 464896 rows,
player => 948 rows
player_updates => 5414 rows

The query:
 SELECT tp.team_id,SUM(pl.player_id),SUM(goals) as total_goals
 FROM
 team t JOIN team_players tp ON tp.team_id = t.team_id
 JOIN player_updates pu ON pu.player_id = tp.player_id
 JOIN player pl ON pl.player_id = pu.player_id
 WHERE t.pool_id = 21699
 GROUP BY tp.team_id;


It's grouping by the team_players tp.team_id,    change the group to
t.team_id and bam! 5x faster.

thanks for the help,

-- Andrew


*******
This was my response to you Richard before I found out what it was.
*******
I removed position from the query, but still got the same query times.
I summed up the pl.player_id just to make sure it was reading the player
table.


EXPLAIN ANALYSE
 SELECT tp.team_id,SUM(pl.player_id),SUM(goals) as total_goals
 FROM
 team t JOIN team_players tp ON tp.team_id = t.team_id
 JOIN player_updates pu ON pu.player_id = tp.player_id
 JOIN player pl ON pl.player_id = pu.player_id
 WHERE t.pool_id = 21699
 GROUP BY tp.team_id;

QUERY
PLAN


-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=8584.45..9364.83 rows=22390 width=12) (actual
time=894.364..1140.884 rows=313 loops=1)
   ->  Sort  (cost=8584.45..8695.59 rows=44453 width=12) (actual
time=893.373..1013.989 rows=40928 loops=1)
         Sort Key: tp.team_id
         ->  Hash Join  (cost=174.52..4701.14 rows=44453 width=12)
(actual time=50.433..328.788 rows=40928 loops=1)
               Hash Cond: ("outer".player_id = "inner".player_id)
               ->  Hash Join  (cost=30.85..3840.72 rows=5731 width=12)
(actual time=7.279..110.578 rows=5005 loops=1)
                     Hash Cond: ("outer".player_id = "inner".player_id)
                     ->  Nested Loop  (cost=0.00..3723.90 rows=5731
width=8) (actual time=0.106..69.528 rows=5005 loops=1)
                           ->  Index Scan using team_pool_id on team t
(cost=0.00..397.90 rows=391 width=4) (actual time=0.058..2.313 rows=313
loops=1)
                                 Index Cond: (pool_id = 21699)
                           ->  Index Scan using team_id_asdas on
team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual
time=0.027..0.122 rows=16 loops=313)
                                 Index Cond: (tp.team_id = "outer".team_id)
                     ->  Hash  (cost=28.48..28.48 rows=948 width=4)
(actual time=7.121..7.121 rows=948 loops=1)
                           ->  Seq Scan on player pl  (cost=0.00..28.48
rows=948 width=4) (actual time=0.042..3.930 rows=948 loops=1)
               ->  Hash  (cost=130.14..130.14 rows=5414 width=8) (actual
time=43.070..43.070 rows=5414 loops=1)
                     ->  Seq Scan on player_updates pu
(cost=0.00..130.14 rows=5414 width=8) (actual time=0.013..22.410
rows=5414 loops=1)
 Total runtime: 1144.059 ms
(17 rows)

---
VS fast query

EXPLAIN ANALYSE
 SELECT tp.team_id,SUM(pl.player_id),SUM(goals) as total_goals
 FROM
 team t JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id = 21699
 JOIN player_updates pu ON pu.player_id = tp.player_id
 JOIN player pl ON pl.player_id = pu.player_id
 WHERE t.pool_id = 21699
 GROUP BY tp.team_id;


QUERY
PLAN


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=270.31..278.80 rows=566 width=12) (actual
time=409.428..410.551 rows=313 loops=1)
   ->  Hash Join  (cost=89.66..266.06 rows=566 width=12) (actual
time=120.341..278.310 rows=40928 loops=1)
         Hash Cond: ("outer".player_id = "inner".player_id)
         ->  Seq Scan on player_updates pu  (cost=0.00..130.14 rows=5414
width=8) (actual time=0.019..16.833 rows=5414 loops=1)
         ->  Hash  (cost=89.48..89.48 rows=73 width=12) (actual
time=120.209..120.209 rows=5005 loops=1)
               ->  Hash Join  (cost=53.16..89.48 rows=73 width=12)
(actual time=78.672..102.434 rows=5005 loops=1)
                     Hash Cond: ("outer".player_id = "inner".player_id)
                     ->  Seq Scan on player pl  (cost=0.00..28.48
rows=948 width=4) (actual time=0.040..3.092 rows=948 loops=1)
                     ->  Hash  (cost=52.97..52.97 rows=73 width=8)
(actual time=78.577..78.577 rows=5005 loops=1)
                           ->  Nested Loop  (cost=0.00..52.97 rows=73
width=8) (actual time=0.100..57.729 rows=5005 loops=1)
                                 ->  Index Scan using team_pool_id on
team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.050..1.591
rows=313 loops=1)
                                       Index Cond: ((pool_id = 21699)
AND (pool_id = 21699))
                                 ->  Index Scan using team_id_asdas on
team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual
time=0.021..0.099 rows=16 loops=313)
                                       Index Cond: (tp.team_id =
"outer".team_id)
 Total runtime: 411.629 ms
(15 rows)

Time: 417.726 ms

I ran ANALYZE on each of the tables, just to be sure,  but there was no
changes.

Also, here's some table info:

team => 31720 rows,
team_players => 464896 rows,
player => 948 rows
player_updates => 5414 rows


thanks,

-- Andrew

>> The explain analyze:
>>                                 ->  Index Scan using team_pool_id on
>> team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556
>> rows=313 loops=1)
>>                                       Index Cond: ((pool_id = 21699)
>> AND (pool_id = 21699))
>
>
> Here it sees two conditions, and PG is expecting that will restrict
> things further. It's clearly not expecting the same condition twice,
> but is presumably treating it something like "a>-4 AND a<4". This
> actually makes the row estimate wrong.
>
> So - if PG is getting this bit wrong why is query 2 faster?
> Looking at the row estimates for case 1 we can see that the final
> cases are badly wrong.
>
> > GroupAggregate  (cost=8742.52..9853.85 rows=44453 width=13) (actual
> > time=1186.973..1432.548 rows=626 loops=1)
>
> Here it's got the number of rows badly wrong - if you actually ended
> up with 44,000 rows then maybe this would be the best plan. This is
> the last step though, so what happens before?
>
> >   ->  Sort  (cost=8742.52..8853.66 rows=44453 width=13) (actual
> > time=1186.237..1309.562 rows=40928 loops=1)
> >         Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)
>
> We're about right here. The problem must be in the last step, and
> looking at the row estimates it seems to think that no summarising is
> actually going to take place.
>
> I think the "position = 'G'" bit is confusing it. Well, we can test that:
>
> 1. Just group by "pl.position" - remove the test. Do the estimated and
> actual rows match now?
>
> 2. Rewrite the position='G' as a CASE structure
>
> 3. Do the position='G' part in a sub-query and join to that.
>



> The results of those should show whether this is actually the problem.
> --
>   Richard Huxton
>   Archonet Ltd
>
> .
>

--
_____________________________________________
G l o b a l   D i a g n o s t i c s    I n c.
Andrew Schmidt            t.416-304-0049 x206
aschmidt@lifescale.com         f.866-697-8726
_____________________________________________


pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: return next
Next
From: Jaime Casanova
Date:
Subject: Re: Best way to use indexes for partial match at