Re: Two slightly different queries produce same results, - Mailing list pgsql-general

From Richard Huxton
Subject Re: Two slightly different queries produce same results,
Date
Msg-id 43745EE1.1010208@archonet.com
Whole thread Raw
In response to Two slightly different queries produce same results, one takes 1/4 the time.. bug in planner?  (Andrew Schmidt <andrew@lifescale.com>)
Responses Re: Two slightly different queries produce same results,  (Andrew Schmidt <andrew@lifescale.com>)
List pgsql-general
Andrew Schmidt wrote:
> However, I've run into a problem where one query took about twice as
> long as innodb.  Some investigation and playing around with the query, I
> ended up with a slightly different query but was about 3 times as fast
> as innodb (and 5 times faster than the original query).  I didn't add
> any indices and the EXPLAIN ANALYZE doesn't show any new indices being
> used with the new query.  The order of which tables are read first was
> changed in the query planner and also it uses HashAggregate vs
> GroupAggregate.. not entirely sure what that means.

It's a different method of aggregating data (calculating your sum()).
The HashAggregate puts values into buckets and works best when you have
a fairly small number of different values (so each value can get its own
bucket). I think - bear in mind I'm just another user, so I'm not
familiar with the internals.

The different plans are because PG thinks it will get different numbers
of rows...

> SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS
> total_goals
> FROM
> team t JOIN team_players tp ON tp.team_id = t.team_id
> JOIN player_updates pu ON pu.player_id = tp.player_id
> JOIN player pl ON pl.player_id = pu.player_id
> WHERE t.pool_id = 21699
> GROUP BY tp.team_id,is_goalie;

> Explain analyze of this query:

>                           ->  Index Scan using team_pool_id on team t
> (cost=0.00..397.90 rows=391 width=4) (actual time=0.061..2.129 rows=313
> loops=1)
>                                 Index Cond: (pool_id = 21699)

With this index scan you have one condition and PG expects 391 matching
rows (it gets 313, so that's a pretty good estimate).

> SELECT tp.team_id,pl.position = 'G' AS is_goalie,SUM(goals) AS total_goals
> FROM
> team t JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id =
> 21699  <--- ** added pool_id
> JOIN player_updates pu ON pu.player_id = tp.player_id
> JOIN player pl ON pl.player_id = pu.player_id
> WHERE t.pool_id = 21699
> GROUP BY tp.team_id,is_goalie;
>
> The explain analyze:
>                                 ->  Index Scan using team_pool_id on
> team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556
> rows=313 loops=1)
>                                       Index Cond: ((pool_id = 21699) AND
> (pool_id = 21699))

Here it sees two conditions, and PG is expecting that will restrict
things further. It's clearly not expecting the same condition twice, but
is presumably treating it something like "a>-4 AND a<4". This actually
makes the row estimate wrong.

So - if PG is getting this bit wrong why is query 2 faster?
Looking at the row estimates for case 1 we can see that the final cases
are badly wrong.

 > GroupAggregate  (cost=8742.52..9853.85 rows=44453 width=13) (actual
 > time=1186.973..1432.548 rows=626 loops=1)

Here it's got the number of rows badly wrong - if you actually ended up
with 44,000 rows then maybe this would be the best plan. This is the
last step though, so what happens before?

 >   ->  Sort  (cost=8742.52..8853.66 rows=44453 width=13) (actual
 > time=1186.237..1309.562 rows=40928 loops=1)
 >         Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)

We're about right here. The problem must be in the last step, and
looking at the row estimates it seems to think that no summarising is
actually going to take place.

I think the "position = 'G'" bit is confusing it. Well, we can test that:

1. Just group by "pl.position" - remove the test. Do the estimated and
actual rows match now?

2. Rewrite the position='G' as a CASE structure

3. Do the position='G' part in a sub-query and join to that.

The results of those should show whether this is actually the problem.
--
   Richard Huxton
   Archonet Ltd

pgsql-general by date:

Previous
From: Guido Neitzer
Date:
Subject: Re: Ordering and unicode
Next
From: "Andrus"
Date:
Subject: Re: Best way to use indexes for partial match at