Thread: equivalent queries lead to different query plans for self-joins with group by?

equivalent queries lead to different query plans for self-joins with group by?

From

Ben

Date:

11 November 2010, 20:53:11

dear pgers --

consider the following toy example (using pg 8.4.3) :

create temporary table foo (
  ts timestamp not null,
  id integer not null,
  val double precision not null,
  primary key (ts, id)
);

i might want to return the vals, minus the averages at each timestamp.  the obvious self-join results in a sequential
scanover foo -- we aggregate the average val for EVERY timestamp, then join against the timestamps we want. 

us_quotedb=# explain select ts, id, val - aval from foo join (select ts, avg(val) as aval from foo group by ts) as a
using(ts) where ts > '2010-11-11' and ts < '2010-11-13';
                                   
                                                                        QUERY PLAN

                                 

-----------------------------------------------------------------------------------------------------------------------------------------------------------
                                                                                                                      
 Hash Join  (cost=49.06..54.41 rows=8 width=28)

                                  
   Hash Cond: (pg_temp_2.foo.ts = pg_temp_2.foo.ts)

                                  
   ->  HashAggregate  (cost=34.45..36.95 rows=200 width=16)

                                  
         ->  Seq Scan on foo  (cost=0.00..26.30 rows=1630 width=16)

                                  
   ->  Hash  (cost=14.51..14.51 rows=8 width=20)

                                 
         ->  Bitmap Heap Scan on foo  (cost=4.33..14.51 rows=8 width=20)

                                 
               Recheck Cond: ((ts > '2010-11-11 00:00:00'::timestamp without time zone) AND (ts < '2010-11-13
00:00:00'::timestampwithout time zone))
                                           
               ->  Bitmap Index Scan on foo_pkey  (cost=0.00..4.33 rows=8 width=0)

                                 
                     Index Cond: ((ts > '2010-11-11 00:00:00'::timestamp without time zone) AND (ts < '2010-11-13
00:00:00'::timestampwithout time zone))
                                       

on the other hand, if i specify "which" timestamp i'm restricting, it appears to do the right thing:

us_quotedb=# explain select ts, id, val - aval from foo join (select ts, avg(val) as aval from foo group by ts) as a
using(ts) where a.ts > '2010-11-11' and a.ts < '2010-11-13';
                                   
                                                                        QUERY PLAN

                                 

-----------------------------------------------------------------------------------------------------------------------------------------------------------
                                                                                                                      
 Nested Loop  (cost=18.86..29.14 rows=8 width=28)

                                  
   ->  HashAggregate  (cost=14.55..14.56 rows=1 width=16)

                                  
         ->  Bitmap Heap Scan on foo  (cost=4.33..14.51 rows=8 width=16)

                                 
               Recheck Cond: ((ts > '2010-11-11 00:00:00'::timestamp without time zone) AND (ts < '2010-11-13
00:00:00'::timestampwithout time zone))
                                           
               ->  Bitmap Index Scan on foo_pkey  (cost=0.00..4.33 rows=8 width=0)

                                 
                     Index Cond: ((ts > '2010-11-11 00:00:00'::timestamp without time zone) AND (ts < '2010-11-13
00:00:00'::timestampwithout time zone))
                                       
   ->  Bitmap Heap Scan on foo  (cost=4.31..14.45 rows=8 width=20)

                                 
         Recheck Cond: (pg_temp_2.foo.ts = pg_temp_2.foo.ts)

                                 
         ->  Bitmap Index Scan on foo_pkey  (cost=0.00..4.31 rows=8 width=0)

                                 
               Index Cond: (pg_temp_2.foo.ts = pg_temp_2.foo.ts)

                                 

i find this behavior curious.  my understanding is that both queries are equivalent, and i would expect that the query
plannerwould be able to choose either of those plans.  this is important -- with the real data i'm working with, the
tableis very large, and the sequential scan is a killer.  

are these queries equivalent, or am i mistaken?  if the planner distinguishes between these plans, how do i ensure that
whereclause restrictions propagate (correctly) to subqueries? 

best regards, ben

Re: equivalent queries lead to different query plans for self-joins with group by?

From

Tom Lane

Date:

11 November 2010, 21:37:58

Ben <midfield@gmail.com> writes:
> us_quotedb=# explain select ts, id, val - aval from foo join (select ts, avg(val) as aval from foo group by ts) as a
using(ts) where ts > '2010-11-11' and ts < '2010-11-13';

> on the other hand, if i specify "which" timestamp i'm restricting, it appears to do the right thing:

> us_quotedb=# explain select ts, id, val - aval from foo join (select ts, avg(val) as aval from foo group by ts) as a
using(ts) where a.ts > '2010-11-11' and a.ts < '2010-11-13';

Well, arguably it's not doing the right thing either way --- you'd sort
of like the inequalities to get pushed down into both of the join
inputs, not just one of them.  PG doesn't make that deduction though;
it can make such inferences for equalities, but inequalities are not
optimized as much.

The case where you have "using (ts) where ts > ..." is, I believe,
interpreted as though you'd specified the left-hand join input, ie
"using (ts) where foo.ts > ...".  So the range condition is pushed
into the foo scan, which doesn't help the avg() subquery.  When you
specify the restriction against "a", it's pushed into the subquery
where it's more useful.

You could try "where foo.ts > ... and a.ts > ..." but not sure if it's
really worth the trouble here, at least not if those rowcount estimates
are anywhere near accurate.  If you were doing the join across quite a
lot of rows, having the constraints in both subqueries would be useful.

            regards, tom lane

Re: equivalent queries lead to different query plans for self-joins with group by?

From

Ben

Date:

11 November 2010, 22:56:44

appreciate the instant response.

> Well, arguably it's not doing the right thing either way --- you'd sort
> of like the inequalities to get pushed down into both of the join
> inputs, not just one of them.  PG doesn't make that deduction though;
> it can make such inferences for equalities, but inequalities are not
> optimized as much.

in my work i have replaced the query with a sql function + window :

create or replace function bar(timestamp, timestamp) returns setof foo
language 'sql' as $$
  select ts,
         id,
         val -
         (avg(val) over (partition by ts)) as val
  from foo
  where ts > $1
  and ts < $2
$$;

i was forced to use a sql function as opposed to a view because the query planner was unable to push down restrictions
onts inside the view subquery, which i've manually done in the function.  indeed, 

explain select ts, id, val - (avg(val) over (partition by ts)) as val from foo where ts > '2009-10-20' and ts <
'2009-10-21';

and

explain select * from (select ts, id, val - (avg(val) over (partition by ts)) as val from foo) as f where ts >
'2009-10-20'and ts < '2009-10-21'; 

give different answers, despite being equivalent, but i understand it is hard to push things into subqueries in
general. in this case it is only legal because we partition by ts. 

thanks again for the explanations!

best, ben