Thread: equivalent queries lead to different query plans for self-joins with group by?
dear pgers -- consider the following toy example (using pg 8.4.3) : create temporary table foo ( ts timestamp not null, id integer not null, val double precision not null, primary key (ts, id) ); i might want to return the vals, minus the averages at each timestamp. the obvious self-join results in a sequential scanover foo -- we aggregate the average val for EVERY timestamp, then join against the timestamps we want. us_quotedb=# explain select ts, id, val - aval from foo join (select ts, avg(val) as aval from foo group by ts) as a using(ts) where ts > '2010-11-11' and ts < '2010-11-13'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Hash Join (cost=49.06..54.41 rows=8 width=28) Hash Cond: (pg_temp_2.foo.ts = pg_temp_2.foo.ts) -> HashAggregate (cost=34.45..36.95 rows=200 width=16) -> Seq Scan on foo (cost=0.00..26.30 rows=1630 width=16) -> Hash (cost=14.51..14.51 rows=8 width=20) -> Bitmap Heap Scan on foo (cost=4.33..14.51 rows=8 width=20) Recheck Cond: ((ts > '2010-11-11 00:00:00'::timestamp without time zone) AND (ts < '2010-11-13 00:00:00'::timestampwithout time zone)) -> Bitmap Index Scan on foo_pkey (cost=0.00..4.33 rows=8 width=0) Index Cond: ((ts > '2010-11-11 00:00:00'::timestamp without time zone) AND (ts < '2010-11-13 00:00:00'::timestampwithout time zone)) on the other hand, if i specify "which" timestamp i'm restricting, it appears to do the right thing: us_quotedb=# explain select ts, id, val - aval from foo join (select ts, avg(val) as aval from foo group by ts) as a using(ts) where a.ts > '2010-11-11' and a.ts < '2010-11-13'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=18.86..29.14 rows=8 width=28) -> HashAggregate (cost=14.55..14.56 rows=1 width=16) -> Bitmap Heap Scan on foo (cost=4.33..14.51 rows=8 width=16) Recheck Cond: ((ts > '2010-11-11 00:00:00'::timestamp without time zone) AND (ts < '2010-11-13 00:00:00'::timestampwithout time zone)) -> Bitmap Index Scan on foo_pkey (cost=0.00..4.33 rows=8 width=0) Index Cond: ((ts > '2010-11-11 00:00:00'::timestamp without time zone) AND (ts < '2010-11-13 00:00:00'::timestampwithout time zone)) -> Bitmap Heap Scan on foo (cost=4.31..14.45 rows=8 width=20) Recheck Cond: (pg_temp_2.foo.ts = pg_temp_2.foo.ts) -> Bitmap Index Scan on foo_pkey (cost=0.00..4.31 rows=8 width=0) Index Cond: (pg_temp_2.foo.ts = pg_temp_2.foo.ts) i find this behavior curious. my understanding is that both queries are equivalent, and i would expect that the query plannerwould be able to choose either of those plans. this is important -- with the real data i'm working with, the tableis very large, and the sequential scan is a killer. are these queries equivalent, or am i mistaken? if the planner distinguishes between these plans, how do i ensure that whereclause restrictions propagate (correctly) to subqueries? best regards, ben
Re: equivalent queries lead to different query plans for self-joins with group by?
From
Tom Lane
Date:
Ben <midfield@gmail.com> writes: > us_quotedb=# explain select ts, id, val - aval from foo join (select ts, avg(val) as aval from foo group by ts) as a using(ts) where ts > '2010-11-11' and ts < '2010-11-13'; > on the other hand, if i specify "which" timestamp i'm restricting, it appears to do the right thing: > us_quotedb=# explain select ts, id, val - aval from foo join (select ts, avg(val) as aval from foo group by ts) as a using(ts) where a.ts > '2010-11-11' and a.ts < '2010-11-13'; Well, arguably it's not doing the right thing either way --- you'd sort of like the inequalities to get pushed down into both of the join inputs, not just one of them. PG doesn't make that deduction though; it can make such inferences for equalities, but inequalities are not optimized as much. The case where you have "using (ts) where ts > ..." is, I believe, interpreted as though you'd specified the left-hand join input, ie "using (ts) where foo.ts > ...". So the range condition is pushed into the foo scan, which doesn't help the avg() subquery. When you specify the restriction against "a", it's pushed into the subquery where it's more useful. You could try "where foo.ts > ... and a.ts > ..." but not sure if it's really worth the trouble here, at least not if those rowcount estimates are anywhere near accurate. If you were doing the join across quite a lot of rows, having the constraints in both subqueries would be useful. regards, tom lane
appreciate the instant response. > Well, arguably it's not doing the right thing either way --- you'd sort > of like the inequalities to get pushed down into both of the join > inputs, not just one of them. PG doesn't make that deduction though; > it can make such inferences for equalities, but inequalities are not > optimized as much. in my work i have replaced the query with a sql function + window : create or replace function bar(timestamp, timestamp) returns setof foo language 'sql' as $$ select ts, id, val - (avg(val) over (partition by ts)) as val from foo where ts > $1 and ts < $2 $$; i was forced to use a sql function as opposed to a view because the query planner was unable to push down restrictions onts inside the view subquery, which i've manually done in the function. indeed, explain select ts, id, val - (avg(val) over (partition by ts)) as val from foo where ts > '2009-10-20' and ts < '2009-10-21'; and explain select * from (select ts, id, val - (avg(val) over (partition by ts)) as val from foo) as f where ts > '2009-10-20'and ts < '2009-10-21'; give different answers, despite being equivalent, but i understand it is hard to push things into subqueries in general. in this case it is only legal because we partition by ts. thanks again for the explanations! best, ben