Re: Relids in upper relations - Mailing list pgsql-hackers
From | David Rowley |
---|---|
Subject | Re: Relids in upper relations |
Date | |
Msg-id | CAKJS1f_=1jBrc+DzAomNdDvwvv+SOHBqee4xB6m4OugSrHtDQA@mail.gmail.com |
In response to | Re: Relids in upper relations (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On 10 October 2016 at 11:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> SELECT order_line.order_id, order.customer_id, SUM(order_line.amount)
>> FROM order_line, order WHERE order_line.order_id = order.order_id
>> GROUP BY 1,2;
>
>> Doing the aggregation step first is likely to be much faster than
>> doing the join first here,
>
> Please provide some reason to believe that.  It's the nature of an
> aggregate that it's sensitive to the number of rows going through it,
> with only a tiny number of exceptions (and SUM ain't one).  So you could
> only push it down past joins that won't change the number of rows the
> aggregate will process, and how is that going to make it any faster?

I think there's a flaw in Robert's example, because the GROUP BY has
columns from both sides of the join; however, it's very simple to come up
with a real-world case that aggregate push-down can improve. We can
simulate the aggregate push-down with a simple subquery.

create table sale (sale_id int primary key, product_id int not null, quantity int not null);
create table product (product_id int primary key, productdesc varchar(32) not null);
insert into sale select x, x%100+1, (random()*10)::int+1 from generate_series(1,10000000) x(x);
insert into product select x, 'ABC'||x::text from generate_series(1,100) x(x);

postgres=# explain (analyze, timing off) select p.product_id,p.productdesc,sum(s.quantity) qty from sale s inner join product p on s.product_id = p.product_id group by p.product_id;
                                                  QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=341558.25..341559.25 rows=100 width=17) (actual rows=100 loops=1)
   Group Key: p.product_id
   ->  Hash Join  (cost=3.25..291558.25 rows=10000000 width=13) (actual rows=10000000 loops=1)
         Hash Cond: (s.product_id = p.product_id)
         ->  Seq Scan on sale s  (cost=0.00..154055.00 rows=10000000 width=8) (actual rows=10000000 loops=1)
         ->  Hash  (cost=2.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 13kB
               ->  Seq Scan on product p  (cost=0.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
 Planning time: 0.308 ms
 Execution time: 7568.927 ms
(10 rows)

Time: 7569.842 ms (00:07.570)

postgres=# explain (analyze, timing off) select p.product_id,p.productdesc,s.qty from (select product_id,sum(quantity) qty from sale group by product_id) s inner join product p on s.product_id = p.product_id;
                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Hash Join  (cost=204058.25..204061.63 rows=100 width=17) (actual rows=100 loops=1)
   Hash Cond: (sale.product_id = p.product_id)
   ->  HashAggregate  (cost=204055.00..204056.00 rows=100 width=12) (actual rows=100 loops=1)
         Group Key: sale.product_id
         ->  Seq Scan on sale  (cost=0.00..154055.00 rows=10000000 width=8) (actual rows=10000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  Seq Scan on product p  (cost=0.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
 Planning time: 0.610 ms
 Execution time: 4267.145 ms
(10 rows)

So the pushed-down version runs in 56% of the time of the non-pushed-down version.
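For completeness, the same manual rewrite can also be applied to Robert's original query, as long as the grouping column pulled from the other side of the join is functionally dependent on the join key. A minimal sketch, assuming order_id is the primary key of the order table (the thread doesn't spell out that schema), and quoting "order" since it's a reserved word:

-- Hand-simulated aggregate push-down for Robert's query (hypothetical
-- order / order_line schema). Because "order".order_id is assumed unique,
-- customer_id is functionally dependent on it, so no re-aggregation is
-- needed above the join.
SELECT ol.order_id, o.customer_id, ol.amount_sum
FROM (SELECT order_id, SUM(amount) AS amount_sum
        FROM order_line
       GROUP BY order_id) ol
INNER JOIN "order" o ON ol.order_id = o.order_id;

Where that assumption doesn't hold, i.e. the join can duplicate or drop rows of a group, a second aggregation above the join would be needed to merge the partial sums.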
Of course, as mentioned by Robert, both versions would have to be costed, in case the join condition was highly selective.

There's a good paper which goes into a bit more detail on this: http://www.vldb.org/conf/1995/P345.PDF

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
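A sketch of that costing caveat, reusing the same sale/product tables; the predicate and the index mentioned in the comments are assumptions for illustration, not part of the original test:

-- Highly selective join: only one product row survives the WHERE clause.
-- The eagerly aggregated form below still hash-aggregates all 10M sale
-- rows and then throws away 99 of its 100 groups, whereas a join-first
-- plan could (for example, with an index on sale.product_id) touch only
-- about 1% of sale before aggregating. Neither shape wins unconditionally,
-- which is why both would have to be costed.
SELECT p.product_id, p.productdesc, s.qty
FROM (SELECT product_id, SUM(quantity) AS qty
        FROM sale
       GROUP BY product_id) s
INNER JOIN product p ON s.product_id = p.product_id
WHERE p.productdesc = 'ABC1';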