Re: Relids in upper relations - Mailing list pgsql-hackers

From David Rowley
Subject Re: Relids in upper relations
Date
Msg-id CAKJS1f_=1jBrc+DzAomNdDvwvv+SOHBqee4xB6m4OugSrHtDQA@mail.gmail.com
Whole thread Raw
In response to Re: Relids in upper relations  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 10 October 2016 at 11:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> SELECT order_line.order_id, order.customer_id, SUM(order_line.amount)
>> FROM order_line, order WHERE order_line.order_id = order.order_id
>> GROUP BY 1,2;
>
>> Doing the aggregation step first is likely to be much faster than
>> doing the join first here,
>
> Please provide some reason to believe that.  It's the nature of an
> aggregate that it's sensitive to the number of rows going through it,
> with only a tiny number of exceptions (and SUM ain't one).  So you could
> only push it down past joins that won't change the number of rows the
> aggregate will process, and how is that going to make it any faster?

I think there's a flaw in Robert's example because the GROUP BY has
columns from either side of the joins, however it's very simple to
come up with a real world case which aggregate push downs can improve.
We can simulate the aggregate push down with a simple subquery.

create table sale (sale_id int primary key,product_id int not
null,quantity int not null);
create table product (product_id int primary key,productdesc
varchar(32) not null);

insert into sale select x,x%100+1,(random()*10)::int+1 from
generate_series(1,10000000) x(x);
insert into product select x,'ABC'||x::text from generate_Series(1,100) x(x);

postgres=# explain (analyze, timing off) select
p.product_id,p.productdesc,sum(s.quantity) qty from sale s inner join
product p on s.product_id = p.product_id group by p.product_id;                                                QUERY
PLAN

-------------------------------------------------------------------------------------------------------------HashAggregate
(cost=341558.25..341559.25 rows=100 width=17) (actual
 
rows=100 loops=1)  Group Key: p.product_id  ->  Hash Join  (cost=3.25..291558.25 rows=10000000 width=13)
(actual rows=10000000 loops=1)        Hash Cond: (s.product_id = p.product_id)        ->  Seq Scan on sale s
(cost=0.00..154055.00rows=10000000
 
width=8) (actual rows=10000000 loops=1)        ->  Hash  (cost=2.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
          Buckets: 1024  Batches: 1  Memory Usage: 13kB              ->  Seq Scan on product p  (cost=0.00..2.00
rows=100
width=9) (actual rows=100 loops=1)Planning time: 0.308 msExecution time: 7568.927 ms
(10 rows)

Time: 7569.842 ms (00:07.570)

postgres=# explain (analyze, timing off) select
p.product_id,p.productdesc,s.qty from (select product_id,sum(quantity)
qty from sale group by product_id) s inner join product p on
s.product_id = p.product_id;                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------Hash Join
(cost=204058.25..204061.63rows=100 width=17) (actual
 
rows=100 loops=1)  Hash Cond: (sale.product_id = p.product_id)  ->  HashAggregate  (cost=204055.00..204056.00 rows=100
width=12)
(actual rows=100 loops=1)        Group Key: sale.product_id        ->  Seq Scan on sale  (cost=0.00..154055.00
rows=10000000
width=8) (actual rows=10000000 loops=1)  ->  Hash  (cost=2.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
Buckets:1024  Batches: 1  Memory Usage: 13kB        ->  Seq Scan on product p  (cost=0.00..2.00 rows=100 width=9)
 
(actual rows=100 loops=1)Planning time: 0.610 msExecution time: 4267.145 ms
(10 rows)

So the pushed down version runs in 56% of the time of the non-pushed
down version. Of course, as mentioned by Robert, both versions would
have to be costed in case the join condition was highly selective

There's a good paper which goes into a bit more details on this
http://www.vldb.org/conf/1995/P345.PDF

-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Relids in upper relations
Next
From: Andres Freund
Date:
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf