GROUP BY before JOIN - Mailing list pgsql-hackers

From David Rowley
Subject GROUP BY before JOIN
Date
Msg-id CAKJS1f-sEcm=gTfS-DqjsOcsZ-vLhrP_hSRNtJjq-S7Egn8Rqw@mail.gmail.com
Responses Re: GROUP BY before JOIN  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
List pgsql-hackers
== Overview ==

As of today we always perform GROUP BY at the final stage, after every relation in the query has been joined. This of course works, but it's not always the optimal way of executing the query.

Consider the following two relations:

create table product (product_id int primary key, code varchar(32) not null);
create table sale (sale_id int primary key, product_id int not null, qty int not null);
create index on sale (product_id);

Populated with:

insert into product select x.x,'ABC' || x.x from generate_series(1,100) x(x);
insert into sale select x.x,x.x%100+1, 10 from generate_series(1,1000000) x(x);

Here we have a table with 100 products and another with 1 million sales records.

If we wanted to see the total sales for each product we may write a query such as:

select s.product_id,sum(qty) as qty from sale s inner join product p on s.product_id = p.product_id group by s.product_id;

If we look at the explain for this query we can see that the grouping happened after the join took place:

                    QUERY PLAN
--------------------------------------------------
 HashAggregate
   Group Key: s.product_id
   ->  Hash Join
         Hash Cond: (s.product_id = p.product_id)
         ->  Seq Scan on sale s
         ->  Hash
               ->  Seq Scan on product p

The join here would produce exactly 1 million rows, and the HashAggregate would then group those 1 million rows.

If it were possible for PostgreSQL to perform the GROUP BY before the join, we could change that to grouping 1 million rows into 100 groups, then joining just those 100 rows.

Of course we could have written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale group by product_id) s inner join product p on s.product_id = p.product_id;

This does exactly that, and in this particular case it is much faster. But it's not all rainbows and unicorns: if the join happened to filter out many of the groups, it might not be a win at all.
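As a quick sanity check on the two query forms above, here is a minimal sketch that runs both against a scaled-down copy of the example data. It assumes SQLite (via Python's sqlite3 module) as a stand-in for PostgreSQL purely to keep the example self-contained, and it uses 10,000 sale rows instead of 1 million. The variable names are illustrative only.

```python
import sqlite3

# Scaled-down copy of the example schema: 100 products, 10,000 sales.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table product (product_id int primary key, code varchar(32) not null);
    create table sale (sale_id int primary key, product_id int not null, qty int not null);
""")
conn.executemany("insert into product values (?, ?)",
                 [(x, "ABC%d" % x) for x in range(1, 101)])
conn.executemany("insert into sale values (?, ?, 10)",
                 [(x, x % 100 + 1) for x in range(1, 10001)])

# GROUP BY after the join: the aggregate has to consume every joined row.
rows_into_aggregate = conn.execute(
    "select count(*) from sale s inner join product p on s.product_id = p.product_id"
).fetchone()[0]

# GROUP BY before the join: the join only sees one row per group.
rows_into_join = conn.execute(
    "select count(*) from (select product_id from sale group by product_id)"
).fetchone()[0]

group_after_join = conn.execute("""
    select s.product_id, sum(qty) as qty
    from sale s inner join product p on s.product_id = p.product_id
    group by s.product_id
    order by s.product_id
""").fetchall()

group_before_join = conn.execute("""
    select s.product_id, qty
    from (select product_id, sum(qty) as qty from sale group by product_id) s
    inner join product p on s.product_id = p.product_id
    order by s.product_id
""").fetchall()

# Prints: 10000 100 True
print(rows_into_aggregate, rows_into_join, group_after_join == group_before_join)
```

Both forms return the same 100 rows here; the difference is only in how many rows the second step of each plan has to process.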

Consider:

select s.product_id from sale s inner join product p on s.product_id = p.product_id where p.code = 'ABC1' group by s.product_id;

Since product.code has no equivalence member in the sale table, the predicate cannot be applied to the sale table. So in this case, if we'd written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale group by product_id) s inner join product p on s.product_id = p.product_id where p.code = 'ABC1';

We'd have thrown away 99% of the groups created in the subquery.
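The 99% figure can be reproduced with a small sketch. As an assumption for the sake of a self-contained example, it uses SQLite through Python's sqlite3 module rather than PostgreSQL, with 10,000 sale rows instead of 1 million. The names groups_built and groups_kept are illustrative only.

```python
import sqlite3

# Scaled-down copy of the example schema: 100 products, 10,000 sales.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table product (product_id int primary key, code varchar(32) not null);
    create table sale (sale_id int primary key, product_id int not null, qty int not null);
""")
conn.executemany("insert into product values (?, ?)",
                 [(x, "ABC%d" % x) for x in range(1, 101)])
conn.executemany("insert into sale values (?, ?, 10)",
                 [(x, x % 100 + 1) for x in range(1, 10001)])

# Groups the subquery builds before the join gets a chance to filter anything.
groups_built = conn.execute("""
    select count(*)
    from (select product_id, sum(qty) as qty from sale group by product_id)
""").fetchone()[0]

# Groups that survive the join once the predicate on product.code is applied.
groups_kept = conn.execute("""
    select count(*)
    from (select product_id, sum(qty) as qty from sale group by product_id) s
    inner join product p on s.product_id = p.product_id
    where p.code = 'ABC1'
""").fetchone()[0]

wasted = 1 - groups_kept / groups_built

# Prints: 100 1 99%
print(groups_built, groups_kept, "{:.0%}".format(wasted))
```

Only one of the 100 groups survives the join, so the work spent building the other 99 is wasted in the early-grouping form.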

== Proposal ==

I've been working on allowing the planner to properly cost which option is better and to perform the GROUP BY at the join level it estimates to be most efficient.

So far I've not gotten as far as supporting queries which contain aggregate functions, as I need the ability to combine aggregate states (Simon and I proposed a patch for that here https://commitfest.postgresql.org/5/131/).

Performance of the above test case is looking quite good so far:

select s.product_id from sale s inner join product p on s.product_id = p.product_id group by s.product_id;

-- Master = 312.294 ms
-- Patched = 95.328 ms

So the patched version runs about 3.3 times faster in this particular case (312.294 / 95.328 is roughly 3.28).
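For reference, the arithmetic behind the comparison of the two timings can be checked directly. This is just a sketch of the calculation; the variable names are illustrative.

```python
master_ms = 312.294   # timing reported above for master
patched_ms = 95.328   # timing reported above for the patched build

# How many times faster the patched run is.
speedup = master_ms / patched_ms

# Fraction of the master runtime that the patch removes.
runtime_reduction = 1 - patched_ms / master_ms

print("speedup: {:.2f}x, runtime reduction: {:.1%}".format(speedup, runtime_reduction))
```

In other words, the patched query takes a little under a third of the time the unpatched one does.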

== Implementation ==

I've had to make some choices about how exactly to implement this feature. As explained above, we can't always do the grouping at the first possible opportunity, because a later join may eliminate groups, in which case we'd have created those groups needlessly.

To support this in the planner I've invented a GroupingPath type and I've also added 'cheapest_sorted_group_path' and 'cheapest_group_path' to RelOptInfo. I wasn't too sure about this part, and I wondered whether these GroupingPaths should just be added to the normal path list with add_path(). But since these paths perform more work than normal paths, that would put them at an unfair disadvantage and they could be thrown out. These paths can only become cheaper once a subsequent join has taken place.
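To illustrate the concern about mixing grouped paths into the normal path list, here is a toy sketch. It is not PostgreSQL's cost model and not code from the patch: the Path tuple, the per-row cost constants, and the helper functions are all hypothetical, and comparing paths on cost alone is a deliberate simplification of what add_path() does.

```python
from collections import namedtuple

# Hypothetical stand-in for a planner path: a label, a total cost, and output rows.
Path = namedtuple("Path", "name cost rows")

SCAN_COST_PER_ROW = 1.0    # assumed cost to read one row
GROUP_COST_PER_ROW = 0.5   # assumed cost to hash one input row into a group
JOIN_COST_PER_ROW = 1.0    # assumed cost to probe the join with one outer row

def scan_path(rows):
    return Path("scan", rows * SCAN_COST_PER_ROW, rows)

def grouped_path(child, groups):
    # Grouping adds work on top of the child path but emits only `groups` rows.
    cost = child.cost + child.rows * GROUP_COST_PER_ROW
    return Path("group(%s)" % child.name, cost, groups)

def join_path(outer):
    # Join cost grows with the outer row count; the inner side is assumed
    # to be unique on the join key, so the row count is unchanged.
    cost = outer.cost + outer.rows * JOIN_COST_PER_ROW
    return Path("join(%s)" % outer.name, cost, outer.rows)

def cheapest(paths):
    return min(paths, key=lambda p: p.cost)

plain = scan_path(1000000)
grouped = grouped_path(plain, 100)

# Compared on cost alone at the base relation, the grouped path loses.
kept_at_base = cheapest([plain, grouped])

# Yet if both candidates are kept, grouping early wins after the join.
group_late = grouped_path(join_path(plain), 100)
group_early = join_path(grouped)
best_final = cheapest([group_late, group_early])

# Prints: scan join(group(scan))
print(kept_at_base.name, best_final.name)
```

Under these assumed costs the grouped path is the more expensive one at the base relation, and only pays off once the join's per-row cost is counted, which is the reason for tracking such paths separately.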

There's also an issue with row estimates being wrong on joinrels: the row estimate uses the base rel's row count rather than the number of groups. As yet I'm unsure of the best way to fix this.

I wanted to post this early to register the fact that I'm working on this, but I'm also posting in the hope of getting some early feedback on what I have so far.

Of course there's lots more work to do here: aggregates need to be supported, and functional dependencies and transitive closures will also need to be detected in a more complete implementation.

Here's what I have so far (attached).

Comments are most welcome. Thanks.

Regards

David Rowley  

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
