Thread: Functional dependency in GROUP BY through JOINs

Functional dependency in GROUP BY through JOINs

From
"David Rowley"
Date:
<div class="WordSection1"><p class="MsoNormal">Some idle thoughts has provoked me to sending this email. I’m wondering
whatI’m seeing here, is a simple missed optimisation or something that would be very difficult to implement, or even
perhapssomething I’ve completely misunderstood.<p class="MsoNormal"> <p class="MsoNormal">To set the scene let’s go
withthe old products and sales example. In this example each product has been given a unique ID maybe due to perhaps
product_codesbeing a bit long and to keep the sales table as narrow as possible.<p class="MsoNormal"> <p
class="MsoNormal">Ifwe wanted to see the sales per product we could write something like this:<p class="MsoNormal"> <p
class="MsoNormal">SELECTp.product_code,SUM(s.quantity)<p class="MsoNormal">FROM products p<p class="MsoNormal">INNER
JOINbigsalestable s ON p.productid = s.productid<p class="MsoNormal">GROUP BY p.product_code;<p class="MsoNormal"> <p
class="MsoNormal">Thoughthis plan might not be quite as optimal as it could be as it performs the grouping after the
join.<pclass="MsoNormal"> <p class="MsoNormal">                                     QUERY PLAN<p
class="MsoNormal">-------------------------------------------------------------------------------------<p
class="MsoNormal">HashAggregate (cost=33195.13..33199.63 rows=450 width=150)<p class="MsoNormal">   ->  Hash Join 
(cost=20.13..28195.13rows=1000000 width=150)<p class="MsoNormal">         Hash Cond: (s.productid = p.productid)<p
class="MsoNormal">        ->  Seq Scan on bigsalestable s  (cost=0.00..14425.00 rows=1000000 width=8)<p
class="MsoNormal">        ->  Hash  (cost=14.50..14.50 rows=450 width=150)<p class="MsoNormal">               -> 
SeqScan on products p  (cost=0.00..14.50 rows=450 width=150)<p class="MsoNormal">(6 rows)<p class="MsoNormal"> <p
class="MsoNormal">Ofcourse the query could have been written in the first place as:<p class="MsoNormal"> <p
class="MsoNormal">SELECTp.product_code,s.quantity<p class="MsoNormal">FROM products AS p<p class="MsoNormal">INNER JOIN
(SELECTproductid,SUM(quantity) AS quantity FROM bigsalestable GROUP BY productid) AS s ON p.productid = s.productid;<p
class="MsoNormal"> <pclass="MsoNormal">And that would have given us a more efficient plan.<p class="MsoNormal"> <p
class="MsoNormal">Ofcourse, for these actual plans to be equivalent there would naturally have to be a unique index on
product_codein the products table.<p class="MsoNormal">I think I’m right in thinking that if a unique index exists to
matchthe group by clause, and the join condition is equality (probably using the same operator class as the unique
btreeindex?), then the grouping could be pushed up to before the join.<p class="MsoNormal"> <p class="MsoNormal">In the
attachedscript the hand optimised query runs about twice as fast with 1 million record sales table.<p
class="MsoNormal"> <pclass="MsoNormal">The existence of join removal makes me think we don’t want to always allow it up
towho or what is writing the query to allow the best choice in plan.<p class="MsoNormal"> <p class="MsoNormal">I’m not
reallyskilled enough in the arts of postgresql’s planner to look into how hard it would be to implement this, but I
thoughtI’d throw it out there to collect some comments on the idea.<p class="MsoNormal"> <p class="MsoNormal">Regards<p
class="MsoNormal"> <pclass="MsoNormal">David Rowley<p class="MsoNormal"> </div> 

Re: Functional dependency in GROUP BY through JOINs

From
"Kevin Grittner"
Date:
David Rowley wrote:

> If we wanted to see the sales per product we could write
> something like this:

> SELECT p.product_code,SUM(s.quantity)
> FROM products p
> INNER JOIN bigsalestable s ON p.productid = s.productid
> GROUP BY p.product_code;

> Though this plan might not be quite as optimal as it could be as
> it performs the grouping after the join.

> Of course the query could have been written in the first place
> as:

> SELECT p.product_code,s.quantity
> FROM products AS p
> INNER JOIN (SELECT productid,SUM(quantity) AS quantity
>             FROM bigsalestable GROUP BY productid) AS s
>   ON p.productid = s.productid;

> And that would have given us a more efficient plan.

> Of course, for these actual plans to be equivalent there would
> naturally have to be a unique index on product_code in the
> products table.
> 
> I think I'm right in thinking that if a unique index exists to
> match the group by clause, and the join condition is equality
> (probably using the same operator class as the unique btree
> index?), then the grouping could be pushed up to before the join.

Off-hand, it seems equivalent to me; I don't know how much work it
would be.

Out of curiosity, does the first query's plan change if you run
this instead?:

SELECT s.product_code,SUM(s.quantity)
FROM products p
INNER JOIN bigsalestable s ON p.productid = s.productid
GROUP BY s.product_code;

-Kevin



Re: Functional dependency in GROUP BY through JOINs

From
Simon Riggs
Date:
On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:

> Though this plan might not be quite as optimal as it could be as it performs
> the grouping after the join.

PostgreSQL always calculates aggregation as the last step.

It's a well known optimisation to push-down GROUP BY clauses to the
lowest level, but we don't do that, yet.

You're right that it can make a massive difference to many queries.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Functional dependency in GROUP BY through JOINs

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:
>> Though this plan might not be quite as optimal as it could be as it performs
>> the grouping after the join.

> PostgreSQL always calculates aggregation as the last step.

> It's a well known optimisation to push-down GROUP BY clauses to the
> lowest level, but we don't do that, yet.

> You're right that it can make a massive difference to many queries.

In the case being presented here, it's not apparent to me that there's
any advantage to be had at all.  You still need to aggregate over the
rows joining to each uniquely-keyed row.  So how exactly are you going
to "push down the GROUP BY", and where does the savings come from?
        regards, tom lane



Re: Functional dependency in GROUP BY through JOINs

From
Simon Riggs
Date:
On 6 December 2012 17:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:
>>> Though this plan might not be quite as optimal as it could be as it performs
>>> the grouping after the join.
>
>> PostgreSQL always calculates aggregation as the last step.
>
>> It's a well known optimisation to push-down GROUP BY clauses to the
>> lowest level, but we don't do that, yet.
>
>> You're right that it can make a massive difference to many queries.
>
> In the case being presented here, it's not apparent to me that there's
> any advantage to be had at all.  You still need to aggregate over the
> rows joining to each uniquely-keyed row.  So how exactly are you going
> to "push down the GROUP BY", and where does the savings come from?

David presents SQL that shows how that is possible.

In terms of operators, after push-down we aggregate 1 million rows and
then join 450, which seems cheaper than joining 1 million rows and then
aggregating 1 million. So we're passing nearly 1 million fewer rows into
the join.
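
(In round numbers: 1 million rows into the aggregate plus 450 rows into the
join after push-down, versus 1 million rows into the join plus 1 million rows
into the aggregate without it.)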

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Functional dependency in GROUP BY through JOINs

From
"Kevin Grittner"
Date:
Tom Lane wrote:

> In the case being presented here, it's not apparent to me that
> there's any advantage to be had at all.

The OP reported a different plan which was twice as fast, although
showing EXPLAIN ANALYZE results for both would be nice confirmation
of that.

> You still need to aggregate over the rows joining to each
> uniquely-keyed row.

Yes.

> So how exactly are you going to "push down the GROUP BY", and
> where does the savings come from?

There are 1000000 rows in bigsalestable that fall into 450
groups. The question is whether you look up related data in the
products table once per row or once per group. Apparently those
extra 999550 lookups take enough time to matter.

-Kevin



Re: Functional dependency in GROUP BY through JOINs

From
"David Rowley"
Date:
> From: Kevin Grittner [mailto:kgrittn@mail.com]
> > I think I'm right in thinking that if a unique index exists to match
> > the group by clause, and the join condition is equality (probably
> > using the same operator class as the unique btree index?), then the
> > grouping could be pushed up to before the join.
>
> Off-hand, it seems equivalent to me; I don't know how much work it would
> be.
>
> Out of curiosity, does the first query's plan change if you run this instead?:
>
> SELECT s.product_code,SUM(s.quantity)
> FROM products p
> INNER JOIN bigsalestable s ON p.productid = s.productid GROUP BY
> s.product_code;
>

I should have made it clearer that there is no product_code column in
bigsalestable; there's only productid. The lookup to the products table was to
obtain the actual, more user-friendly product_code.

The actual table def's are in the attachment of the original email.

Regards

David Rowley

> -Kevin




Re: Functional dependency in GROUP BY through JOINs

From
"David Rowley"
Date:
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 07 December 2012 06:22
> To: Simon Riggs
> Cc: David Rowley; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Functional dependency in GROUP BY through JOINs
> 
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com>
> wrote:
> >> Though this plan might not be quite as optimal as it could be as it
> >> performs the grouping after the join.
> 
> > PostgreSQL always calculates aggregation as the last step.
> 
> > It's a well known optimisation to push-down GROUP BY clauses to the
> > lowest level, but we don't do that, yet.
> 
> > You're right that it can make a massive difference to many queries.
> 
> In the case being presented here, it's not apparent to me that there's any
> advantage to be had at all.  You still need to aggregate over the rows
joining
> to each uniquely-keyed row.  So how exactly are you going to "push down
> the GROUP BY", and where does the savings come from?
> 

I guess the saving, in this case at least, comes from the join only having to
join 10 rows on either side, but I think also from the grouping being cheaper
because it's done on an INT rather than a CHAR().
But I'm thinking you mean the planner would have to know this is cheaper and
compare both versions of the plan.

I should have shown the plan of the nested query originally, so here
it is this time.

Version = 9.2.1


test=# EXPLAIN ANALYZE
test-# SELECT p.product_code,SUM(s.quantity)
test-# FROM products p
test-# INNER JOIN bigsalestable s ON p.productid = s.productid
test-# GROUP BY p.product_code;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=18926.22..18926.32 rows=10 width=150) (actual time=553.403..553.405 rows=10 loops=1)
   ->  Hash Join  (cost=1.23..18676.22 rows=50000 width=150) (actual time=0.041..324.970 rows=1000000 loops=1)
         Hash Cond: (s.productid = p.productid)
         ->  Seq Scan on bigsalestable s  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.012..70.627 rows=1000000 loops=1)
         ->  Hash  (cost=1.10..1.10 rows=10 width=150) (actual time=0.013..0.013 rows=10 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               ->  Seq Scan on products p  (cost=0.00..1.10 rows=10 width=150) (actual time=0.004..0.007 rows=10 loops=1)
 Total runtime: 553.483 ms
(8 rows)

test=# EXPLAIN ANALYZE
test-# SELECT p.product_code,s.quantity
test-# FROM products AS p
test-# INNER JOIN (SELECT productid,SUM(quantity) AS quantity FROM bigsalestable GROUP BY productid) AS s ON p.productid = s.productid;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=19426.22..19431.07 rows=10 width=154) (actual time=295.548..295.557 rows=10 loops=1)
   Hash Cond: (bigsalestable.productid = p.productid)
   ->  HashAggregate  (cost=19425.00..19427.00 rows=200 width=8) (actual time=295.514..295.518 rows=10 loops=1)
         ->  Seq Scan on bigsalestable  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.004..59.330 rows=1000000 loops=1)
   ->  Hash  (cost=1.10..1.10 rows=10 width=150) (actual time=0.017..0.017 rows=10 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  Seq Scan on products p  (cost=0.00..1.10 rows=10 width=150) (actual time=0.010..0.012 rows=10 loops=1)
 Total runtime: 295.612 ms
(8 rows)
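
So the nested form comes in at 295.612 ms against 553.483 ms for the original,
in line with the roughly 2x speedup I mentioned in the first email.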

Regards

David Rowley


>             regards, tom lane




Re: Functional dependency in GROUP BY through JOINs

From
"David Rowley"
Date:
> From: Simon Riggs [mailto:simon@2ndQuadrant.com]
> Sent: 07 December 2012 05:44
> To: David Rowley
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Functional dependency in GROUP BY through JOINs
> 
> On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:
> 
> > Though this plan might not be quite as optimal as it could be as it
> > performs the grouping after the join.
> 
> PostgreSQL always calculates aggregation as the last step.

I didn't know that was always the case, but it makes sense I guess. 
This is probably a bigger project than I imagined it would be then.

> 
> It's a well known optimisation to push-down GROUP BY clauses to the lowest
> level, but we don't do that, yet.
> 
> You're right that it can make a massive difference to many queries.
> 

I agree.

Maybe it would be something worthwhile for the future then. Perhaps, if
others agree, it should go on the TODO list?

Regards

David Rowley

> --
>  Simon Riggs                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services