Thread: Functional dependency in GROUP BY through JOINs
<div class="WordSection1"><p class="MsoNormal">Some idle thoughts has provoked me to sending this email. I’m wondering whatI’m seeing here, is a simple missed optimisation or something that would be very difficult to implement, or even perhapssomething I’ve completely misunderstood.<p class="MsoNormal"> <p class="MsoNormal">To set the scene let’s go withthe old products and sales example. In this example each product has been given a unique ID maybe due to perhaps product_codesbeing a bit long and to keep the sales table as narrow as possible.<p class="MsoNormal"> <p class="MsoNormal">Ifwe wanted to see the sales per product we could write something like this:<p class="MsoNormal"> <p class="MsoNormal">SELECTp.product_code,SUM(s.quantity)<p class="MsoNormal">FROM products p<p class="MsoNormal">INNER JOINbigsalestable s ON p.productid = s.productid<p class="MsoNormal">GROUP BY p.product_code;<p class="MsoNormal"> <p class="MsoNormal">Thoughthis plan might not be quite as optimal as it could be as it performs the grouping after the join.<pclass="MsoNormal"> <p class="MsoNormal"> QUERY PLAN<p class="MsoNormal">-------------------------------------------------------------------------------------<p class="MsoNormal">HashAggregate (cost=33195.13..33199.63 rows=450 width=150)<p class="MsoNormal"> -> Hash Join (cost=20.13..28195.13rows=1000000 width=150)<p class="MsoNormal"> Hash Cond: (s.productid = p.productid)<p class="MsoNormal"> -> Seq Scan on bigsalestable s (cost=0.00..14425.00 rows=1000000 width=8)<p class="MsoNormal"> -> Hash (cost=14.50..14.50 rows=450 width=150)<p class="MsoNormal"> -> SeqScan on products p (cost=0.00..14.50 rows=450 width=150)<p class="MsoNormal">(6 rows)<p class="MsoNormal"> <p class="MsoNormal">Ofcourse the query could have been written in the first place as:<p class="MsoNormal"> <p class="MsoNormal">SELECTp.product_code,s.quantity<p class="MsoNormal">FROM products AS p<p class="MsoNormal">INNER JOIN (SELECTproductid,SUM(quantity) AS quantity FROM bigsalestable GROUP BY productid) AS s ON p.productid = s.productid;<p class="MsoNormal"> <pclass="MsoNormal">And that would have given us a more efficient plan.<p class="MsoNormal"> <p class="MsoNormal">Ofcourse, for these actual plans to be equivalent there would naturally have to be a unique index on product_codein the products table.<p class="MsoNormal">I think I’m right in thinking that if a unique index exists to matchthe group by clause, and the join condition is equality (probably using the same operator class as the unique btreeindex?), then the grouping could be pushed up to before the join.<p class="MsoNormal"> <p class="MsoNormal">In the attachedscript the hand optimised query runs about twice as fast with 1 million record sales table.<p class="MsoNormal"> <pclass="MsoNormal">The existence of join removal makes me think we don’t want to always allow it up towho or what is writing the query to allow the best choice in plan.<p class="MsoNormal"> <p class="MsoNormal">I’m not reallyskilled enough in the arts of postgresql’s planner to look into how hard it would be to implement this, but I thoughtI’d throw it out there to collect some comments on the idea.<p class="MsoNormal"> <p class="MsoNormal">Regards<p class="MsoNormal"> <pclass="MsoNormal">David Rowley<p class="MsoNormal"> </div>
David Rowley wrote:

> If we wanted to see the sales per product we could write
> something like this:
> SELECT p.product_code,SUM(s.quantity)
> FROM products p
> INNER JOIN bigsalestable s ON p.productid = s.productid
> GROUP BY p.product_code;
> Though this plan might not be quite as optimal as it could be as
> it performs the grouping after the join.
> Of course the query could have been written in the first place
> as:
> SELECT p.product_code,s.quantity
> FROM products AS p
> INNER JOIN (SELECT productid,SUM(quantity) AS quantity
>             FROM bigsalestable GROUP BY productid) AS s
>   ON p.productid = s.productid;
> And that would have given us a more efficient plan.
> Of course, for these actual plans to be equivalent there would
> naturally have to be a unique index on product_code in the
> products table.
>
> I think I'm right in thinking that if a unique index exists to
> match the group by clause, and the join condition is equality
> (probably using the same operator class as the unique btree
> index?), then the grouping could be pushed up to before the join.

Off-hand, it seems equivalent to me; I don't know how much work it
would be.

Out of curiosity, does the first query's plan change if you run this
instead?:

SELECT s.product_code,SUM(s.quantity)
FROM products p
INNER JOIN bigsalestable s ON p.productid = s.productid
GROUP BY s.product_code;

-Kevin
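To see why the unique index matters for the equivalence Kevin mentions, consider the sketched schema above but without the unique constraint on product_code (hypothetical data, for illustration only):

-- Two products sharing a product_code, which a unique index would forbid.
INSERT INTO products VALUES (1, 'A-100'), (2, 'A-100');
INSERT INTO bigsalestable VALUES (1, 10), (2, 20);

-- Original form: grouping on product_code collapses both products
-- into a single group -> one row: ('A-100', 30)
SELECT p.product_code, SUM(s.quantity)
FROM products p
INNER JOIN bigsalestable s ON p.productid = s.productid
GROUP BY p.product_code;

-- Rewritten form: grouping happens per productid before the join
-- -> two rows: ('A-100', 10) and ('A-100', 20)
SELECT p.product_code, s.quantity
FROM products AS p
INNER JOIN (SELECT productid, SUM(quantity) AS quantity
            FROM bigsalestable GROUP BY productid) AS s
  ON p.productid = s.productid;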
On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:

> Though this plan might not be quite as optimal as it could be as it performs
> the grouping after the join.

PostgreSQL always calculates aggregation as the last step.

It's a well known optimisation to push-down GROUP BY clauses to the
lowest level, but we don't do that, yet.

You're right that it can make a massive difference to many queries.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
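As a sketch of the push-down Simon describes (the table and column names dim, fact, label and x are illustrative, not from the thread), the hand-written transformation turns a join-then-aggregate query into an aggregate-then-join one:

-- Join first, then aggregate: what the planner currently produces.
SELECT d.label, SUM(f.x)
FROM dim d
INNER JOIN fact f ON d.id = f.dim_id
GROUP BY d.label;

-- Aggregate the large table first, then join. Only equivalent when
-- each d.id maps to exactly one grouping value, e.g. d.label is
-- covered by a unique index, as David noted.
SELECT d.label, f.x
FROM dim d
INNER JOIN (SELECT dim_id, SUM(x) AS x
            FROM fact GROUP BY dim_id) AS f
  ON d.id = f.dim_id;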
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:
>> Though this plan might not be quite as optimal as it could be as it performs
>> the grouping after the join.

> PostgreSQL always calculates aggregation as the last step.

> It's a well known optimisation to push-down GROUP BY clauses to the
> lowest level, but we don't do that, yet.

> You're right that it can make a massive difference to many queries.

In the case being presented here, it's not apparent to me that there's
any advantage to be had at all.  You still need to aggregate over the
rows joining to each uniquely-keyed row.  So how exactly are you going
to "push down the GROUP BY", and where does the savings come from?

			regards, tom lane
On 6 December 2012 17:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:
>>> Though this plan might not be quite as optimal as it could be as it performs
>>> the grouping after the join.
>
>> PostgreSQL always calculates aggregation as the last step.
>
>> It's a well known optimisation to push-down GROUP BY clauses to the
>> lowest level, but we don't do that, yet.
>
>> You're right that it can make a massive difference to many queries.
>
> In the case being presented here, it's not apparent to me that there's
> any advantage to be had at all.  You still need to aggregate over the
> rows joining to each uniquely-keyed row.  So how exactly are you going
> to "push down the GROUP BY", and where does the savings come from?

David presents SQL that shows how that is possible.

In terms of operators, after push down we aggregate 1 million rows and
then join 450.  Which seems cheaper than join 1 million rows and
aggregate 1 million.  So we're passing nearly 1 million fewer rows into
the join.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane wrote:

> In the case being presented here, it's not apparent to me that
> there's any advantage to be had at all.

The OP reported a different plan which was twice as fast, although
showing EXPLAIN ANALYZE results for both would be nice confirmation
of that.

> You still need to aggregate over the rows joining to each
> uniquely-keyed row.

Yes.

> So how exactly are you going to "push down the GROUP BY", and
> where does the savings come from?

There are 1000000 rows in bigsalestable that fall into 450 groups.
The question is whether you look up related data in the products
table once per row or once per group. Apparently those extra 999550
lookups take enough time to matter.

-Kevin
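A quick way to see the ratio Kevin describes on the example tables (the exact counts obviously depend on the test data):

SELECT count(*) AS sales_rows,                          -- rows aggregated in either form
       count(DISTINCT productid) AS distinct_products   -- rows that need to reach the join after push-down
FROM bigsalestable;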
> From: Kevin Grittner [mailto:kgrittn@mail.com]
>
> > I think I'm right in thinking that if a unique index exists to match
> > the group by clause, and the join condition is equality (probably
> > using the same operator class as the unique btree index?), then the
> > grouping could be pushed up to before the join.
>
> Off-hand, it seems equivalent to me; I don't know how much work it would
> be.
>
> Out of curiosity, does the first query's plan change if you run this instead?:
>
> SELECT s.product_code,SUM(s.quantity)
> FROM products p
> INNER JOIN bigsalestable s ON p.productid = s.productid GROUP BY
> s.product_code;
>

I should have made it more clear about the lack of product_code in the
bigsalestable; there's only productid. The lookup to the product table
was to obtain the actual, more user-friendly product_code. The actual
table defs are in the attachment of the original email.

Regards

David Rowley

> -Kevin
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 07 December 2012 06:22
> To: Simon Riggs
> Cc: David Rowley; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Functional dependency in GROUP BY through JOINs
>
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:
> >> Though this plan might not be quite as optimal as it could be as it
> >> performs the grouping after the join.
>
> > PostgreSQL always calculates aggregation as the last step.
>
> > It's a well known optimisation to push-down GROUP BY clauses to the
> > lowest level, but we don't do that, yet.
>
> > You're right that it can make a massive difference to many queries.
>
> In the case being presented here, it's not apparent to me that there's any
> advantage to be had at all. You still need to aggregate over the rows joining
> to each uniquely-keyed row. So how exactly are you going to "push down
> the GROUP BY", and where does the savings come from?
>

I guess the saving, at least in this case, is because the join only joins 10 rows on either side, but I think also because the grouping is cheaper as it's done on an INT rather than a CHAR(). But I'm thinking you're meaning the planner would have to know this is cheaper and compare both versions of the plan.

I should have shown the plan of the other nested query originally, so here it is this time.

Version = 9.2.1

test=# EXPLAIN ANALYZE
test-# SELECT p.product_code,SUM(s.quantity)
test-# FROM products p
test-# INNER JOIN bigsalestable s ON p.productid = s.productid
test-# GROUP BY p.product_code;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=18926.22..18926.32 rows=10 width=150) (actual time=553.403..553.405 rows=10 loops=1)
   ->  Hash Join  (cost=1.23..18676.22 rows=50000 width=150) (actual time=0.041..324.970 rows=1000000 loops=1)
         Hash Cond: (s.productid = p.productid)
         ->  Seq Scan on bigsalestable s  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.012..70.627 rows=1000000 loops=1)
         ->  Hash  (cost=1.10..1.10 rows=10 width=150) (actual time=0.013..0.013 rows=10 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               ->  Seq Scan on products p  (cost=0.00..1.10 rows=10 width=150) (actual time=0.004..0.007 rows=10 loops=1)
 Total runtime: 553.483 ms
(8 rows)

test=# EXPLAIN ANALYZE
test-# SELECT p.product_code,s.quantity
test-# FROM products AS p
test-# INNER JOIN (SELECT productid,SUM(quantity) AS quantity FROM bigsalestable GROUP BY productid) AS s ON p.productid = s.productid;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=19426.22..19431.07 rows=10 width=154) (actual time=295.548..295.557 rows=10 loops=1)
   Hash Cond: (bigsalestable.productid = p.productid)
   ->  HashAggregate  (cost=19425.00..19427.00 rows=200 width=8) (actual time=295.514..295.518 rows=10 loops=1)
         ->  Seq Scan on bigsalestable  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.004..59.330 rows=1000000 loops=1)
   ->  Hash  (cost=1.10..1.10 rows=10 width=150) (actual time=0.017..0.017 rows=10 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  Seq Scan on products p  (cost=0.00..1.10 rows=10 width=150) (actual time=0.010..0.012 rows=10 loops=1)
 Total runtime: 295.612 ms
(8 rows)

Regards

David Rowley

> regards, tom lane
> From: Simon Riggs [mailto:simon@2ndQuadrant.com]
> Sent: 07 December 2012 05:44
> To: David Rowley
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Functional dependency in GROUP BY through JOINs
>
> On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:
>
> > Though this plan might not be quite as optimal as it could be as it
> > performs the grouping after the join.
>
> PostgreSQL always calculates aggregation as the last step.

I didn't know that was always the case, but it makes sense I guess. This is probably a bigger project than I imagined it would be then.

>
> It's a well known optimisation to push-down GROUP BY clauses to the lowest
> level, but we don't do that, yet.
>
> You're right that it can make a massive difference to many queries.
>

I agree. Maybe it'd be something worthwhile for the future then. Perhaps if others agree it should be something to go on the TODO list?

Regards

David Rowley

> --
> Simon Riggs http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services