Re: Removing Functionally Dependent GROUP BY Columns - Mailing list pgsql-hackers

From Julien Rouhaud
Subject Re: Removing Functionally Dependent GROUP BY Columns
Date
Msg-id 5694D782.8010401@dalibo.com
In response to Re: Removing Functionally Dependent GROUP BY Columns  (David Rowley <david.rowley@2ndquadrant.com>)
Responses Re: Removing Functionally Dependent GROUP BY Columns  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
On 01/12/2015 12:07, David Rowley wrote:
> On 1 December 2015 at 17:09, Marko Tiikkaja <marko@joh.to> wrote:
> 
>     On 2015-12-01 05:00, David Rowley wrote:
> 
>         We already allow a SELECT's target list to contain
>         non-aggregated columns
>         in a GROUP BY query in cases where the non-aggregated column is
>         functionally dependent on the GROUP BY clause.
> 
>         For example a query such as;
> 
>         SELECT p.product_id,p.description, SUM(s.quantity)
>         FROM product p
>         INNER JOIN sale s ON p.product_id = s.product_id
>         GROUP BY p.product_id;
> 
>         is perfectly fine in PostgreSQL, as p.description is
>         functionally dependent
>         on p.product_id (assuming product_id is the PRIMARY KEY of product).
> 
> 
>     This has come up before (on other forums, at least), and my main
>     concern has been that unlike the case where we go from throwing an
>     error to allowing a query, this has a chance to make the planning of
>     currently legal queries slower.  Have you tried to measure the
>     impact of this on queries where there's no runtime gains to be had?
> 
> 
> I've performed a series of benchmarks on the following queries:
> 
> Test1: explain select id1,id2 from t1 group by id1,id2;
> Test2: explain select id from t2 group by id;
> Test3: explain select t1.id1,t1.id2 from t2 inner join t1 on
> t1.id1=t2.id group by t1.id1,t1.id2;
>  
> I ran each of these with pgbench for 60 seconds, 3 runs per query. In
> each case below I've converted the TPS into seconds using the average
> TPS over the 3 runs.
> 
> In summary:
> 
> Test1 is the worst case test. It's a very simple query so planning
> overhead of join searching is non-existent. The fact that there's 2
> columns in the GROUP BY means that the fast path cannot be used. I added
> this as if there's only 1 column in the GROUP BY then there's no point
> in searching for something to remove.
> 
> [...]
> 
> It seems that the worst case test adds about 7.6 microseconds onto
> planning time. To get this worst case result I had to add two GROUP BY
> columns, as having only 1 triggers a fast path as the code knows it
> can't remove any columns, since there's only 1. A similar fast path also
> exists which will only lookup the PRIMARY KEY details if there's more
> than 1 column per relation in the GROUP BY, so for example GROUP BY
> rel1.col1, rel2.col1 won't lookup any PRIMARY KEY constraint.
> 
> Given that the extra code really only does anything if the GROUP BY has
> 2 or more expressions, are you worried that this will affect too many
> short and fast to execute queries negatively?
> 
> 

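For reference, a sketch of a schema under which the functional dependency
in David's example holds (the sale_id column and the FK are my assumptions,
only product_id being the PRIMARY KEY of product matters here):

```sql
CREATE TABLE product (
    product_id  int PRIMARY KEY,
    description text
);

CREATE TABLE sale (
    sale_id    int PRIMARY KEY,
    product_id int REFERENCES product,
    quantity   int
);

-- Legal without p.description in the GROUP BY, since the PK
-- functionally determines it:
SELECT p.product_id, p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id;
```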
In identify_key_vars()

+       /* Only PK constraints are of interest for now, see comment above */
+       if (con->contype != CONSTRAINT_PRIMARY)
+           continue;

I think the comment should instead refer to
remove_useless_groupby_columns() (i.e. we don't optimize further than
what check_functional_grouping() does).


+       /*
+        * Exit if the constraint is deferrable, there's no point in looking
+        * for another PK constriant, as there can only be one.
+        */

small typo on "constriant"



+               {
+                   varlistmatches = bms_add_member(varlistmatches, varlistidx);
+                   found_col = true;
+                   /* don't break, there might be dupicates */
+               }

small typo on "dupicates". Also, I thought transformGroupClauseExpr(),
called during parse analysis, makes sure that there aren't any
duplicates in the GROUP BY clause. Am I missing something?


in remove_useless_groupby_columns()

+       /*
+        * Skip if there were no Vars found in the GROUP BY clause which belong
+        * to this relation. We can also skip if there's only 1 Var, as there
+        * needs to be more than one Var for there to be any columns which are
+        * functionally dependent on another column.
+        */


This comment doesn't sound very clear. I'm not a native English speaker,
so maybe it's just me.

+   /*
+    * If we found any surplus Vars in the GROUP BY clause, then we'll build
+    * a new GROUP BY clause without these surplus Vars.
+    */
+   if (anysurplus)
+   {
+       List *new_groupby = NIL;
+
+       foreach (lc, root->parse->groupClause)
+       {
+           SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+           TargetEntry *tle;
+           Var *var;
+
+           tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+           var = (Var *) tle->expr;
+
+           if (!IsA(var, Var))
+               continue;
+ [...]

If var isn't a Var, it still needs to be added to new_groupby.
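Something along these lines, perhaps (an untested sketch of the patch's
own loop, just keeping non-Var grouping items instead of dropping them):

```c
foreach(lc, root->parse->groupClause)
{
    SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
    TargetEntry *tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
    Var        *var = (Var *) tle->expr;

    if (!IsA(var, Var))
    {
        /* non-Var expressions can't be surplus; keep them as-is */
        new_groupby = lappend(new_groupby, sgc);
        continue;
    }
    /* ... existing surplus-Var handling ... */
}
```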


+           /* XXX do we to alter tleSortGroupRef to remove gaps? */

no idea on that :/

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org


