Re: Removing Functionally Dependent GROUP BY Columns - Mailing list pgsql-hackers
From | Julien Rouhaud |
---|---|
Subject | Re: Removing Functionally Dependent GROUP BY Columns |
Date | |
Msg-id | 56A36A89.6050107@dalibo.com Whole thread Raw |
In response to | Re: Removing Functionally Dependent GROUP BY Columns (David Rowley <david.rowley@2ndquadrant.com>) |
Responses |
Re: Removing Functionally Dependent GROUP BY Columns
Re: Removing Functionally Dependent GROUP BY Columns |
List | pgsql-hackers |
On 23/01/2016 10:44, David Rowley wrote: > On 23 January 2016 at 12:44, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I spent some time looking through this but decided that it needs some work >> to be committable. >> >> The main thing I didn't like was that identify_key_vars seems to have a >> rather poorly chosen API: it mixes up identifying a rel's pkey with >> sorting through the GROUP BY list. I think it would be better to create >> a function that, given a relid, hands back the OID of the pkey constraint >> and a Bitmapset of the column numbers in the pkey (or 0/NULL if there's >> no pkey or it's deferrable). The column numbers should be offset by >> FirstLowInvalidHeapAttributeNumber so that a pkey on OID can be >> represented correctly --- compare RelationGetIndexAttrBitmap(). > > That seems like a much better design to me too. I've made changes and > attached the updated patch. > >> The reason this seems like a more attractive solution is that the output >> of the pg_constraint lookup becomes independent of the particular query >> and thus is potentially cacheable (in the relcache, say). I do not think >> we need to cache it right now but I'd like to have the option to do so. > > I've not touched that yet, but good idea. > >> (I wonder BTW whether check_functional_grouping couldn't be refactored >> to use the output of such a function, too.) > > I've attached a separate patch for that too. It applies after the > prunegroupby patch. > This refactoring makes the code much more better, +1 for me. I also reviewed it, it looks fine. However, the following removed comment could be kept for clarity: - /* The PK is a subset of grouping_columns, so we win */ >> Some lesser points: >> >> * I did not like where you plugged in the call to >> remove_useless_groupby_columns; there are weird interactions with grouping >> sets as to whether it will get called or not, and also that whole chunk >> of code is due for refactoring. I'd be inclined to call it from >> subquery_planner instead, maybe near where the HAVING clause preprocessing >> happens. > > I've moved the call to subquery_planner() > >> * You've got it iterating over every RTE, even those that aren't >> RTE_RELATION RTEs. It's pure luck that identify_key_vars doesn't fail >> outright when passed a bogus relid, and it's certainly wasteful. > > :-( I've fixed that now. > >> * Both of the loops iterating over the groupClause neglect to check >> varlevelsup, thus leading to assert failures or worse if an outer-level >> Var is present in the GROUP BY list. (I'm pretty sure outer Vars can >> still be present at this point, though I might be wrong.) > > Fixed in the first loop, and the way I've rewritten the code to use > bms_difference, there's no need to check again in the 2nd loop. > >> * I'd be inclined to see if the relvarlists couldn't be converted into >> bitmapsets too. Then the test to see if the pkey is a subset of the >> grouping columns becomes a simple and cheap bms_is_subset test. You don't >> really need surplusvars either, because you can instead test Vars for >> membership in the pkey bitmapsets that pg_constraint.c will be handing >> back. > > I've changed to use a bitmapset now, but I've kept surplusvars, having > this allows a single pass over the group by clause to remove the > surplus Vars, rather than doing it again and again for each relation. > I've also found a better way so that array is only allocated if some > surplus Vars are found. I understand what you mean, and yes, it is > possible to get rid of it, but I'd need to still add something else to > mark that this rel's extra Vars are okay to be removed. I could do > that by adding another bitmapset and marking the relid in that, but I > quite like how I've changed it now as it saves having to check > varlevelsup again on the Vars in the GROUP BY clause, as I just use > bms_difference() on the originally found Vars subtracting off the PK > vars, and assume all of those are surplus. > I wonder if in remove_useless_groupby_columns(), in the foreach loop you could change the + if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1) + { by something like + if (bms_num_members(relattnos) <= bms_num_members(pkattnos)) + continue; + + if (bms_is_subset(pkattnos, relattnos)) + { which may be cheaper. Otherwise, I couldn't find any issue with this new version of the patch. >> * What you did to join.sql/join.out seems a bit bizarre. The existing >> test case that you modified was meant to test something else, and >> conflating this behavior with the pre-existing one (and not documenting >> it) is confusing as can be. A bit more work on regression test cases >> seems indicated. > > The history behind that is that at one point during developing the > patch that test had started failing due to the group by item being > removed therefore allowing the join removal conditions to be met. On > testing again with the old test query I see this no longer happens, so > I've removed the change, although the expected output still differs > due to the group by item being removed. > >> I'm going to set this back to "Waiting on Author". I think the basic >> idea is good but it needs a rewrite. > > Thanks for the review and the good ideas. > -- Julien Rouhaud http://dalibo.com - http://dalibo.org
pgsql-hackers by date: