Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities) - Mailing list pgsql-hackers
From | Pavel Stehule |
---|---|
Subject | Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities) |
Date | |
Msg-id | 162867790905171250j1b84085buf984545cb1c404b8@mail.gmail.com Whole thread Raw |
In response to | Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities) (Олег Царев <zabivator@gmail.com>) |
List | pgsql-hackers |
Hello, here is last my grouping sets patch - next work will be done by Oleg. This patch has full functionality of grouping sets - with support for grouping via expr. postgres=# select extract(year from a),a,b, sum(c) from foo group by grouping sets(extract(year from a), a, b,()); date_part | a | b | sum -----------+------------+----+----- 2001 | | | 40 | 2001-10-12 | | 20 | 2001-10-11 | | 20 | | 10 | 40 | | | 40 (5 rows) postgres=# select * from foo; a | b | c ------------+----+---- 2001-10-11 | 10 | 20 2001-10-12 | 10 | 20 (2 rows) This patch is WIP - has full functionality, but is based on CTE - what is mas/menos ugly hack. But for testing and first view about grouping sets, it should do good work. regards Pavel Stehule 2009/5/10 Олег Царев <zabivator@gmail.com>: > I will write separated mail about rollup. > Can you send me some links or information about "CTE"? What is it? > Also, I need some deep knownledge about postgresql aggregation > calculation (executor part) - can you help me (links, books, name of > source files, etc)?. > After get additional information i will can continue discussion. > Thanls > 2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>: >> 2009/5/10 Олег Царев <zabivator@gmail.com>: >>> Hello, Pavel. >>> >>> I read you letter, and found some points: >>> >>> a) Simple transformation any GROUPING SETS in GROUP BY + UNION ALL >>> require clone source for every group. >>> It's so slow. >>> My point: we can make for start simple implementation. >>> b) Your sollution it's so good, IMHO >>> WITH q AS (SELECT * FROM some) >>> SELECT * FROM q GROUP BY a >>> UNION ALL >>> SELECT * FROM q GROUP BY b >>> But can you describe me how it's used on planner (calculate cost) and exector. >> >> look on CTE source code or ask to CTE author. >> >>> Sharing source - it's mind - we don't process tree, we process >>> DIRECTED GRAPH. Hard, many bugs, so on, not? >> >> for example - code for DISTINCT is shared with code for aggregation. >> My idea is based on similarity >> >> GROUPING SETS is subset of CTE, nonrecursive CTE is subset of UNION ALL >> >>> PostgreSQL support sharing nodes of executor? How? Where i can read? >> >> I don't know, what do you thing exactly sharing nodes? Before CTE I >> had to write special holder node, but it isn't necessary now. >> >>> Next. Ok, we spool source. After we use simple hash group/sort group + >>> union all? Or more advanced techniques? If different group by - it's >>> not different from my idea, it's logic additional (spool source for >>> fast rewind) >> >> I thing so trivial implementation via UNION ALL should work, but any >> optimisations means lot of work - and when you add rewind >> functionality, you will got trivial nonrecursive CTE implementation. >> When I wrote patch, there wasn't possible an use CTE, so I played wit >> own nodes. Now, the more clean solution is probably based on current >> CTE base. >> >>> >>> c) Don't forget also about node requiments - if we use hash table as >>> container for source - we reduce some sorting properties - and query >>> select A,B from TABLE group by ((A,B),(B)) order by a,b require >>> sorting on output of grouping sets. >>> It's mind - we need insert sort. >> >> Yes, probably there should be possible some techniques - that should >> to eliminate final sort op. I am not sure if it is really important. >> Now I thinking what is the bigger devil a) patch complexity b) some >> suboptimality (mainly redundant call of agg nodes). For me a. >> Aggregates are not gratis, but significantly more expensive is >> repeated seq data scan. So it is first goal. Later we should to thing >> about next optimalizations. >> >>> >>> d) We can make SORT ROLLUP and SORT REDUCE ROLLUP. >>> first - sort group by use sorting for grouping data and calculate aggregations. >>> So, 'order by A,B,C' also 'order by A,B' also order by 'A' also order by '' >>> What it's mind? It's mind we can use sort features for implement SORT >>> ROLLUP. For every sub group we calculate self aggregates... oh, i bad >>> say. Look on "select a,sum(b) from table group by a". Sort group >>> grouping by a, and for every a-group calculate aggregations (sum), for >>> ROLLUP A,B we can make three aggregations ; for A,B than A than () - >>> and use one path on source calculate aggregations on every step, and >>> for split source on different subsets. >>> It's more perfomance, than different group by and union all. >> >> Maybe, I don't know. What I know: >> >> * UNION ALL will have better result on indexed sets and MIN, MAX >> aggregates. CTE isn't optimized now for these cases. >> * UNION ALL will be slow for large sets (over cache), >> * CTE needs some more optimalizations, >> * pg core hackers dislike big and complex patches, >> * pg core hackers like any improved current code base. >> >> I am thinking so Grouping Sets based on CTE should be more commitable >> code. It doesn't mean so your ideas are wrong, but these >> optimalization should to work on CTE too. >> >> select * from table group by rollup(a,b,c) >> >> have to have generate same plan as >> >> with q as (select * from table) >> select * from q group by a,b,c >> union all >> select * from q group by a,b >> union all >> select * from q group by a >> union all >> select * from q; >> >> and CTE is more general then Grouping Sets, so it is better do >> optimalization over CTE than Grouping Sets. >> >> Regards >> Pavel Stehule >>> >>> Look for MS SQL: >>> http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx >>> Why MS SQL don't support distinct aggregations? I think - because >>> every distinct aggregations in MS SQL require hash, and many >>> aggregations - it's so slow. >>> >>> Thank you! >>> 2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>: >>>> Hello >>>> >>>> some other info is on http://wiki.postgresql.org/wiki/Grouping_Sets >>>> >>>> Maybe in e few weak I'll have some a prototype based on CTE. My older >>>> technique based on hashed tables should be well, but it carry lot of >>>> unshared code. Using CTE means so we can optimize CTE and GROUPING >>>> SETS together. I plan to transform query >>>> >>>> SELECT * FROM some GROUP BY GROUPING SETS(a, b) >>>> >>>> to >>>> >>>> WITH q AS (SELECT * FROM some) >>>> SELECT * FROM q GROUP BY a >>>> UNION ALL >>>> SELECT * FROM q GROUP BY b >>>> >>>> >>>> 2009/5/10 Олег Царев <zabivator@gmail.com>: >>>>> Hello all. >>>>> Please, approve my ideas for implementation. >>>>> >>>>> Standart has feature T431: Extended grouping capabilities. >>>>> This feature i found in TODO-list: >>>>> http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO >>>>> >>>>> MS SQL 2005 partial support this feature: >>>>> http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx >>>>> http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx >>>>> >>>>> MS SQL 2008 support this feature: >>>>> http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx >>>>> >>>>> Oracle support this feature: >>>>> http://www.compshack.com/sql/oracle-group-rollup >>>>> >>>>> So, it's short notes about GROUPING SETS, but more complete >>>>> information have in a official documentation of MS SQL and Oracle >>>>> (copyright limited for send as attach). >>>>> >>>>> First. GROUPG SETS. >>>>> >>>>> select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A), >>>>> () ) - it's example of use grouping sets. >>>>> Semantic of this construction - make group by over source more, than >>>>> one group of column. >>>>> It's very wide key - A,B C. In result set of this example we can find >>>>> result set of select select A,B,C,SUM(D) from table group by A,B,C - >>>>> as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING >>>>> SETS( (A,B,C), (A), () ) >>>>> Two subset - is GROUP BY A B, and instead C column we look NULL. >>>>> Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name >>>>> "GRAND TOTAL". - calculate over all subset without grouping >>>>> >>>>> Also have function "GROUPING" it's function say about null - "real >>>>> null" (from table) or generated by "GROUP BY GROUPING SETS" >>>>> >>>>> My point: this feature can implement over GROUP BY and UNION ALL >>>>> We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),() >>>>> )" to select A,B,C fron table GROUP BY A,B,C .UNION ALL select >>>>> A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table >>>>> group by(); >>>>> >>>>> So, it's very simple, don't require modification of executor and >>>>> callibrate cost - only parser and semantic anylysis, >>>>> ' >>>>> So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS( >>>>> (A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ), >>>>> CUBE - analogue. >>>>> >>>>> If this idea it's good - i can write code base on old patch >>>>> http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or >>>>> from clean list (as you wish). >>>> >>>> This is suboptimal. When SELECT * FROM X GROUP BY GROUPING SETS ..., >>>> where X is some joined data, then you repeat JOIN on every grouping >>>> set. So your solution is simple for implementation, but it should be >>>> really slow. >>>> >>>> Regards >>>> Pavel Stehule >>>> >>>>> >>>>> In future i know how to implement ROLLUP more optimal (executor >>>>> iterator) and use this ROLLUP for optimisation another GROUP BY, >>>>> GROUPING SETS. >>>>> >>>>> Thanks. >>>>> >>>>> -- >>>>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) >>>>> To make changes to your subscription: >>>>> http://www.postgresql.org/mailpref/pgsql-hackers >>>>> >>>> >>> >> >
Attachment
pgsql-hackers by date: