Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities) - Mailing list pgsql-hackers
From | Олег Царев |
---|---|
Subject | Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities) |
Date | |
Msg-id | 54f48e4f0905100841l79595f2ekbfa166e2d586b98a@mail.gmail.com Whole thread Raw |
In response to | Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities) (Pavel Stehule <pavel.stehule@gmail.com>) |
Responses |
Re: Implementation of GROUPING SETS (T431: Extended
grouping capabilities)
|
List | pgsql-hackers |
I will write separated mail about rollup. Can you send me some links or information about "CTE"? What is it? Also, I need some deep knownledge about postgresql aggregation calculation (executor part) - can you help me (links, books, name of source files, etc)?. After get additional information i will can continue discussion. Thanls 2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>: > 2009/5/10 Олег Царев <zabivator@gmail.com>: >> Hello, Pavel. >> >> I read you letter, and found some points: >> >> a) Simple transformation any GROUPING SETS in GROUP BY + UNION ALL >> require clone source for every group. >> It's so slow. >> My point: we can make for start simple implementation. >> b) Your sollution it's so good, IMHO >> WITH q AS (SELECT * FROM some) >> SELECT * FROM q GROUP BY a >> UNION ALL >> SELECT * FROM q GROUP BY b >> But can you describe me how it's used on planner (calculate cost) and exector. > > look on CTE source code or ask to CTE author. > >> Sharing source - it's mind - we don't process tree, we process >> DIRECTED GRAPH. Hard, many bugs, so on, not? > > for example - code for DISTINCT is shared with code for aggregation. > My idea is based on similarity > > GROUPING SETS is subset of CTE, nonrecursive CTE is subset of UNION ALL > >> PostgreSQL support sharing nodes of executor? How? Where i can read? > > I don't know, what do you thing exactly sharing nodes? Before CTE I > had to write special holder node, but it isn't necessary now. > >> Next. Ok, we spool source. After we use simple hash group/sort group + >> union all? Or more advanced techniques? If different group by - it's >> not different from my idea, it's logic additional (spool source for >> fast rewind) > > I thing so trivial implementation via UNION ALL should work, but any > optimisations means lot of work - and when you add rewind > functionality, you will got trivial nonrecursive CTE implementation. > When I wrote patch, there wasn't possible an use CTE, so I played wit > own nodes. Now, the more clean solution is probably based on current > CTE base. > >> >> c) Don't forget also about node requiments - if we use hash table as >> container for source - we reduce some sorting properties - and query >> select A,B from TABLE group by ((A,B),(B)) order by a,b require >> sorting on output of grouping sets. >> It's mind - we need insert sort. > > Yes, probably there should be possible some techniques - that should > to eliminate final sort op. I am not sure if it is really important. > Now I thinking what is the bigger devil a) patch complexity b) some > suboptimality (mainly redundant call of agg nodes). For me a. > Aggregates are not gratis, but significantly more expensive is > repeated seq data scan. So it is first goal. Later we should to thing > about next optimalizations. > >> >> d) We can make SORT ROLLUP and SORT REDUCE ROLLUP. >> first - sort group by use sorting for grouping data and calculate aggregations. >> So, 'order by A,B,C' also 'order by A,B' also order by 'A' also order by '' >> What it's mind? It's mind we can use sort features for implement SORT >> ROLLUP. For every sub group we calculate self aggregates... oh, i bad >> say. Look on "select a,sum(b) from table group by a". Sort group >> grouping by a, and for every a-group calculate aggregations (sum), for >> ROLLUP A,B we can make three aggregations ; for A,B than A than () - >> and use one path on source calculate aggregations on every step, and >> for split source on different subsets. >> It's more perfomance, than different group by and union all. > > Maybe, I don't know. What I know: > > * UNION ALL will have better result on indexed sets and MIN, MAX > aggregates. CTE isn't optimized now for these cases. > * UNION ALL will be slow for large sets (over cache), > * CTE needs some more optimalizations, > * pg core hackers dislike big and complex patches, > * pg core hackers like any improved current code base. > > I am thinking so Grouping Sets based on CTE should be more commitable > code. It doesn't mean so your ideas are wrong, but these > optimalization should to work on CTE too. > > select * from table group by rollup(a,b,c) > > have to have generate same plan as > > with q as (select * from table) > select * from q group by a,b,c > union all > select * from q group by a,b > union all > select * from q group by a > union all > select * from q; > > and CTE is more general then Grouping Sets, so it is better do > optimalization over CTE than Grouping Sets. > > Regards > Pavel Stehule >> >> Look for MS SQL: >> http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx >> Why MS SQL don't support distinct aggregations? I think - because >> every distinct aggregations in MS SQL require hash, and many >> aggregations - it's so slow. >> >> Thank you! >> 2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>: >>> Hello >>> >>> some other info is on http://wiki.postgresql.org/wiki/Grouping_Sets >>> >>> Maybe in e few weak I'll have some a prototype based on CTE. My older >>> technique based on hashed tables should be well, but it carry lot of >>> unshared code. Using CTE means so we can optimize CTE and GROUPING >>> SETS together. I plan to transform query >>> >>> SELECT * FROM some GROUP BY GROUPING SETS(a, b) >>> >>> to >>> >>> WITH q AS (SELECT * FROM some) >>> SELECT * FROM q GROUP BY a >>> UNION ALL >>> SELECT * FROM q GROUP BY b >>> >>> >>> 2009/5/10 Олег Царев <zabivator@gmail.com>: >>>> Hello all. >>>> Please, approve my ideas for implementation. >>>> >>>> Standart has feature T431: Extended grouping capabilities. >>>> This feature i found in TODO-list: >>>> http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO >>>> >>>> MS SQL 2005 partial support this feature: >>>> http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx >>>> http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx >>>> >>>> MS SQL 2008 support this feature: >>>> http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx >>>> >>>> Oracle support this feature: >>>> http://www.compshack.com/sql/oracle-group-rollup >>>> >>>> So, it's short notes about GROUPING SETS, but more complete >>>> information have in a official documentation of MS SQL and Oracle >>>> (copyright limited for send as attach). >>>> >>>> First. GROUPG SETS. >>>> >>>> select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A), >>>> () ) - it's example of use grouping sets. >>>> Semantic of this construction - make group by over source more, than >>>> one group of column. >>>> It's very wide key - A,B C. In result set of this example we can find >>>> result set of select select A,B,C,SUM(D) from table group by A,B,C - >>>> as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING >>>> SETS( (A,B,C), (A), () ) >>>> Two subset - is GROUP BY A B, and instead C column we look NULL. >>>> Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name >>>> "GRAND TOTAL". - calculate over all subset without grouping >>>> >>>> Also have function "GROUPING" it's function say about null - "real >>>> null" (from table) or generated by "GROUP BY GROUPING SETS" >>>> >>>> My point: this feature can implement over GROUP BY and UNION ALL >>>> We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),() >>>> )" to select A,B,C fron table GROUP BY A,B,C .UNION ALL select >>>> A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table >>>> group by(); >>>> >>>> So, it's very simple, don't require modification of executor and >>>> callibrate cost - only parser and semantic anylysis, >>>> ' >>>> So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS( >>>> (A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ), >>>> CUBE - analogue. >>>> >>>> If this idea it's good - i can write code base on old patch >>>> http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or >>>> from clean list (as you wish). >>> >>> This is suboptimal. When SELECT * FROM X GROUP BY GROUPING SETS ..., >>> where X is some joined data, then you repeat JOIN on every grouping >>> set. So your solution is simple for implementation, but it should be >>> really slow. >>> >>> Regards >>> Pavel Stehule >>> >>>> >>>> In future i know how to implement ROLLUP more optimal (executor >>>> iterator) and use this ROLLUP for optimisation another GROUP BY, >>>> GROUPING SETS. >>>> >>>> Thanks. >>>> >>>> -- >>>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) >>>> To make changes to your subscription: >>>> http://www.postgresql.org/mailpref/pgsql-hackers >>>> >>> >> >
pgsql-hackers by date: