Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities) - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)
Date
Msg-id 162867790905171250j1b84085buf984545cb1c404b8@mail.gmail.com
Whole thread Raw
In response to Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)  (Олег Царев <zabivator@gmail.com>)
List pgsql-hackers
Hello,

here is last my grouping sets patch - next work will be done by Oleg.

This patch has full functionality of grouping sets - with support for
grouping via expr.

postgres=# select  extract(year from a),a,b, sum(c) from foo group by
grouping sets(extract(year from a), a, b,()); date_part |     a      |
b  | sum
-----------+------------+----+-----
     2001 |            |    |  40
          | 2001-10-12 |    |  20
          | 2001-10-11 |    |  20
          |            | 10 |  40
          |            |    |  40
(5 rows)

postgres=# select * from foo;
     a      | b  | c
------------+----+----
 2001-10-11 | 10 | 20
 2001-10-12 | 10 | 20
(2 rows)

This patch is WIP - has full functionality, but is based on CTE - what
is mas/menos ugly hack. But for testing and first view about grouping
sets, it should do good work.

regards
Pavel Stehule



2009/5/10 Олег Царев <zabivator@gmail.com>:
> I will write separated mail about rollup.
> Can you send me some links or information about "CTE"? What is it?
> Also, I need some deep knownledge about postgresql aggregation
> calculation (executor part) - can you help me (links, books, name of
> source files, etc)?.
> After get additional information i will can continue discussion.
> Thanls
> 2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>:
>> 2009/5/10 Олег Царев <zabivator@gmail.com>:
>>> Hello, Pavel.
>>>
>>> I read you letter, and found some points:
>>>
>>> a) Simple transformation any GROUPING SETS in GROUP BY + UNION ALL
>>> require clone source for every group.
>>> It's so slow.
>>> My point: we can make for start simple implementation.
>>> b) Your sollution it's so good, IMHO
>>> WITH q AS (SELECT * FROM some)
>>>  SELECT * FROM q GROUP BY a
>>>  UNION ALL
>>>  SELECT * FROM q GROUP BY b
>>> But can you describe me how it's used on planner (calculate cost) and exector.
>>
>> look on CTE source code or ask to CTE author.
>>
>>> Sharing source - it's mind - we don't process tree, we process
>>> DIRECTED GRAPH. Hard, many bugs, so on, not?
>>
>> for example - code for DISTINCT is shared with code for aggregation.
>> My idea is based on similarity
>>
>> GROUPING SETS is subset of CTE, nonrecursive CTE is subset of UNION ALL
>>
>>> PostgreSQL support sharing nodes of executor? How? Where i can read?
>>
>> I don't know, what do you thing exactly sharing nodes? Before CTE I
>> had to write special holder node, but it isn't necessary now.
>>
>>> Next. Ok, we spool source. After we use simple hash group/sort group +
>>> union all? Or more advanced techniques?  If different group by - it's
>>> not different from my idea, it's logic additional (spool source for
>>> fast rewind)
>>
>> I thing so trivial implementation via UNION ALL should work, but any
>> optimisations means lot of work - and when you add rewind
>> functionality, you will got trivial nonrecursive CTE implementation.
>> When I wrote patch, there wasn't possible an use CTE, so I played wit
>> own nodes. Now, the more clean solution is probably based on current
>> CTE base.
>>
>>>
>>> c) Don't forget also about node requiments - if we use hash table as
>>> container for source -  we reduce some sorting properties - and query
>>> select A,B from TABLE group by ((A,B),(B)) order by a,b require
>>> sorting on output of grouping sets.
>>> It's mind - we need insert sort.
>>
>> Yes, probably there should be possible some techniques - that should
>> to eliminate final sort op. I am not sure if it is really important.
>> Now I thinking what is the bigger devil a) patch complexity b) some
>> suboptimality  (mainly redundant call of agg nodes). For me a.
>> Aggregates are not gratis, but significantly more expensive is
>> repeated seq data scan. So it is first goal. Later we should to thing
>> about next optimalizations.
>>
>>>
>>> d) We can make SORT ROLLUP and SORT REDUCE ROLLUP.
>>> first - sort group by use sorting for grouping data and calculate aggregations.
>>> So, 'order by A,B,C' also 'order by A,B' also order by 'A' also order by ''
>>> What it's mind? It's mind we can use sort features for implement SORT
>>> ROLLUP. For every sub group we calculate self aggregates... oh, i bad
>>> say. Look on "select a,sum(b) from table group by a".  Sort group
>>> grouping by a, and for every a-group calculate aggregations (sum), for
>>> ROLLUP A,B we can make three aggregations ; for A,B than A than () -
>>> and use one path on source calculate aggregations on every step, and
>>> for split source on different subsets.
>>> It's more perfomance, than different group by and union all.
>>
>> Maybe, I don't know. What I know:
>>
>> * UNION ALL will have better result on indexed sets and MIN, MAX
>> aggregates. CTE isn't optimized now for these cases.
>> * UNION ALL will be slow for large sets (over cache),
>> * CTE needs some more optimalizations,
>> * pg core hackers dislike big and complex patches,
>> * pg core hackers like any improved current code base.
>>
>> I am thinking so Grouping Sets based on CTE should be more commitable
>> code. It doesn't mean so your ideas are wrong, but these
>> optimalization should to work on CTE too.
>>
>> select * from table group by rollup(a,b,c)
>>
>> have to have generate same plan as
>>
>> with q as (select * from table)
>>  select * from q group by a,b,c
>>  union all
>>  select * from q group by a,b
>>  union all
>>  select * from q group by a
>>  union all
>>  select * from q;
>>
>> and CTE is more general then Grouping Sets, so it is better do
>> optimalization over CTE than Grouping Sets.
>>
>> Regards
>> Pavel Stehule
>>>
>>> Look for MS SQL:
>>> http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
>>> Why MS SQL don't support distinct aggregations? I think - because
>>> every distinct aggregations in MS SQL require hash, and many
>>> aggregations - it's so slow.
>>>
>>> Thank you!
>>> 2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>:
>>>> Hello
>>>>
>>>> some other info is on  http://wiki.postgresql.org/wiki/Grouping_Sets
>>>>
>>>> Maybe in e few weak I'll have some a prototype based on CTE. My older
>>>> technique based on hashed tables should be well, but it carry lot of
>>>> unshared code. Using CTE means so we can optimize CTE and GROUPING
>>>> SETS together. I plan to transform query
>>>>
>>>> SELECT * FROM some GROUP BY GROUPING SETS(a, b)
>>>>
>>>> to
>>>>
>>>> WITH q AS (SELECT * FROM some)
>>>>  SELECT * FROM q GROUP BY a
>>>>  UNION ALL
>>>>  SELECT * FROM q GROUP BY b
>>>>
>>>>
>>>> 2009/5/10 Олег Царев <zabivator@gmail.com>:
>>>>> Hello all.
>>>>> Please, approve my ideas for implementation.
>>>>>
>>>>> Standart has feature T431: Extended grouping capabilities.
>>>>> This feature i found in TODO-list:
>>>>> http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO
>>>>>
>>>>> MS SQL 2005 partial support this feature:
>>>>> http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
>>>>> http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx
>>>>>
>>>>> MS SQL 2008 support this feature:
>>>>> http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx
>>>>>
>>>>> Oracle support this feature:
>>>>> http://www.compshack.com/sql/oracle-group-rollup
>>>>>
>>>>> So, it's short notes about GROUPING SETS, but more complete
>>>>> information have in a official documentation of MS SQL and Oracle
>>>>> (copyright limited for send as attach).
>>>>>
>>>>> First. GROUPG SETS.
>>>>>
>>>>> select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A),
>>>>> () ) - it's example of use grouping sets.
>>>>> Semantic of this construction - make group by over source more, than
>>>>> one group of column.
>>>>> It's very wide key - A,B C. In result set of this example we can find
>>>>> result set of select   select A,B,C,SUM(D) from table group by A,B,C -
>>>>> as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING
>>>>> SETS( (A,B,C), (A), () )
>>>>> Two subset - is GROUP BY A B, and instead C column we look NULL.
>>>>> Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name
>>>>> "GRAND TOTAL". - calculate over all subset without grouping
>>>>>
>>>>> Also have function "GROUPING"  it's function say about null - "real
>>>>> null" (from table) or generated by "GROUP BY GROUPING SETS"
>>>>>
>>>>> My point: this feature can implement over GROUP BY and UNION ALL
>>>>> We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),()
>>>>> )" to  select A,B,C fron table GROUP BY A,B,C .UNION  ALL select
>>>>> A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table
>>>>> group by();
>>>>>
>>>>> So, it's very simple, don't require modification of executor and
>>>>> callibrate cost - only parser and semantic anylysis,
>>>>> '
>>>>> So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS(
>>>>> (A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ),
>>>>> CUBE - analogue.
>>>>>
>>>>> If this idea it's good -  i can write code base on old patch
>>>>> http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or
>>>>> from clean list (as you wish).
>>>>
>>>> This is suboptimal. When SELECT * FROM X GROUP BY GROUPING SETS ...,
>>>> where X is some joined data, then you repeat JOIN on every grouping
>>>> set. So your solution is simple for implementation, but it should be
>>>> really slow.
>>>>
>>>> Regards
>>>> Pavel Stehule
>>>>
>>>>>
>>>>> In future i know how to implement ROLLUP more optimal (executor
>>>>> iterator) and use this ROLLUP for optimisation another GROUP BY,
>>>>> GROUPING SETS.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> --
>>>>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>>>>> To make changes to your subscription:
>>>>> http://www.postgresql.org/mailpref/pgsql-hackers
>>>>>
>>>>
>>>
>>
>

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: generate_series from now to infinity...
Next
From: Simon Riggs
Date:
Subject: Re: Optimizing Read-Only Scalability