Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities) - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)
Date
Msg-id 162867790905100811y335648a3ke588b4fa1619fb0d@mail.gmail.com
Whole thread Raw
In response to Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)  (Олег Царев <zabivator@gmail.com>)
Responses Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)  (Олег Царев <zabivator@gmail.com>)
Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)  (Hitoshi Harada <umi.tanuki@gmail.com>)
List pgsql-hackers
2009/5/10 Олег Царев <zabivator@gmail.com>:
> Hello, Pavel.
>
> I read you letter, and found some points:
>
> a) Simple transformation any GROUPING SETS in GROUP BY + UNION ALL
> require clone source for every group.
> It's so slow.
> My point: we can make for start simple implementation.
> b) Your sollution it's so good, IMHO
> WITH q AS (SELECT * FROM some)
>  SELECT * FROM q GROUP BY a
>  UNION ALL
>  SELECT * FROM q GROUP BY b
> But can you describe me how it's used on planner (calculate cost) and exector.

look on CTE source code or ask to CTE author.

> Sharing source - it's mind - we don't process tree, we process
> DIRECTED GRAPH. Hard, many bugs, so on, not?

for example - code for DISTINCT is shared with code for aggregation.
My idea is based on similarity

GROUPING SETS is subset of CTE, nonrecursive CTE is subset of UNION ALL

> PostgreSQL support sharing nodes of executor? How? Where i can read?

I don't know, what do you thing exactly sharing nodes? Before CTE I
had to write special holder node, but it isn't necessary now.

> Next. Ok, we spool source. After we use simple hash group/sort group +
> union all? Or more advanced techniques?  If different group by - it's
> not different from my idea, it's logic additional (spool source for
> fast rewind)

I thing so trivial implementation via UNION ALL should work, but any
optimisations means lot of work - and when you add rewind
functionality, you will got trivial nonrecursive CTE implementation.
When I wrote patch, there wasn't possible an use CTE, so I played wit
own nodes. Now, the more clean solution is probably based on current
CTE base.

>
> c) Don't forget also about node requiments - if we use hash table as
> container for source -  we reduce some sorting properties - and query
> select A,B from TABLE group by ((A,B),(B)) order by a,b require
> sorting on output of grouping sets.
> It's mind - we need insert sort.

Yes, probably there should be possible some techniques - that should
to eliminate final sort op. I am not sure if it is really important.
Now I thinking what is the bigger devil a) patch complexity b) some
suboptimality  (mainly redundant call of agg nodes). For me a.
Aggregates are not gratis, but significantly more expensive is
repeated seq data scan. So it is first goal. Later we should to thing
about next optimalizations.

>
> d) We can make SORT ROLLUP and SORT REDUCE ROLLUP.
> first - sort group by use sorting for grouping data and calculate aggregations.
> So, 'order by A,B,C' also 'order by A,B' also order by 'A' also order by ''
> What it's mind? It's mind we can use sort features for implement SORT
> ROLLUP. For every sub group we calculate self aggregates... oh, i bad
> say. Look on "select a,sum(b) from table group by a".  Sort group
> grouping by a, and for every a-group calculate aggregations (sum), for
> ROLLUP A,B we can make three aggregations ; for A,B than A than () -
> and use one path on source calculate aggregations on every step, and
> for split source on different subsets.
> It's more perfomance, than different group by and union all.

Maybe, I don't know. What I know:

* UNION ALL will have better result on indexed sets and MIN, MAX
aggregates. CTE isn't optimized now for these cases.
* UNION ALL will be slow for large sets (over cache),
* CTE needs some more optimalizations,
* pg core hackers dislike big and complex patches,
* pg core hackers like any improved current code base.

I am thinking so Grouping Sets based on CTE should be more commitable
code. It doesn't mean so your ideas are wrong, but these
optimalization should to work on CTE too.

select * from table group by rollup(a,b,c)

have to have generate same plan as

with q as (select * from table) select * from q group by a,b,c union all select * from q group by a,b union all select
*from q group by a union all select * from q; 

and CTE is more general then Grouping Sets, so it is better do
optimalization over CTE than Grouping Sets.

Regards
Pavel Stehule
>
> Look for MS SQL:
> http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
> Why MS SQL don't support distinct aggregations? I think - because
> every distinct aggregations in MS SQL require hash, and many
> aggregations - it's so slow.
>
> Thank you!
> 2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>:
>> Hello
>>
>> some other info is on  http://wiki.postgresql.org/wiki/Grouping_Sets
>>
>> Maybe in e few weak I'll have some a prototype based on CTE. My older
>> technique based on hashed tables should be well, but it carry lot of
>> unshared code. Using CTE means so we can optimize CTE and GROUPING
>> SETS together. I plan to transform query
>>
>> SELECT * FROM some GROUP BY GROUPING SETS(a, b)
>>
>> to
>>
>> WITH q AS (SELECT * FROM some)
>>  SELECT * FROM q GROUP BY a
>>  UNION ALL
>>  SELECT * FROM q GROUP BY b
>>
>>
>> 2009/5/10 Олег Царев <zabivator@gmail.com>:
>>> Hello all.
>>> Please, approve my ideas for implementation.
>>>
>>> Standart has feature T431: Extended grouping capabilities.
>>> This feature i found in TODO-list:
>>> http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO
>>>
>>> MS SQL 2005 partial support this feature:
>>> http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
>>> http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx
>>>
>>> MS SQL 2008 support this feature:
>>> http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx
>>>
>>> Oracle support this feature:
>>> http://www.compshack.com/sql/oracle-group-rollup
>>>
>>> So, it's short notes about GROUPING SETS, but more complete
>>> information have in a official documentation of MS SQL and Oracle
>>> (copyright limited for send as attach).
>>>
>>> First. GROUPG SETS.
>>>
>>> select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A),
>>> () ) - it's example of use grouping sets.
>>> Semantic of this construction - make group by over source more, than
>>> one group of column.
>>> It's very wide key - A,B C. In result set of this example we can find
>>> result set of select   select A,B,C,SUM(D) from table group by A,B,C -
>>> as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING
>>> SETS( (A,B,C), (A), () )
>>> Two subset - is GROUP BY A B, and instead C column we look NULL.
>>> Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name
>>> "GRAND TOTAL". - calculate over all subset without grouping
>>>
>>> Also have function "GROUPING"  it's function say about null - "real
>>> null" (from table) or generated by "GROUP BY GROUPING SETS"
>>>
>>> My point: this feature can implement over GROUP BY and UNION ALL
>>> We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),()
>>> )" to  select A,B,C fron table GROUP BY A,B,C .UNION  ALL select
>>> A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table
>>> group by();
>>>
>>> So, it's very simple, don't require modification of executor and
>>> callibrate cost - only parser and semantic anylysis,
>>> '
>>> So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS(
>>> (A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ),
>>> CUBE - analogue.
>>>
>>> If this idea it's good -  i can write code base on old patch
>>> http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or
>>> from clean list (as you wish).
>>
>> This is suboptimal. When SELECT * FROM X GROUP BY GROUPING SETS ...,
>> where X is some joined data, then you repeat JOIN on every grouping
>> set. So your solution is simple for implementation, but it should be
>> really slow.
>>
>> Regards
>> Pavel Stehule
>>
>>>
>>> In future i know how to implement ROLLUP more optimal (executor
>>> iterator) and use this ROLLUP for optimisation another GROUP BY,
>>> GROUPING SETS.
>>>
>>> Thanks.
>>>
>>> --
>>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-hackers
>>>
>>
>


pgsql-hackers by date:

Previous
From: Guillaume Smet
Date:
Subject: Re: SQL state in log_line_prefix
Next
From: Олег Царев
Date:
Subject: Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)