Thread: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Олег Царев
Date:
Hello all.
Please review and approve my ideas for an implementation.

The SQL standard has feature T431: Extended grouping capabilities.
I found this feature in the TODO list:
http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO

MS SQL 2005 partially supports this feature:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx

MS SQL 2008 supports this feature:
http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx

Oracle supports this feature:
http://www.compshack.com/sql/oracle-group-rollup

These are short notes about GROUPING SETS; more complete information is in
the official MS SQL and Oracle documentation (which copyright prevents me
from attaching).

First: GROUPING SETS.

select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A), () ) is
an example of using grouping sets.
The semantics of this construction: group the source by more than one set of
columns in a single query.
The widest key is A,B,C. In the result set of this example we can find the
result set of select A,B,C,SUM(D) from table group by A,B,C as a subset; in
other words, "GROUP BY A,B,C" is a subset of "GROUP BY GROUPING SETS( (A,B,C),
(A), () )".
The second subset is GROUP BY A, where in place of the B and C columns we
see NULL.
The third subset is GROUP BY (), where in place of A,B,C we see NULL; this
single row is called the "GRAND TOTAL" and is calculated over the whole input
without grouping.

There is also a function, GROUPING: it tells whether a NULL is a "real" NULL
(coming from the table) or one generated by "GROUP BY GROUPING SETS".

My point: this feature can be implemented on top of GROUP BY and UNION ALL.
We can decompose "GROUP BY GROUPING SETS( (A,B,C), (A), () )" into:
select A,B,C,SUM(D) from table GROUP BY A,B,C
UNION ALL select A,NULL,NULL,SUM(D) from table GROUP BY A
UNION ALL select NULL,NULL,NULL,SUM(D) from table GROUP BY ();

This is very simple and does not require modifying the executor or the cost
calibration - only the parser and semantic analysis.
ROLLUP(A1,...,An) is then an alias for "GROUP BY GROUPING SETS(
(A1,...,An), (A1,...,An-1), ..., (A1,A2), (A1), () )";
CUBE is analogous.
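
As a small illustrative sketch of those aliases (just the usual expansion,
shown for two columns):

GROUP BY ROLLUP (A, B)
-- is the same as
GROUP BY GROUPING SETS ( (A,B), (A), () )

GROUP BY CUBE (A, B)
-- is the same as
GROUP BY GROUPING SETS ( (A,B), (A), (B), () )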

If this idea is good, I can write code based on the old patch
http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or
start from a clean slate (as you wish).

In the future I know how to implement ROLLUP more optimally (as an executor
iterator) and to use that ROLLUP to optimize other GROUP BY and
GROUPING SETS queries.

Thanks.


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
Hello

Some other info is at http://wiki.postgresql.org/wiki/Grouping_Sets

Maybe in a few weeks I'll have a prototype based on CTE. My older
technique based on hashed tables should work well, but it carries a lot of
unshared code. Using CTE means we can optimize CTE and GROUPING
SETS together. I plan to transform the query

SELECT * FROM some GROUP BY GROUPING SETS(a, b)

to

WITH q AS (SELECT * FROM some)
  SELECT * FROM q GROUP BY a
  UNION ALL
  SELECT * FROM q GROUP BY b


2009/5/10 Олег Царев <zabivator@gmail.com>:
> My point: this feature can be implemented on top of GROUP BY and UNION ALL.
> We can decompose "GROUP BY GROUPING SETS( (A,B,C), (A), () )" into:
> select A,B,C,SUM(D) from table GROUP BY A,B,C
> UNION ALL select A,NULL,NULL,SUM(D) from table GROUP BY A
> UNION ALL select NULL,NULL,NULL,SUM(D) from table GROUP BY ();

This is suboptimal. With SELECT * FROM X GROUP BY GROUPING SETS ...,
where X is some joined data, you repeat the JOIN for every grouping
set. So your solution is simple to implement, but it would be
really slow.
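
A hedged sketch of the problem (table names t1/t2 are invented): with the
naive per-set rewrite, the join under the grouping sets runs once per set,
while the CTE form scans the joined result only once:

-- naive decomposition: the join is executed three times
select a, b, sum(x) from t1 join t2 using (id) group by a, b
union all
select a, null, sum(x) from t1 join t2 using (id) group by a
union all
select null, null, sum(x) from t1 join t2 using (id);

-- CTE form: the join is executed once and its result is reused
with j as (select a, b, x from t1 join t2 using (id))
select a, b, sum(x) from j group by a, b
union all
select a, null, sum(x) from j group by a
union all
select null, null, sum(x) from j;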

Regards
Pavel Stehule



Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Олег Царев
Date:
Hello, Pavel.

I read your letter, and have some points:

a) A simple transformation of any GROUPING SETS into GROUP BY + UNION ALL
requires cloning the source for every group.
That is slow.
My point: we can start with a simple implementation.
b) Your solution is good, IMHO:
WITH q AS (SELECT * FROM some)
  SELECT * FROM q GROUP BY a
  UNION ALL
  SELECT * FROM q GROUP BY b
But can you describe how it is handled by the planner (cost calculation) and the executor?
Sharing the source means we no longer process a tree, we process a
DIRECTED GRAPH. Hard, many bugs, and so on, no?
Does PostgreSQL support sharing executor nodes? How? Where can I read about it?
Next. OK, we spool the source. After that, do we use a simple hash group / sort
group + union all? Or more advanced techniques? If it's just separate GROUP BYs,
that is no different from my idea; it is only additional logic (spooling the
source for fast rewind).

c) Don't forget about node requirements either - if we use a hash table as the
container for the source, we lose some sorting properties, and a query like
select A,B from TABLE group by grouping sets ((A,B),(B)) order by a,b requires
a sort on the output of the grouping sets.
That means we need to insert a sort.
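
A minimal sketch of what (c) means in terms of the decomposition (same A, B
columns, with an invented table t and value column x): the hashed grouping-sets
output is unordered, so the ORDER BY has to become a separate final sort over
the combined result:

select A, B, s
from ( select A, B, sum(x) as s from t group by A, B
       union all
       select null, B, sum(x) from t group by B ) as gs
order by A, B;
-- the sort sits above the union of the per-set results; none of the
-- individual hash-aggregate branches delivers that order for free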

d) We can make a SORT ROLLUP and a SORT REDUCE ROLLUP.
First: a sort-based group by uses sorting to group the data and calculate the aggregations.
Now, data ordered by A,B,C is also ordered by A,B, also ordered by A, and also ordered by ().
What does that mean? It means we can use the sort properties to implement a SORT
ROLLUP. For every sub-group we calculate its own aggregates... oh, I am
explaining it badly. Look at "select a,sum(b) from table group by a". A sort
group groups by a, and for every a-group calculates the aggregations (sum); for
ROLLUP(A,B) we can maintain three aggregations - for (A,B), then (A), then () -
and in one pass over the source calculate the aggregations at every level while
splitting the source into the different subsets.
That performs better than separate GROUP BYs plus UNION ALL.

Look at MS SQL:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
Why doesn't MS SQL support distinct aggregations here? I think it is because
every distinct aggregation in MS SQL requires a hash, and many such
aggregations would be too slow.
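
For reference, the restriction discussed in that article concerns queries of
roughly this shape (SQL Server 2005 syntax; the table is hypothetical and the
exact error text is in the article, not reproduced here):

select a, count(distinct b)
from t
group by a with rollup;
-- per the linked article, SQL Server 2005 cannot compute the DISTINCT
-- aggregate together with ROLLUP/CUBE; a grouping-sets executor would have
-- to de-duplicate b separately for every grouping set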

Thank you!


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
2009/5/10 Олег Царев <zabivator@gmail.com>:
> Hello, Pavel.
>
> I read your letter, and have some points:
>
> a) A simple transformation of any GROUPING SETS into GROUP BY + UNION ALL
> requires cloning the source for every group.
> That is slow.
> My point: we can start with a simple implementation.
> b) Your solution is good, IMHO:
> WITH q AS (SELECT * FROM some)
>  SELECT * FROM q GROUP BY a
>  UNION ALL
>  SELECT * FROM q GROUP BY b
> But can you describe how it is handled by the planner (cost calculation) and the executor?

Look at the CTE source code, or ask the CTE author.

> Sharing the source means we no longer process a tree, we process a
> DIRECTED GRAPH. Hard, many bugs, and so on, no?

For example, the code for DISTINCT is shared with the code for aggregation.
My idea is based on a similarity:

GROUPING SETS is a subset of CTE; a nonrecursive CTE is a subset of UNION ALL.

> Does PostgreSQL support sharing executor nodes? How? Where can I read about it?

I don't know - what exactly do you mean by sharing nodes? Before CTE I
had to write a special holder node, but that isn't necessary now.

> Next. OK, we spool the source. After that, do we use a simple hash group / sort
> group + union all? Or more advanced techniques? If it's just separate GROUP BYs,
> that is no different from my idea; it is only additional logic (spooling the
> source for fast rewind).

I think a trivial implementation via UNION ALL should work, but any
optimisation means a lot of work - and once you add rewind
functionality, you will have arrived at a trivial nonrecursive CTE
implementation. When I wrote my patch, it wasn't yet possible to use CTE, so I
played with my own nodes. Now the cleaner solution is probably to build on the
current CTE base.

>
> c) Don't forget about node requirements either - if we use a hash table as the
> container for the source, we lose some sorting properties, and a query like
> select A,B from TABLE group by grouping sets ((A,B),(B)) order by a,b requires
> a sort on the output of the grouping sets.
> That means we need to insert a sort.

Yes, some techniques should probably be possible to eliminate the final sort
op. I am not sure it is really important.
Right now I am thinking about which is the bigger devil: a) patch complexity,
or b) some suboptimality (mainly redundant calls of agg nodes). For me, a).
Aggregates are not free, but a repeated seq scan of the data is
significantly more expensive. So that is the first goal. Later we can think
about further optimisations.

>
> d) We can make a SORT ROLLUP and a SORT REDUCE ROLLUP.
> First: a sort-based group by uses sorting to group the data and calculate the aggregations.
> Now, data ordered by A,B,C is also ordered by A,B, also ordered by A, and also ordered by ().
> What does that mean? It means we can use the sort properties to implement a SORT
> ROLLUP. For every sub-group we calculate its own aggregates... oh, I am
> explaining it badly. Look at "select a,sum(b) from table group by a". A sort
> group groups by a, and for every a-group calculates the aggregations (sum); for
> ROLLUP(A,B) we can maintain three aggregations - for (A,B), then (A), then () -
> and in one pass over the source calculate the aggregations at every level while
> splitting the source into the different subsets.
> That performs better than separate GROUP BYs plus UNION ALL.

Maybe, I don't know. What I do know:

* UNION ALL will give better results on indexed sets and MIN/MAX
aggregates. CTE isn't optimized for these cases yet.
* UNION ALL will be slow for large sets (bigger than cache),
* CTE needs some more optimisations,
* pg core hackers dislike big and complex patches,
* pg core hackers like anything that improves the current code base.

I am thinking that Grouping Sets based on CTE should be more committable
code. It doesn't mean your ideas are wrong, but these
optimisations should work on CTE too.

select * from table group by rollup(a,b,c)

has to generate the same plan as

with q as (select * from table)
  select * from q group by a,b,c
  union all
  select * from q group by a,b
  union all
  select * from q group by a
  union all
  select * from q;

and CTE is more general than Grouping Sets, so it is better to do the
optimisation over CTE than over Grouping Sets.

Regards
Pavel Stehule


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Олег Царев
Date:
I will write a separate mail about rollup.
Can you send me some links or information about "CTE"? What is it?
Also, I need some deeper knowledge about PostgreSQL aggregate
calculation (the executor part) - can you help me (links, books, names of
source files, etc.)?
After getting this additional information I can continue the discussion.
Thanks


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Hitoshi Harada
Date:
2009/5/11 Pavel Stehule <pavel.stehule@gmail.com>:
> I am thinking that Grouping Sets based on CTE should be more committable
> code. It doesn't mean your ideas are wrong, but these
> optimisations should work on CTE too.
>
> select * from table group by rollup(a,b,c)
>
> has to generate the same plan as
>
> with q as (select * from table)
>  select * from q group by a,b,c
>  union all
>  select * from q group by a,b
>  union all
>  select * from q group by a
>  union all
>  select * from q;
>
> and CTE is more general than Grouping Sets, so it is better to do the
> optimisation over CTE than over Grouping Sets.

If you need to buffer tuples from the outer plan and rescan them
multiple times, a tuplestore seems a more appropriate solution than using a
CTE node, from a semantic point of view. Through the CTE and window functions
development, tuplestore has gained that kind of capability, and the CTE node
is only a wrapper around a tuplestore.

Moreover, I guess you don't even need to buffer tuples to aggregate by
different keys. All you have to do is prepare more than one
hash table (or set up a sort order if the plan detects that a hash table is
too large to fit in memory), and a single seq scan will do. Only the
transition values have to be stored in memory, not the outer plan's
results. That will be a big win in performance.


Regards,

--
Hitoshi Harada


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
2009/5/12 Hitoshi Harada <umi.tanuki@gmail.com>:
>
> If you need to buffer tuples from the outer plan and rescan them
> multiple times, a tuplestore seems a more appropriate solution than using a
> CTE node, from a semantic point of view. Through the CTE and window functions
> development, tuplestore has gained that kind of capability, and the CTE node
> is only a wrapper around a tuplestore.
>
> Moreover, I guess you don't even need to buffer tuples to aggregate by
> different keys. All you have to do is prepare more than one
> hash table (or set up a sort order if the plan detects that a hash table is
> too large to fit in memory), and a single seq scan will do. Only the
> transition values have to be stored in memory, not the outer plan's
> results. That will be a big win in performance.

That was my first solution. But I wanted to prepare a non-hash method as well.
Now I am thinking about a special executor node that fills all the
necessary hash tables in parallel. It's a special variant of hash aggregate.

regards
Pavel Stehule



Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
Hello Oleg

I am sending a new CTE-based variant of my GROUPING SETS patch.

This patch has some bugs, but it is a good prototype (it's more stable
than the old patch):

postgres=# select selling_date, baguette, sum(items) from
baguette_selling group by grouping sets(1,2);
 selling_date | baguette | sum
--------------+----------+-----
 2007-10-30   |          |  17
 2007-10-31   |          |  12
              | golf     |   9
              | buster   |  20
(4 rows)

postgres=# select selling_date, baguette, sum(items),
grouping(selling_date), grouping(baguette), grouping_id(selling_date,
baguette) from baguette_selling group by grouping sets(1,2);
 selling_date | baguette | sum | grouping | grouping | grouping_id
--------------+----------+-----+----------+----------+-------------
 2007-10-30   |          |  17 |        1 |        0 |           2
 2007-10-31   |          |  12 |        1 |        0 |           2
              | golf     |   9 |        0 |        1 |           1
              | buster   |  20 |        0 |        1 |           1
(4 rows)

postgres=# select selling_date, baguette, sum(items),
grouping(selling_date), grouping(baguette), grouping_id(selling_date,
baguette) from baguette_selling group by grouping sets(1,2,());
 selling_date | baguette | sum | grouping | grouping | grouping_id
--------------+----------+-----+----------+----------+-------------
 2007-10-30   |          |  17 |        1 |        0 |           2
 2007-10-31   |          |  12 |        1 |        0 |           2
              | golf     |   9 |        0 |        1 |           1
              | buster   |  20 |        0 |        1 |           1
              |          |  29 |        0 |        0 |           0
(5 rows)

I think the parser part is good and correct (and ported to 8.4).

The CTE works well, but it is not 100% efficient; it would be better to use
the tuplestore interface directly (as a second technique, for when hash tables
can't be used).

I am thinking that the best solution is to enhance the current Aggregate
node to support GroupingSets. Code based on UNION ALL is more or less
equal to the CTE approach, and I don't think it would be optimal. But please
work on it freely. I have no free time for this patch for the next two months,
so if you have time, it's yours.

regards
Pavel Stehule




Attachment

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Олег Царев
Date:
Hello all!
If no one objects (in other words, everyone agrees), I will continue work on
the patch - in particular, I want to support a second strategy (a tuplestore
instead of a hash table) to preserve the order of the source (a cheaper
solution in the case of grouping sets + order by), to investigate and
brainstorm other optimisations, and to write regression tests and technical
documentation.
But I need some time to complete my investigation of PostgreSQL
internals, particularly CTE.

Thanks.



Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Joshua Tolley
Date:
On Tue, May 12, 2009 at 11:20:14PM +0200, Pavel Stehule wrote:
> This patch has some bugs, but it is a good prototype (it's more stable
> than the old patch):

I'm not sure if you're at the point that you're interested in bug reports, but
here's something that didn't behave as expected:

5432 josh@josh*# create table gsettest (prod_id integer, cust_id integer,
quantity integer);
CREATE TABLE
5432 josh@josh*# insert into gsettest select floor(random() * 10)::int,
floor(random() * 20)::int, floor(random() * 10)::int from generate_series(1,
100);
INSERT 0 100
5432 josh@josh*# select prod_id, cust_id, sum(quantity) from gsettest group by
cube (prod_id, cust_id) order by 1, 2;
 prod_id | cust_id | sum
---------+---------+-----
       5 |       7 |   4
       8 |      16 |   3
       9 |      19 |   8
       4 |      13 |   3
       8 |       8 |  15
       5 |       2 |   4
       7 |       6 |   7
       6 |       6 |   3
</snip>

Note that the results aren't sorted. The following, though, works around it:

5432 josh@josh*# select * from (select prod_id, cust_id, sum(quantity) from
gsettest group by cube (prod_id, cust_id)) f order by 1, 2;
 prod_id | cust_id | sum
---------+---------+-----
       0 |       2 |   8
       0 |       4 |   8
       0 |       5 |   2
       0 |       7 |  11
       0 |       8 |   7
       0 |       9 |   1
       0 |      12 |   3
       0 |      14 |   7
       0 |      16 |   5
       0 |      17 |   8
       0 |      18 |   9
       0 |      19 |   2
       0 |         |  71
</snip>

EXPLAIN output is as follows:
5432 josh@josh*# explain select prod_id, cust_id, sum(quantity) from gsettest
group by cube (prod_id, cust_id) order by 1, 2;
                                QUERY PLAN
---------------------------------------------------------------------------
 Append  (cost=193.54..347.71 rows=601 width=9)
   CTE **g**
     ->  Sort  (cost=135.34..140.19 rows=1940 width=12)
           Sort Key: gsettest.prod_id, gsettest.cust_id
           ->  Seq Scan on gsettest  (cost=0.00..29.40 rows=1940 width=12)
   ->  HashAggregate  (cost=53.35..55.85 rows=200 width=12)
         ->  CTE Scan on "**g**"  (cost=0.00..38.80 rows=1940 width=12)
   ->  HashAggregate  (cost=48.50..51.00 rows=200 width=8)
         ->  CTE Scan on "**g**"  (cost=0.00..38.80 rows=1940 width=8)
   ->  HashAggregate  (cost=48.50..51.00 rows=200 width=8)
         ->  CTE Scan on "**g**"  (cost=0.00..38.80 rows=1940 width=8)
   ->  Aggregate  (cost=43.65..43.66 rows=1 width=4)
         ->  CTE Scan on "**g**"  (cost=0.00..38.80 rows=1940 width=4)
(13 rows)

...and without the ORDER BY clause just to prove that it really is the reason
for the Sort step...

5432 josh@josh*# explain select prod_id, cust_id, sum(quantity) from gsettest
group by cube (prod_id, cust_id);
                              QUERY PLAN
------------------------------------------------------------------------
 Append  (cost=82.75..236.92 rows=601 width=9)
   CTE **g**
     ->  Seq Scan on gsettest  (cost=0.00..29.40 rows=1940 width=12)
   ->  HashAggregate  (cost=53.35..55.85 rows=200 width=12)
         ->  CTE Scan on "**g**"  (cost=0.00..38.80 rows=1940 width=12)
   ->  HashAggregate  (cost=48.50..51.00 rows=200 width=8)
         ->  CTE Scan on "**g**"  (cost=0.00..38.80 rows=1940 width=8)
   ->  HashAggregate  (cost=48.50..51.00 rows=200 width=8)
         ->  CTE Scan on "**g**"  (cost=0.00..38.80 rows=1940 width=8)
   ->  Aggregate  (cost=43.65..43.66 rows=1 width=4)
         ->  CTE Scan on "**g**"  (cost=0.00..38.80 rows=1940 width=4)
(11 rows)

I'm hoping I'll get a chance to poke at the patch some. This could be very
useful...

- Josh / eggyknap

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
2009/5/13 Joshua Tolley <eggyknap@gmail.com>:
> On Tue, May 12, 2009 at 11:20:14PM +0200, Pavel Stehule wrote:
>> This patch has some bugs, but it is a good prototype (it's more stable
>> than the old patch):
>
> I'm not sure if you're at the point that you're interested in bug reports, but
> here's something that didn't behave as expected:
>
> 5432 josh@josh*# create table gsettest (prod_id integer, cust_id integer,
> quantity integer);
> CREATE TABLE
> 5432 josh@josh*# insert into gsettest select floor(random() * 10)::int,
> floor(random() * 20)::int, floor(random() * 10)::int from generate_series(1,
> 100);
> INSERT 0 100
> 5432 josh@josh*# select prod_id, cust_id, sum(quantity) from gsettest group by
> cube (prod_id, cust_id) order by 1, 2;
>  prod_id | cust_id | sum
> ---------+---------+-----
>       5 |       7 |   4
>       8 |      16 |   3
>       9 |      19 |   8
>       4 |      13 |   3
>       8 |       8 |  15
>       5 |       2 |   4
>       7 |       6 |   7
>       6 |       6 |   3
> </snip>
>
> Note that the results aren't sorted. The following, though, works around it:

I think the result should not be sorted - it's the same as a normal GROUP BY.

regards
Pavel Stehule





On Tue, May 12, 2009 at 2:21 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> Moreover, I guess you don't even need to buffer tuples to aggregate by
>> different keys. All you have to do is prepare more than one
>> hash table (or set up a sort order if the plan detects that a hash table is
>> too large to fit in memory), and a single seq scan will do. Only the
>> transition values have to be stored in memory, not the outer plan's
>> results. That will be a big win in performance.
>
> That was my first solution. But I wanted to prepare a non-hash method as well.
> Now I am thinking about a special executor node that fills all the
> necessary hash tables in parallel. It's a special variant of hash aggregate.

I think HashAggregate will often be the fastest method of executing
this kind of operation, but it would be nice to have an alternative
(such as repeatedly sorting a tuplestore) to handle non-hashable
datatypes or cases where the HashAggregate would eat too much memory.

But that leads me to a question - does the existing HashAggregate code
make any attempt to obey work_mem?  I know that the infrastructure is
present for HashJoin/Hash, but on a quick pass I didn't notice
anything similar in HashAggregate.

And on a slightly off-topic note for this thread, is there any
compelling reason why we have at least three different hash
implementations in the executor?  HashJoin/Hash uses one for regular
batches and one for the skew batch, and I believe that HashAggregate
does something else entirely.  It seems like it might improve code
maintainability, if nothing else, to unify these to the extent
possible.

...Robert


Robert Haas <robertmhaas@gmail.com> writes:
> But that leads me to a question - does the existing HashAggregate code
> make any attempt to obey work_mem?

No.
        regards, tom lane


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Joshua Tolley
Date:
On Wed, May 13, 2009 at 06:29:41AM +0200, Pavel Stehule wrote:
> I think the result should not be sorted - it's the same as a normal GROUP BY.

Normal GROUP BY wouldn't have ignored the ORDER BY clause I included.

- Josh

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
2009/5/13 Joshua Tolley <eggyknap@gmail.com>:
> On Wed, May 13, 2009 at 06:29:41AM +0200, Pavel Stehule wrote:
>> I think the result should not be sorted - it's the same as a normal GROUP BY.
>
> Normal GROUP BY wouldn't have ignored the ORDER BY clause I included.
>

Sorry, now I understand - it is simply a bug. I have fixed it.

Thank You
Pavel



On Wed, May 13, 2009 at 03:12:51PM +0200, Pavel Stehule wrote:
> sorry, now I understand - simply it is a bug. I fixed it

Where's the new patch?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
Here it is.

I checked the results against Oracle, and the basic results are the same,
with one exception:

this patch doesn't handle grouping sets specified by expressions well.

This result is correct:

postgres=# select selling_date, baguette, canteen, sum(items),
grouping(baguette), grouping(selling_date),
grouping_id(baguette,selling_date) from baguette_selling group by
grouping sets(baguette, selling_date, canteen,());
 selling_date | baguette | canteen | sum | grouping | grouping | grouping_id
--------------+----------+---------+-----+----------+----------+-------------
              | golf     |         |   9 |        0 |        1 |           1
              | buster   |         |  20 |        0 |        1 |           1
 2007-10-30   |          |         |  17 |        1 |        0 |           2
 2007-10-31   |          |         |  12 |        1 |        0 |           2
              |          | Prague  |  14 |        1 |        1 |           3
              |          | Berlin  |  15 |        1 |        1 |           3
              |          |         |  29 |        1 |        1 |           3
(7 rows)

but this result is not:

postgres=# select extract(day from selling_date), selling_date,
baguette, canteen, sum(items), grouping(baguette),
grouping(selling_date), grouping_id(baguette,selling_date) from
baguette_selling group by grouping sets(baguette, selling_date,
canteen, extract(day from selling_date))
;
 date_part | selling_date | baguette | canteen | sum | grouping | grouping | grouping_id
-----------+--------------+----------+---------+-----+----------+----------+-------------
           |              | golf     |         |   9 |        0 |        1 |           1
           |              | buster   |         |  20 |        0 |        1 |           1
        30 | 2007-10-30   |          |         |  17 |        1 |        0 |           2
        31 | 2007-10-31   |          |         |  12 |        1 |        0 |           2
           |              |          | Prague  |  14 |        1 |        1 |           3
           |              |          | Berlin  |  15 |        1 |        1 |           3
           |              |          |         |  29 |        1 |        1 |           3
(7 rows)

The date_part column is problematic.
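For reference, the expression-based set should behave like one more branch
in the usual UNION ALL decomposition - a sketch only, assuming
baguette_selling(selling_date date, baguette text, canteen text, items int)
and leaving out the GROUPING()/GROUPING_ID() columns:

select null as date_part, null as selling_date, baguette, null as canteen, sum(items)
  from baguette_selling group by baguette
union all
select null, selling_date, null, null, sum(items)
  from baguette_selling group by selling_date
union all
select null, null, null, canteen, sum(items)
  from baguette_selling group by canteen
union all
select extract(day from selling_date), null, null, null, sum(items)
  from baguette_selling group by extract(day from selling_date);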

regards
Pavel Stehule




2009/5/13 David Fetter <david@fetter.org>:
> On Wed, May 13, 2009 at 03:12:51PM +0200, Pavel Stehule wrote:
>> 2009/5/13 Joshua Tolley <eggyknap@gmail.com>:
>> > On Wed, May 13, 2009 at 06:29:41AM +0200, Pavel Stehule wrote:
>> >> 2009/5/13 Joshua Tolley <eggyknap@gmail.com>:
>> >> > On Tue, May 12, 2009 at 11:20:14PM +0200, Pavel Stehule wrote:
>> >> >> this patch has some bugs but it is good prototype (it's more stable
>> >> >> than old patch):
>> >> >
>> >> > I'm not sure if you're at the point that you're interested in bug reports, but
>> >> > here's something that didn't behave as expected:
>> >> >
>> >> > 5432 josh@josh*# create table gsettest (prod_id integer, cust_id integer,
>> >> > quantity integer);
>> >> > CREATE TABLE
>> >> > 5432 josh@josh*# insert into gsettest select floor(random() * 10)::int,
>> >> > floor(random() * 20)::int, floor(random() * 10)::int from generate_series(1,
>> >> > 100);
>> >> > INSERT 0 100
>> >> > 5432 josh@josh*# select prod_id, cust_id, sum(quantity) from gsettest group by
>> >> > cube (prod_id, cust_id) order by 1, 2;
>> >> >  prod_id | cust_id | sum
>> >> > ---------+---------+-----
>> >> >       5 |       7 |   4
>> >> >       8 |      16 |   3
>> >> >       9 |      19 |   8
>> >> >       4 |      13 |   3
>> >> >       8 |       8 |  15
>> >> >       5 |       2 |   4
>> >> >       7 |       6 |   7
>> >> >       6 |       6 |   3
>> >> > </snip>
>> >> >
>> >> > Note that the results aren't sorted. The following, though, works around it:
>> >>
>> >> I thing, so result should not be sorted - it's same like normal group by.
>> >
>> > Normal GROUP BY wouldn't have ignored the ORDER BY clause I included.
>>
>> sorry, now I understand - simply it is a bug. I fixed it
>
> Where's the new patch?
>
> Cheers,
> David.
> --
> David Fetter <david@fetter.org> http://fetter.org/
> Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
> Skype: davidfetter      XMPP: david.fetter@gmail.com
>
> Remember to vote!
> Consider donating to Postgres: http://www.postgresql.org/about/donate
>

Attachment

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
Hello,

here is my last grouping sets patch - further work will be done by Oleg.

This patch has the full functionality of grouping sets, including support
for grouping by expressions.

postgres=# select extract(year from a), a, b, sum(c) from foo group by
grouping sets(extract(year from a), a, b, ());
 date_part |     a      | b  | sum
-----------+------------+----+-----
      2001 |            |    |  40
           | 2001-10-12 |    |  20
           | 2001-10-11 |    |  20
           |            | 10 |  40
           |            |    |  40
(5 rows)

postgres=# select * from foo;
     a      | b  | c
------------+----+----
 2001-10-11 | 10 | 20
 2001-10-12 | 10 | 20
(2 rows)

This patch is a WIP: it has the full functionality, but it is based on CTE,
which is a more or less ugly hack. For testing and a first look at grouping
sets, though, it should do a good job.
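For illustration, the rewrite the patch effectively performs for the example
above should look roughly like this - a sketch only, assuming
foo(a date, b int, c int):

with q as (select extract(year from a) as y, a, b, c from foo)
  select y, null as a, null as b, sum(c) from q group by y
  union all
  select null, a, null, sum(c) from q group by a
  union all
  select null, null, b, sum(c) from q group by b
  union all
  select null, null, null, sum(c) from q;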

regards
Pavel Stehule



2009/5/10 Олег Царев <zabivator@gmail.com>:
> I will write separated mail about rollup.
> Can you send me some links or information about "CTE"? What is it?
> Also, I need some deep knownledge about postgresql aggregation
> calculation (executor part) - can you help me (links, books, name of
> source files, etc)?.
> After get additional information i will can continue discussion.
> Thanls
> 2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>:
>> 2009/5/10 Олег Царев <zabivator@gmail.com>:
>>> Hello, Pavel.
>>>
>>> I read you letter, and found some points:
>>>
>>> a) Simple transformation any GROUPING SETS in GROUP BY + UNION ALL
>>> require clone source for every group.
>>> It's so slow.
>>> My point: we can make for start simple implementation.
>>> b) Your sollution it's so good, IMHO
>>> WITH q AS (SELECT * FROM some)
>>>  SELECT * FROM q GROUP BY a
>>>  UNION ALL
>>>  SELECT * FROM q GROUP BY b
>>> But can you describe me how it's used on planner (calculate cost) and exector.
>>
>> look on CTE source code or ask to CTE author.
>>
>>> Sharing source - it's mind - we don't process tree, we process
>>> DIRECTED GRAPH. Hard, many bugs, so on, not?
>>
>> for example - code for DISTINCT is shared with code for aggregation.
>> My idea is based on similarity
>>
>> GROUPING SETS is subset of CTE, nonrecursive CTE is subset of UNION ALL
>>
>>> PostgreSQL support sharing nodes of executor? How? Where i can read?
>>
>> I don't know, what do you thing exactly sharing nodes? Before CTE I
>> had to write special holder node, but it isn't necessary now.
>>
>>> Next. Ok, we spool source. After we use simple hash group/sort group +
>>> union all? Or more advanced techniques?  If different group by - it's
>>> not different from my idea, it's logic additional (spool source for
>>> fast rewind)
>>
>> I thing so trivial implementation via UNION ALL should work, but any
>> optimisations means lot of work - and when you add rewind
>> functionality, you will got trivial nonrecursive CTE implementation.
>> When I wrote patch, there wasn't possible an use CTE, so I played wit
>> own nodes. Now, the more clean solution is probably based on current
>> CTE base.
>>
>>>
>>> c) Don't forget also about node requiments - if we use hash table as
>>> container for source -  we reduce some sorting properties - and query
>>> select A,B from TABLE group by ((A,B),(B)) order by a,b require
>>> sorting on output of grouping sets.
>>> It's mind - we need insert sort.
>>
>> Yes, probably there should be possible some techniques - that should
>> to eliminate final sort op. I am not sure if it is really important.
>> Now I thinking what is the bigger devil a) patch complexity b) some
>> suboptimality  (mainly redundant call of agg nodes). For me a.
>> Aggregates are not gratis, but significantly more expensive is
>> repeated seq data scan. So it is first goal. Later we should to thing
>> about next optimalizations.
>>
>>>
>>> d) We can make SORT ROLLUP and SORT REDUCE ROLLUP.
>>> first - sort group by use sorting for grouping data and calculate aggregations.
>>> So, 'order by A,B,C' also 'order by A,B' also order by 'A' also order by ''
>>> What it's mind? It's mind we can use sort features for implement SORT
>>> ROLLUP. For every sub group we calculate self aggregates... oh, i bad
>>> say. Look on "select a,sum(b) from table group by a".  Sort group
>>> grouping by a, and for every a-group calculate aggregations (sum), for
>>> ROLLUP A,B we can make three aggregations ; for A,B than A than () -
>>> and use one path on source calculate aggregations on every step, and
>>> for split source on different subsets.
>>> It's more perfomance, than different group by and union all.
>>
>> Maybe, I don't know. What I know:
>>
>> * UNION ALL will have better result on indexed sets and MIN, MAX
>> aggregates. CTE isn't optimized now for these cases.
>> * UNION ALL will be slow for large sets (over cache),
>> * CTE needs some more optimalizations,
>> * pg core hackers dislike big and complex patches,
>> * pg core hackers like any improved current code base.
>>
>> I am thinking so Grouping Sets based on CTE should be more commitable
>> code. It doesn't mean so your ideas are wrong, but these
>> optimalization should to work on CTE too.
>>
>> select * from table group by rollup(a,b,c)
>>
>> have to have generate same plan as
>>
>> with q as (select * from table)
>>  select * from q group by a,b,c
>>  union all
>>  select * from q group by a,b
>>  union all
>>  select * from q group by a
>>  union all
>>  select * from q;
>>
>> and CTE is more general then Grouping Sets, so it is better do
>> optimalization over CTE than Grouping Sets.
>>
>> Regards
>> Pavel Stehule
>>>
>>> Look for MS SQL:
>>> http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
>>> Why MS SQL don't support distinct aggregations? I think - because
>>> every distinct aggregations in MS SQL require hash, and many
>>> aggregations - it's so slow.
>>>
>>> Thank you!
>>> 2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>:
>>>> Hello
>>>>
>>>> some other info is on  http://wiki.postgresql.org/wiki/Grouping_Sets
>>>>
>>>> Maybe in e few weak I'll have some a prototype based on CTE. My older
>>>> technique based on hashed tables should be well, but it carry lot of
>>>> unshared code. Using CTE means so we can optimize CTE and GROUPING
>>>> SETS together. I plan to transform query
>>>>
>>>> SELECT * FROM some GROUP BY GROUPING SETS(a, b)
>>>>
>>>> to
>>>>
>>>> WITH q AS (SELECT * FROM some)
>>>>  SELECT * FROM q GROUP BY a
>>>>  UNION ALL
>>>>  SELECT * FROM q GROUP BY b
>>>>
>>>>
>>>> 2009/5/10 Олег Царев <zabivator@gmail.com>:
>>>>> Hello all.
>>>>> Please, approve my ideas for implementation.
>>>>>
>>>>> Standart has feature T431: Extended grouping capabilities.
>>>>> This feature i found in TODO-list:
>>>>> http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO
>>>>>
>>>>> MS SQL 2005 partial support this feature:
>>>>> http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
>>>>> http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx
>>>>>
>>>>> MS SQL 2008 support this feature:
>>>>> http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx
>>>>>
>>>>> Oracle support this feature:
>>>>> http://www.compshack.com/sql/oracle-group-rollup
>>>>>
>>>>> So, it's short notes about GROUPING SETS, but more complete
>>>>> information have in a official documentation of MS SQL and Oracle
>>>>> (copyright limited for send as attach).
>>>>>
>>>>> First. GROUPG SETS.
>>>>>
>>>>> select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A),
>>>>> () ) - it's example of use grouping sets.
>>>>> Semantic of this construction - make group by over source more, than
>>>>> one group of column.
>>>>> It's very wide key - A,B C. In result set of this example we can find
>>>>> result set of select   select A,B,C,SUM(D) from table group by A,B,C -
>>>>> as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING
>>>>> SETS( (A,B,C), (A), () )
>>>>> Two subset - is GROUP BY A B, and instead C column we look NULL.
>>>>> Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name
>>>>> "GRAND TOTAL". - calculate over all subset without grouping
>>>>>
>>>>> Also have function "GROUPING"  it's function say about null - "real
>>>>> null" (from table) or generated by "GROUP BY GROUPING SETS"
>>>>>
>>>>> My point: this feature can implement over GROUP BY and UNION ALL
>>>>> We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),()
>>>>> )" to  select A,B,C fron table GROUP BY A,B,C .UNION  ALL select
>>>>> A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table
>>>>> group by();
>>>>>
>>>>> So, it's very simple, don't require modification of executor and
>>>>> callibrate cost - only parser and semantic anylysis,
>>>>> '
>>>>> So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS(
>>>>> (A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ),
>>>>> CUBE - analogue.
>>>>>
>>>>> If this idea it's good -  i can write code base on old patch
>>>>> http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or
>>>>> from clean list (as you wish).
>>>>
>>>> This is suboptimal. When SELECT * FROM X GROUP BY GROUPING SETS ...,
>>>> where X is some joined data, then you repeat JOIN on every grouping
>>>> set. So your solution is simple for implementation, but it should be
>>>> really slow.
>>>>
>>>> Regards
>>>> Pavel Stehule
>>>>
>>>>>
>>>>> In future i know how to implement ROLLUP more optimal (executor
>>>>> iterator) and use this ROLLUP for optimisation another GROUP BY,
>>>>> GROUPING SETS.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> --
>>>>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>>>>> To make changes to your subscription:
>>>>> http://www.postgresql.org/mailpref/pgsql-hackers
>>>>>
>>>>
>>>
>>
>

Attachment

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Alvaro Herrera
Date:
Олег Царев escribió:
> Hello all!
> If no one objecte (all agree, in other say) i continue work on patch -
> particulary, i want support second strategy (tuple store instead of
> hash-table) for save order of source (more cheap solution in case with
> grouping sets + order by), investigate and brainstorm another
> optimisation, writing regression tests and technical documentation.
> But I need some time for complete my investigation internals of
> PostgreSQL, particulary CTE.

Where are we on this patch?  Is it moving forward?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Hitoshi Harada
Date:
2009/8/8 Alvaro Herrera <alvherre@commandprompt.com>:
> Олег Царев escribió:
>> Hello all!
>> If no one objecte (all agree, in other say) i continue work on patch -
>> particulary, i want support second strategy (tuple store instead of
>> hash-table) for save order of source (more cheap solution in case with
>> grouping sets + order by), investigate and brainstorm another
>> optimisation, writing regression tests and technical documentation.
>> But I need some time for complete my investigation internals of
>> PostgreSQL, particulary CTE.
>
> Where are we on this patch?  Is it moving forward?
>

It seems to me that the patch goes backward.

I looked through the gsets-0.6.diff for about an hour, and found that it is
now only syntactic sugar that builds multiple GROUP BY queries on top of the
CTE functionality. There is no executor modification.

If I remember correctly, the original patch touched the executor.
I'd buy GROUPING SETS if it touched the executor, but not if it is
only syntactic sugar, because then you can write the same thing yourself
without the GROUPING SETS syntax. The motivation for pushing this forward
is, I guess, performance that cannot be achieved by rewriting the query.

Because the GROUP BY we have today is by definition a subset of GROUPING
SETS, I suppose we'll refactor nodeAgg.c so that it can take multiple
group definitions. And we must support both HashAgg and GroupAgg. For
HashAgg it is easier in any case, as the earlier patch shows. For
GroupAgg it is a bit more complicated, since we sort by different key
sets.

When we want GROUPING SETS(a, b), we first sort by a and aggregate,
then sort by b and aggregate. This is the same as:

select a, null, count(*) from x group by a
union all
select null, b, count(*) from x group by b

so it is nothing better than query rewriting, unless we invent something new.

But in the case of subtotals and a grand total, as in a ROLLUP query,
GroupAgg can do it in a single scan by keeping multiple PerGroup states
with different life cycles.

Anyway, before going ahead we need a rough sketch of how to implement
this feature. Is syntactic sugar alone acceptable, or is internal
executor support necessary?


Regards,


--
Hitoshi Harada


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Олег Царев
Date:
2009/8/13 Hitoshi Harada <umi.tanuki@gmail.com>:
> 2009/8/8 Alvaro Herrera <alvherre@commandprompt.com>:
>> Олег Царев escribió:
>>> Hello all!
>>> If no one objecte (all agree, in other say) i continue work on patch -
>>> particulary, i want support second strategy (tuple store instead of
>>> hash-table) for save order of source (more cheap solution in case with
>>> grouping sets + order by), investigate and brainstorm another
>>> optimisation, writing regression tests and technical documentation.
>>> But I need some time for complete my investigation internals of
>>> PostgreSQL, particulary CTE.
>>
>> Where are we on this patch?  Is it moving forward?
>>
>
> It seems to me that the patch goes backward.
>
> I looked trough the gsets-0.6.diff for about an hour, and found it is
> now only a syntax sugar that builds multiple GROUP BY queries based on
> CTE functionality. There's no executor modification.
>
> If I remember correctly, the original patch touched executor parts.
> I'd buy if the GROUPING SETS touches executor but I don't if this is
> only syntax sugar, because you can write it as the same by yourself
> without GROUPING SETS syntax. The motivation we push this forward is
> performance that cannot be made by rewriting query, I guess.
>
> Because GROUP BY we have today is a subset of GROUPING SETS by
> definition, I suppose we'll refactor nodeAgg.c so that it is allowed
> to take multiple group definitions. And we must support both of
> HashAgg and GroupAgg. For HashAgg, it is easier in any case as the
> earlier patch does. For GroupAgg, it is a bit complicated since we
> sort by different key sets.
>
> When we want GROUPING SET(a, b), at first we sort by a and aggregate
> then sort by b and aggregate. This is the same as:
>
> select a, null, count(*) from x group by a
> union all
> select null, b, count(*) from x group by b
>
> so nothing better than query rewriting unless we invent something new.
>
> But in case of sub total and grand total like ROLLUP query, GroupAgg
> can do it by one-time scan by having multiple life cycle PerGroup
> state.
>
> Anyway, before going ahead we need to find rough sketch of how to
> implement this feature. Only syntax sugar is acceptable? Or internal
> executor support is necessary?
>
>
> Regards,
>
>
> --
> Hitoshi Harada
>

I agree with everything except this part:
> Because GROUP BY we have today is a subset of GROUPING SETS by
> definition, I suppose we'll refactor nodeAgg.c so that it is allowed
> to take multiple group definitions. And we must support both of
> HashAgg and GroupAgg. For HashAgg, it is easier in any case as the
> earlier patch does. For GroupAgg, it is a bit complicated since we
> sort by different key sets.
because GROUP BY is an optimized version of GROUPING SETS.
Of course we can extend the current GROUP BY implementation, but we would
regress its performance.
Some questions for you:

How do we calculate the aggregations for ROLLUP in a single pass?
The naive way is to keep a separate aggregation buffer for every group
and feed every record to every one of them. When a group changes, emit
its key to the output set, with NULLs for the columns not contained in
that group, and reset that group's aggregation buffer.
A better way would be to add a "merge aggregations" operation and keep
one buffer per group: when a group changes, merge this "main buffer"
into the others and emit the intermediate result.

I think supporting this kind of grouping operation isn't simple, and a
separate implementation of ROLLUP would be better.

Regards, Tsarev Oleg


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
2009/8/13 Hitoshi Harada <umi.tanuki@gmail.com>:
> 2009/8/8 Alvaro Herrera <alvherre@commandprompt.com>:
>> Олег Царев escribió:
>>> Hello all!
>>> If no one objecte (all agree, in other say) i continue work on patch -
>>> particulary, i want support second strategy (tuple store instead of
>>> hash-table) for save order of source (more cheap solution in case with
>>> grouping sets + order by), investigate and brainstorm another
>>> optimisation, writing regression tests and technical documentation.
>>> But I need some time for complete my investigation internals of
>>> PostgreSQL, particulary CTE.
>>
>> Where are we on this patch?  Is it moving forward?
>>
>
> It seems to me that the patch goes backward.

A little bit.
>
> I looked trough the gsets-0.6.diff for about an hour, and found it is
> now only a syntax sugar that builds multiple GROUP BY queries based on
> CTE functionality. There's no executor modification.
>

I wrote the older version at a time when CTE wasn't implemented yet. The
old patch had its own executor node and was based on creating a hash
table per group. That patch was maybe a little bit faster than 0.6, but
it had a lot of bugs. Anyone who knows the planner code for grouping will
have to agree with me: that code isn't readable, and I didn't want to
make it more complicated and less readable. So I had the idea of joining
grouping sets with CTE. Grouping sets are a subset of CTE, so it is
possible. A grouping sets query is generally a non-recursive CTE, and (I
believe) the two should be optimized together.

I preferred using CTE because it was the shortest way to a prototype with
few bugs and full functionality.

> If I remember correctly, the original patch touched executor parts.
> I'd buy if the GROUPING SETS touches executor but I don't if this is
> only syntax sugar, because you can write it as the same by yourself
> without GROUPING SETS syntax. The motivation we push this forward is
> performance that cannot be made by rewriting query, I guess.
>

I don't think the transformation from grouping sets syntax to CTE is that
simple to do by hand. And what's more: why should grouping sets be
optimized while non-recursive CTE is not?

> Because GROUP BY we have today is a subset of GROUPING SETS by
> definition, I suppose we'll refactor nodeAgg.c so that it is allowed
> to take multiple group definitions. And we must support both of
> HashAgg and GroupAgg. For HashAgg, it is easier in any case as the
> earlier patch does. For GroupAgg, it is a bit complicated since we
> sort by different key sets.
>

This way is possible too, but it needs a complete reworking of the
grouping planner and executor. Maybe it is time to do that, but it is
work for somebody other than me. My place is stored procedures.

> When we want GROUPING SET(a, b), at first we sort by a and aggregate
> then sort by b and aggregate. This is the same as:
>
> select a, null, count(*) from x group by a
> union all
> select null, b, count(*) from x group by b
>
> so nothing better than query rewriting unless we invent something new.
>

the problem is when x is a subquery. Then using CTE is better, because
we don't need to evaluate x repeatedly. The most typical use case is one
where x isn't a plain table.
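For example (a sketch; orders and customers are made-up tables), the plain
UNION ALL rewrite repeats the join once per grouping set, while the CTE
form evaluates it only once:

-- UNION ALL rewrite: the join is executed once per grouping set
select o.cust_id, null as region, sum(o.amount)
  from orders o join customers c on c.id = o.cust_id
 group by o.cust_id
union all
select null, c.region, sum(o.amount)
  from orders o join customers c on c.id = o.cust_id
 group by c.region;

-- CTE rewrite: the join is executed once
with q as (
  select o.cust_id, c.region, o.amount
    from orders o join customers c on c.id = o.cust_id
)
select cust_id, null as region, sum(amount) from q group by cust_id
union all
select null, region, sum(amount) from q group by region;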

> But in case of sub total and grand total like ROLLUP query, GroupAgg
> can do it by one-time scan by having multiple life cycle PerGroup
> state.

>
> Anyway, before going ahead we need to find rough sketch of how to
> implement this feature. Only syntax sugar is acceptable? Or internal
> executor support is necessary?

I think both ways are possible. Probably the cleanest way is a total
refactoring of the grouping executor and grouping planner. I am not sure
we need new nodes - they all exist already - but these nodes cannot work
in parallel now.


>
>
> Regards,
>
>
> --
> Hitoshi Harada
>


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Hitoshi Harada
Date:
2009/8/14 Олег Царев <zabivator@gmail.com>:
> All rights, exclude
>> Because GROUP BY we have today is a subset of GROUPING SETS by
>> definition, I suppose we'll refactor nodeAgg.c so that it is allowed
>> to take multiple group definitions. And we must support both of
>> HashAgg and GroupAgg. For HashAgg, it is easier in any case as the
>> earlier patch does. For GroupAgg, it is a bit complicated since we
>> sort by different key sets.
> because group by it's optimized version of grouping sets.
> Of course, we can extend the current definition of group by, but we
> regress perfomance of it.
> Some questions for you:
>
> How calcualte aggregation on ROLLUP on single pass?

I'd imagine something like this:

select a, b, count(*) from x group by rollup(a, b);

PerGroup all = init_agg(), a = init_agg(), ab = init_agg();
while(row = fetch()){
  if(group_is_changed(ab, row)){
    result_ab = finalize_agg(ab);
    ab = init_agg();
  }
  if(group_is_changed(a, row)){
    result_a = finalize_agg(a);
    a = init_agg();
  }
  advance_agg(all, row);
  advance_agg(a, row);
  advance_agg(ab, row);
}
result_all = finalize_agg(all);

Of course you would have to take care about the best way to return a
result row while continuing the aggregates, and the number of grouping
keys varies from one to many, but it is quite possible. And normal GROUP
BY is just the case of key = a only, so there won't be a performance
regression.

> Better way - add operation "merge aggregations", and calculate one
> buffer on every group, when group has cnahged - merge this "main
> buffer" to other, and return some intermediate result.

"Merge aggregates" sounds fascinating to me in not only this feature
but also partitioned table aggregates. But adding another function
("merge function?") to the current aggregate system is quite far way.

>
> I think, support this of grouping operation isn't simple, and
> different implementation of ROLLUP it's better.

Surely not simple. Adding another node is one of the choices, but from a
code-maintenance point of view I feel it is better to integrate it into
nodeAgg. nodeWindowAgg and nodeAgg have similar aggregate processing but
don't share it, so a bug fix in nodeAgg isn't complete by itself: we must
re-check nodeWindowAgg as well. Adding another agg-like node *may* be a
kind of nightmare.


Regards,

--
Hitoshi Harada


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
2009/8/13 Олег Царев <zabivator@gmail.com>:
> 2009/8/13 Hitoshi Harada <umi.tanuki@gmail.com>:
>> 2009/8/8 Alvaro Herrera <alvherre@commandprompt.com>:
>>> Олег Царев escribió:
>>>> Hello all!
>>>> If no one objecte (all agree, in other say) i continue work on patch -
>>>> particulary, i want support second strategy (tuple store instead of
>>>> hash-table) for save order of source (more cheap solution in case with
>>>> grouping sets + order by), investigate and brainstorm another
>>>> optimisation, writing regression tests and technical documentation.
>>>> But I need some time for complete my investigation internals of
>>>> PostgreSQL, particulary CTE.
>>>
>>> Where are we on this patch?  Is it moving forward?
>>>
>>
>> It seems to me that the patch goes backward.
>>
>> I looked trough the gsets-0.6.diff for about an hour, and found it is
>> now only a syntax sugar that builds multiple GROUP BY queries based on
>> CTE functionality. There's no executor modification.
>>
>> If I remember correctly, the original patch touched executor parts.
>> I'd buy if the GROUPING SETS touches executor but I don't if this is
>> only syntax sugar, because you can write it as the same by yourself
>> without GROUPING SETS syntax. The motivation we push this forward is
>> performance that cannot be made by rewriting query, I guess.
>>
>> Because GROUP BY we have today is a subset of GROUPING SETS by
>> definition, I suppose we'll refactor nodeAgg.c so that it is allowed
>> to take multiple group definitions. And we must support both of
>> HashAgg and GroupAgg. For HashAgg, it is easier in any case as the
>> earlier patch does. For GroupAgg, it is a bit complicated since we
>> sort by different key sets.
>>
>> When we want GROUPING SET(a, b), at first we sort by a and aggregate
>> then sort by b and aggregate. This is the same as:
>>
>> select a, null, count(*) from x group by a
>> union all
>> select null, b, count(*) from x group by b
>>
>> so nothing better than query rewriting unless we invent something new.
>>
>> But in case of sub total and grand total like ROLLUP query, GroupAgg
>> can do it by one-time scan by having multiple life cycle PerGroup
>> state.
>>
>> Anyway, before going ahead we need to find rough sketch of how to
>> implement this feature. Only syntax sugar is acceptable? Or internal
>> executor support is necessary?
>>
>>
>> Regards,
>>
>>
>> --
>> Hitoshi Harada
>>
>
> All rights, exclude
>> Because GROUP BY we have today is a subset of GROUPING SETS by
>> definition, I suppose we'll refactor nodeAgg.c so that it is allowed
>> to take multiple group definitions. And we must support both of
>> HashAgg and GroupAgg. For HashAgg, it is easier in any case as the
>> earlier patch does. For GroupAgg, it is a bit complicated since we
>> sort by different key sets.
> because group by it's optimized version of grouping sets.
> Of course, we can extend the current definition of group by, but we
> regress perfomance of it.
> Some questions for you:
>
> How calcualte aggregation on ROLLUP on single pass?
> Stupid way - store different buffer of aggregations for every group,
> and accumulate every record on group for every calculator. When a
> group has changed, return key of this group to output set with  NULL
> for fields not contains in this group, and restart current buffer of
> aggregation.
> Better way - add operation "merge aggregations", and calculate one
> buffer on every group, when group has cnahged - merge this "main
> buffer" to other, and return some intermediate result.
>

I don't think this is possible for all operations. Don't forget that
people can implement their own aggregates - for example, a weighted
average.

regards
Pavel Stehule

> I think, support this of grouping operation isn't simple, and
> different implementation of ROLLUP it's better.
>
> Regards, Tsarev Oleg
>


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Hitoshi Harada
Date:
2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
> I prefered using CTE, because this way was the most short to small
> bugs less prototype - with full functionality.

You could do it by query rewriting, but as you say, the cleanest way is a
total refactoring of the existing nodeAgg. Ease of implementation alone
is not a convincing argument.

>> When we want GROUPING SET(a, b), at first we sort by a and aggregate
>> then sort by b and aggregate. This is the same as:
>>
>> select a, null, count(*) from x group by a
>> union all
>> select null, b, count(*) from x group by b
>>
>> so nothing better than query rewriting unless we invent something new.
>>
> the problem is when x is subquery. Then is better using CTE, because
> we don't need repeat x evaluation twice. The most typical use case is,
> so x isn't table.

So we need a single-scan aggregate as far as possible. Buffering the
subquery's result is possible without a CTE node; a tuplestore has that
functionality. But I found that the buffered result would be sorted
multiple times, so one way might be to allow tuplesort to perform the
sort multiple times with different keys.



Regards,


-- 
Hitoshi Harada


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
2009/8/13 Hitoshi Harada <umi.tanuki@gmail.com>:
> 2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
>> I prefered using CTE, because this way was the most short to small
>> bugs less prototype - with full functionality.
>
> You could make it by query rewriting, but as you say the best cleanest
> way is total refactoring of existing nodeAgg. How easy to implement is
> not convincing.
>

I agree. I simply don't have the time and energy to do it. I want to
concentrate on finishing some plpgsql issues, and then I have to do some
things other than PostgreSQL. There is a fully functional prototype, and
everybody is welcome to continue this work.


>>> When we want GROUPING SET(a, b), at first we sort by a and aggregate
>>> then sort by b and aggregate. This is the same as:
>>>
>>> select a, null, count(*) from x group by a
>>> union all
>>> select null, b, count(*) from x group by b
>>>
>>> so nothing better than query rewriting unless we invent something new.
>>>
>> the problem is when x is subquery. Then is better using CTE, because
>> we don't need repeat x evaluation twice. The most typical use case is,
>> so x isn't table.
>
> So we need single scan aggregate as far as possible. Buffering
> subquery's result is possible without CTE node. Tuplestore has that
> functionality but I found the buffered result will be sorted multiple
> times, one way might be to allow tuplesort to perform sort multiple
> times with different keys.
>

Yes, I am not afraid of evaluating the aggregates multiple times; that is
cheap. The problem is multiple table scans. I have thought about some new
kind of aggregate. Current aggregates process rows one by one, with a
final operation at the end. Some new kind of aggregate could work over a
tuple store: internally it would get a pointer to the tuplestore and the
number of rows. This should be very fast for functions like median or
array_agg. I think it is similar to window functions - only it is not a
window function.
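(For comparison, a median today has to be built on top of sorting the
whole set - e.g. a sketch like the following, where t and its column x
are hypothetical:)

select avg(x) as median
  from (select x,
               row_number() over (order by x) as rn,
               count(*) over () as cnt
          from t) s
 where rn in ((cnt + 1) / 2, (cnt + 2) / 2);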

If you would like to optimize for speed, then the fastest solution will
be using hash tables - then you don't need a tuplestore. For rollup a
single scan is maybe possible, but I am not sure - there is an important
fact: the final function cannot modify the intermediate data.
Pavel

>
>
> Regards,
>
>
> --
> Hitoshi Harada
>


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Hitoshi Harada
Date:
2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
> 2009/8/13 Hitoshi Harada <umi.tanuki@gmail.com>:
>> 2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
>>> I prefered using CTE, because this way was the most short to small
>>> bugs less prototype - with full functionality.
>>
>> You could make it by query rewriting, but as you say the best cleanest
>> way is total refactoring of existing nodeAgg. How easy to implement is
>> not convincing.
>>
>
> I agree. Simply I am not have time and force do it. I would to
> concentrate on finishing some plpgsql issues, and then I have to do
> some other things than PostgreSQL. There are fully functional
> prototype and everybody is welcome to continue in this work.
>

I see your situation. Actually, your prototype is in good enough shape to
be discussed either way. But since you've been focusing on this feature,
it would be better if you kept your eyes on it.

So, Oleg, will you continue with this?


Regards,


-- 
Hitoshi Harada


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Олег Царев
Date:
2009/8/14 Hitoshi Harada <umi.tanuki@gmail.com>:
> 2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
>> 2009/8/13 Hitoshi Harada <umi.tanuki@gmail.com>:
>>> 2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
>>>> I prefered using CTE, because this way was the most short to small
>>>> bugs less prototype - with full functionality.
>>>
>>> You could make it by query rewriting, but as you say the best cleanest
>>> way is total refactoring of existing nodeAgg. How easy to implement is
>>> not convincing.
>>>
>>
>> I agree. Simply I am not have time and force do it. I would to
>> concentrate on finishing some plpgsql issues, and then I have to do
>> some other things than PostgreSQL. There are fully functional
>> prototype and everybody is welcome to continue in this work.
>>
>
> I see your situation. Actually your prototype is good shape to be
> discussed in both ways. But since you've been focusing on this feature
> it'd be better if you keep your eyes on this.
>
> So, Oleg, do you continue on this?
>
>
> Regards,
>
>
> --
> Hitoshi Harada
>

> I'd imagine such like:
>
> select a, b, count(*) from x group by rollup(a, b);
>
> PerGroup all = init_agg(), a = init_agg(), ab = init_agg();
> while(row = fetch()){
>  if(group_is_changed(ab, row)){
>    result_ab = finalize_agg(ab);
>    ab = init_agg();
>  }
>  if(group_is_changed(a, row)){
>    result_a = finalize_agg(a);
>    a = init_agg();
>  }
>  advance_agg(all, row);
>  advance_agg(a, row);
>  advance_agg(ab, row);
> }
> result_all = finalize_agg(all);
Fun =) My implementation of rollup in the DBMS qd works just as you
imagine there! =)
Also, we used multiple sorts of the source for the CUBE implementation,
but that is hard to maintain (a sort inside group by is bloat).
As a result we had a merged implementation of group by, rollup, and
window functions with some common code - it all comes down to how the
source is grouped.
A hash implementation groups on different hash tables (with different
keys), which is very expensive (it requires a lot of memory for the keys).
I hope to continue my work after the time trouble at my job ends =( (bad
TPC-H performance)


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
2009/8/14 Олег Царев <zabivator@gmail.com>:
> 2009/8/14 Hitoshi Harada <umi.tanuki@gmail.com>:
>> 2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
>>> 2009/8/13 Hitoshi Harada <umi.tanuki@gmail.com>:
>>>> 2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
>>>>> I prefered using CTE, because this way was the most short to small
>>>>> bugs less prototype - with full functionality.
>>>>
>>>> You could make it by query rewriting, but as you say the best cleanest
>>>> way is total refactoring of existing nodeAgg. How easy to implement is
>>>> not convincing.
>>>>
>>>
>>> I agree. Simply I am not have time and force do it. I would to
>>> concentrate on finishing some plpgsql issues, and then I have to do
>>> some other things than PostgreSQL. There are fully functional
>>> prototype and everybody is welcome to continue in this work.
>>>
>>
>> I see your situation. Actually your prototype is good shape to be
>> discussed in both ways. But since you've been focusing on this feature
>> it'd be better if you keep your eyes on this.
>>
>> So, Oleg, do you continue on this?
>>
>>
>> Regards,
>>
>>
>> --
>> Hitoshi Harada
>>
>
>> I'd imagine such like:
>>
>> select a, b, count(*) from x group by rollup(a, b);
>>
>> PerGroup all = init_agg(), a = init_agg(), ab = init_agg();
>> while(row = fetch()){
>>  if(group_is_changed(ab, row)){
>>    result_ab = finalize_agg(ab);
>>    ab = init_agg();
>>  }
>>  if(group_is_changed(a, row)){
>>    result_a = finalize_agg(a);
>>    a = init_agg();
>>  }
>>  advance_agg(all, row);
>>  advance_agg(a, row);
>>  advance_agg(ab, row);
>> }
>> result_all = finalize_agg(all);
> Fun =) My implementation of rollup in DBMS qd work as your imagine there! =)
> Also, multiply sort of source we take for CUBE implementation, but
> this hard for support (sort in group by - it's bloat).
> As result we have merge implementation of group by, rollup, and window
> functions with some common code - it's way for grouping of source,
> Hash implementation group xxx on different hash-tables (with different
> keys) it's very expensive (require many memory for keys).
> I hope continue my work, after end of time trouble on work =( (bad
> TPC-H perfomance)
>

I think you are worrying too much about memory. Look at current postgres:
any hash grouping is faster than sort grouping. Try it and see.
PostgreSQL isn't an embedded database, so using as little memory as
possible is not the main goal. The goal is to have features with clean,
readable and maintainable source code.
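(An easy way to compare the two on your own data, using the existing
planner switch - big_table here is just a placeholder:)

set enable_hashagg = on;
explain analyze select a, count(*) from big_table group by a;

set enable_hashagg = off;   -- forces a sort-based GroupAggregate plan
explain analyze select a, count(*) from big_table group by a;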


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Олег Царев
Date:
After a week-long investigation, I am now sure that my level of
qualification is not enough to implement the "GROUPING SETS" task.
I need documentation about the executor and the planner; I can't work
out how they operate from the source code alone.
There is a lot of code and many cases, but very little information on
"what this is" and "how it works". Maybe I am stupid.
Sorry.

2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
> 2009/8/14 Олег Царев <zabivator@gmail.com>:
>> 2009/8/14 Hitoshi Harada <umi.tanuki@gmail.com>:
>>> 2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
>>>> 2009/8/13 Hitoshi Harada <umi.tanuki@gmail.com>:
>>>>> 2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
>>>>>> I prefered using CTE, because this way was the most short to small
>>>>>> bugs less prototype - with full functionality.
>>>>>
>>>>> You could make it by query rewriting, but as you say the best cleanest
>>>>> way is total refactoring of existing nodeAgg. How easy to implement is
>>>>> not convincing.
>>>>>
>>>>
>>>> I agree. Simply I am not have time and force do it. I would to
>>>> concentrate on finishing some plpgsql issues, and then I have to do
>>>> some other things than PostgreSQL. There are fully functional
>>>> prototype and everybody is welcome to continue in this work.
>>>>
>>>
>>> I see your situation. Actually your prototype is good shape to be
>>> discussed in both ways. But since you've been focusing on this feature
>>> it'd be better if you keep your eyes on this.
>>>
>>> So, Oleg, do you continue on this?
>>>
>>>
>>> Regards,
>>>
>>>
>>> --
>>> Hitoshi Harada
>>>
>>
>>> I'd imagine such like:
>>>
>>> select a, b, count(*) from x group by rollup(a, b);
>>>
>>> PerGroup all = init_agg(), a = init_agg(), ab = init_agg();
>>> while(row = fetch()){
>>>  if(group_is_changed(ab, row)){
>>>    result_ab = finalize_agg(ab);
>>>    ab = init_agg();
>>>  }
>>>  if(group_is_changed(a, row)){
>>>    result_a = finalize_agg(a);
>>>    a = init_agg();
>>>  }
>>>  advance_agg(all, row);
>>>  advance_agg(a, row);
>>>  advance_agg(ab, row);
>>> }
>>> result_all = finalize_agg(all);
>> Fun =) My implementation of rollup in DBMS qd work as your imagine there! =)
>> Also, multiply sort of source we take for CUBE implementation, but
>> this hard for support (sort in group by - it's bloat).
>> As result we have merge implementation of group by, rollup, and window
>> functions with some common code - it's way for grouping of source,
>> Hash implementation group xxx on different hash-tables (with different
>> keys) it's very expensive (require many memory for keys).
>> I hope continue my work, after end of time trouble on work =( (bad
>> TPC-H perfomance)
>>
>
> I thing, so you are afraid too much about memory. Look on current
> postgres. Any hash grouping is faster than sort grouping. Try and see.
> PostgreSQL isn't embeded database. So there are not main goal an using
> less memory. The goal is has features with clean, readable and
> maintainable source code.
>


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Joshua Tolley
Date:
On Thu, Sep 03, 2009 at 01:19:25AM +0400, Олег Царев wrote:
> After week-lengthed investigation, now i 'm sure - my level of
> qualification not enough for implementation task "GROUPING SETS".
> I require documentation about the executor and the planner, i can't
> understand scheme of work by source code.
> Many code, many cases, but very little information "what is it" and
> "how thos work". May be i stupid.

I doubt you're stupid; a stupid person wouldn't know what GROUPING SETS meant,
wouldn't bother finding out, and certainly wouldn't bother trying to implement
it. It's very helpful that you've let us know you're not working on it. That
way Pavel, if he finds he has time and interest, or someone else, can work on
it without fear of conflicting with what you're doing. Thanks for your work;
please don't get discouraged!

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
2009/9/2 Олег Царев <zabivator@gmail.com>:
> After week-lengthed investigation, now i 'm sure - my level of
> qualification not enough for implementation task "GROUPING SETS".
> I require documentation about the executor and the planner, i can't
> understand scheme of work by source code.
> Many code, many cases, but very little information "what is it" and
> "how thos work". May be i stupid.

GROUPING SETS is too hard a place to begin. The grouping planner is not
really readable code. That is the reason why I did GROUPING SETS on top
of CTE, and why I wanted to share code with CTE.

regards
Pavel

> Sorry.
>
> 2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
>> 2009/8/14 Олег Царев <zabivator@gmail.com>:
>>> 2009/8/14 Hitoshi Harada <umi.tanuki@gmail.com>:
>>>> 2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
>>>>> 2009/8/13 Hitoshi Harada <umi.tanuki@gmail.com>:
>>>>>> 2009/8/14 Pavel Stehule <pavel.stehule@gmail.com>:
>>>>>>> I prefered using CTE, because this way was the most short to small
>>>>>>> bugs less prototype - with full functionality.
>>>>>>
>>>>>> You could make it by query rewriting, but as you say the best cleanest
>>>>>> way is total refactoring of existing nodeAgg. How easy to implement is
>>>>>> not convincing.
>>>>>>
>>>>>
>>>>> I agree. Simply I am not have time and force do it. I would to
>>>>> concentrate on finishing some plpgsql issues, and then I have to do
>>>>> some other things than PostgreSQL. There are fully functional
>>>>> prototype and everybody is welcome to continue in this work.
>>>>>
>>>>
>>>> I see your situation. Actually your prototype is good shape to be
>>>> discussed in both ways. But since you've been focusing on this feature
>>>> it'd be better if you keep your eyes on this.
>>>>
>>>> So, Oleg, do you continue on this?
>>>>
>>>>
>>>> Regards,
>>>>
>>>>
>>>> --
>>>> Hitoshi Harada
>>>>
>>>
>>>> I'd imagine such like:
>>>>
>>>> select a, b, count(*) from x group by rollup(a, b);
>>>>
>>>> PerGroup all = init_agg(), a = init_agg(), ab = init_agg();
>>>> while(row = fetch()){
>>>>  if(group_is_changed(ab, row)){
>>>>    result_ab = finalize_agg(ab);
>>>>    ab = init_agg();
>>>>  }
>>>>  if(group_is_changed(a, row)){
>>>>    result_a = finalize_agg(a);
>>>>    a = init_agg();
>>>>  }
>>>>  advance_agg(all, row);
>>>>  advance_agg(a, row);
>>>>  advance_agg(ab, row);
>>>> }
>>>> result_all = finalize_agg(all);
>>> Fun =) My implementation of rollup in DBMS qd work as your imagine there! =)
>>> Also, multiply sort of source we take for CUBE implementation, but
>>> this hard for support (sort in group by - it's bloat).
>>> As result we have merge implementation of group by, rollup, and window
>>> functions with some common code - it's way for grouping of source,
>>> Hash implementation group xxx on different hash-tables (with different
>>> keys) it's very expensive (require many memory for keys).
>>> I hope continue my work, after end of time trouble on work =( (bad
>>> TPC-H perfomance)
>>>
>>
>> I thing, so you are afraid too much about memory. Look on current
>> postgres. Any hash grouping is faster than sort grouping. Try and see.
>> PostgreSQL isn't embeded database. So there are not main goal an using
>> less memory. The goal is has features with clean, readable and
>> maintainable source code.
>>
>


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Pavel Stehule
Date:
2009/9/3 Joshua Tolley <eggyknap@gmail.com>:
> On Thu, Sep 03, 2009 at 01:19:25AM +0400, Олег Царев wrote:
>> After week-lengthed investigation, now i 'm sure - my level of
>> qualification not enough for implementation task "GROUPING SETS".
>> I require documentation about the executor and the planner, i can't
>> understand scheme of work by source code.
>> Many code, many cases, but very little information "what is it" and
>> "how thos work". May be i stupid.
>
> I doubt you're stupid; a stupid person wouldn't know what GROUPING SETS meant,
> wouldn't bother finding out, and certainly wouldn't bother trying to implement
> it. It's very helpful that you've let us know you're not working on it. That
> way Pavel, if he finds he has time and interest, or someone else, can work on
> it without fear of conflicting with what you're doing. Thanks for your work;
> please don't get discouraged!

There are several ways to implement GROUPING SETS. Currently I don't want
to continue on this topic without some sponsoring - it takes too much of
my time, and I am able to implement it only as a special variant of CTE.
The second way is a total refactoring of the grouping planner, which is
beyond me.

regard
Pavel

>
> --
> Joshua Tolley / eggyknap
> End Point Corporation
> http://www.endpoint.com
>
>
>


Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From
Hitoshi Harada
Date:
2009/9/3 Pavel Stehule <pavel.stehule@gmail.com>:
> 2009/9/3 Joshua Tolley <eggyknap@gmail.com>:
>> On Thu, Sep 03, 2009 at 01:19:25AM +0400, Олег Царев wrote:
>>> After week-lengthed investigation, now i 'm sure - my level of
>>> qualification not enough for implementation task "GROUPING SETS".
>>> I require documentation about the executor and the planner, i can't
>>> understand scheme of work by source code.
>>> Many code, many cases, but very little information "what is it" and
>>> "how thos work". May be i stupid.
>>
>> I doubt you're stupid; a stupid person wouldn't know what GROUPING SETS meant,
>> wouldn't bother finding out, and certainly wouldn't bother trying to implement
>> it. It's very helpful that you've let us know you're not working on it. That
>> way Pavel, if he finds he has time and interest, or someone else, can work on
>> it without fear of conflicting with what you're doing. Thanks for your work;
>> please don't get discouraged!
>
> There some ways, how implement GROUPING SETS. Currently I don't would
> to continue on this topic without some sponsoring. It got too my time
> - and I am able to implement it only as some special variant of CTE.
> Second way is total grouping planner refactoring - what is out of me.

There is supposed to be another way to implement it: a "hybrid" of the
CTE approach and the hash-tables approach. As you noticed, in group mode
the planner may be untouchable, which CTE helps to avoid, and in hash
mode you have almost done it. I wouldn't agree to a purely CTE-based
approach, but the hybrid might be reasonable.

Regards,

>
> regard
> Pavel
>
>>
>> --
>> Joshua Tolley / eggyknap
>> End Point Corporation
>> http://www.endpoint.com
>>
>>
>>
>



--
Hitoshi Harada