proposal - GROUPING SETS - Mailing list pgsql-hackers

From Pavel Stehule
Subject proposal - GROUPING SETS
Date
Msg-id 162867790809160327n23dd2f0eyd0f6ff6365afa25@mail.gmail.com
Whole thread Raw
Responses Re: proposal - GROUPING SETS  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hello,

== Proposal - GROUPING SETS ==

a grouping set feature allows multiple grouping clauses in one
query. Result of grouping sets is union of results each groupby
clause's result.

create table t(a int, b int);
insert into t values(10,20);
insert into t values(30,40);

select a, b from t group by grouping sets(a, b);

is same as:

select a, NULL from t group by a
union all
select NULL, b from t group by b;

Note: all ungrouped vars are transformed to NULL

Groupby clause should contains cube and rollup lists.
These are transformed to grouping sets via transformed rules:

create table t1(a int, b int, c int)

group by rollup(a, b, c)
->
group by grouping sets((a,b,c), (a,b), (a), ())

group by cube(a,b,c)
->
group by grouping sets ((a,b,c),(a,b),(a,c), (a), (b,c),                       (b), (c), ())

Groupby clause or grouping sets should contains more sets.
Result is multiplication of these sets:

group by grouping sets(a), grouping sets(b, (b,c), ())
->
group by grouping sets((a,b), (a,b,c), (a))

When grouping sets are used, then we should to use grouping
and grouping_id functions. Function grouping returns 1 when
parameter is in current group set, else returns 0. Function
grouping_id returns value as

grouping_id(a,b,c) =
to_dec(to_bin(grouping(a) || grouping(b) || grouping(c)))

postgres=# select * from t;a  | b  | c
----+----+----10 | 20 | 3010 | 20 | 30
(2 rows)

postgres=# select a,b,c from t group by grouping sets(a,b,c);a  | b  | c
----+----+----10 |    |   | 20 |   |    | 30
(3 rows)

postgres=# select a,b,c, grouping(a), grouping(b), grouping(c),          grouping_id(a,b,c) from t group by grouping
sets(a,b,c);a | b  | c  | grouping | grouping | grouping | grouping_id
 
----+----+----+----------+----------+----------+-------------10 |    |    |        1 |        0 |        0 |
4  | 20 |    |        0 |        1 |        0 |           2   |    | 30 |        0 |        0 |        1 |           1
 
(3 rows)

some real sample:

create table report( inserted date, locality varchar, name varchar, c int);

postgres=# copy report to stdout;
2008-10-10      Prague  Milk    10
2008-10-11      Prague  Milk    12
2008-10-10      Prague  Rum     2
2008-10-11      Prague  Rum     6
2008-10-10      Berlin  Milk    8
2008-10-11      Berlin  Milk    14
2008-10-10      Berlin  Beer    20
2008-10-11      Berlin  Beer    25

postgres=# select * from report; inserted  | locality | name | c
------------+----------+------+----2008-10-10 | Prague   | Milk | 102008-10-11 | Prague   | Milk | 122008-10-10 |
Prague  | Rum  |  22008-10-11 | Prague   | Rum  |  62008-10-10 | Berlin   | Milk |  82008-10-11 | Berlin   | Milk |
142008-10-10| Berlin   | Beer | 202008-10-11 | Berlin   | Beer | 25
 
(8 rows)

postgres=# select inserted, locality, name, sum(c)             from report            group by grouping sets(inserted,
locality,name); inserted  | locality | name | sum
 
------------+----------+------+-----2008-10-10 |          |      |  402008-10-11 |          |      |  57           |
Berlin  |      |  67           | Prague   |      |  30           |          | Milk |  44           |          | Rum  |
8           |          | Beer |  45
 
(7 rows)


postgres=# select inserted, locality, name, sum(c)             from report            group by grouping sets(inserted,
(locality,name)); inserted  | locality | name | sum
 
------------+----------+------+-----2008-10-10 |          |      |  402008-10-11 |          |      |  57           |
Prague  | Milk |  22           | Berlin   | Milk |  22           | Berlin   | Beer |  45           | Prague   | Rum  |
8
 
(6 rows)

postgres=# select inserted, locality, name, sum(c)             from report            group by name, grouping
sets(inserted,locality); inserted  | locality | name | sum
 
------------+----------+------+-----2008-10-11 |          | Rum  |   62008-10-10 |          | Rum  |   22008-10-10 |
     | Beer |  202008-10-11 |          | Beer |  252008-10-10 |          | Milk |  182008-10-11 |          | Milk |  26
         | Prague   | Rum  |   8           | Berlin   | Beer |  45           | Berlin   | Milk |  22           | Prague
 | Milk |  22
 
(10 rows)

== Implementation ==
Grouping sets introduce a new concept into SQL. One readed
tuple is multiple used. It's similar with WITH clause. It's little
bit dificult implement it for current PostgreSQL's executor.

In my prototype I used aux node Feeder. This node should to hold
only one tuple. I add new method for Agg node, that process only
one input tuple:
for (;;){    tuple = feeder->execute(grouping_sets->lefttree);    foreach(l, grouping_sets->subplans)    {
lfirst(l)->process(tuple);   }    if (is_null(tuple))            break;}
 

It supports only HASH Agg nodes. For non hash agg should be used
other method (currently it isn't supported).

postgres=# explain verbose select inserted, locality, name, sum(c)                             from report
             group by name,grouping sets(inserted, locality);                          QUERY PLAN
 
-----------------------------------------------------------------Grouping Sets  (cost=12.00..35.00 rows=400 width=68)
Output:NULL::date, locality, name, sum(c)  ->  Seq Scan on report  (cost=0.00..18.00 rows=800 width=68)        Output:
inserted,locality, name, c  ->  HashAggregate  (cost=6.00..8.50 rows=200 width=68)        Output: inserted,
NULL::charactervarying, name, sum(c)        ->  Feeder  (cost=0.00..0.00 rows=800 width=68)              Output:
inserted,locality, name, c  ->  HashAggregate  (cost=6.00..8.50 rows=200 width=68)        Output: NULL::date, locality,
name,sum(c)        ->  Feeder  (cost=0.00..0.00 rows=800 width=68)              Output: inserted, locality, name, c
 
(12 rows)

After work on prototype I don't see any problems in executor or
parser. I expect some dificulties in planner. With grouping sets
grouping_planner procedure will be much more complex:

1. targetlist and groupclause will be list of list,
2. we should to repeat estimation of NumGroups for each group  set (it's should be shared with CTE feature??).

== Parser problems ==
1. The identifier cube is used in contrib cube.  Solution: CUBE '(' ... ')' generates funcCall and it is transformed
togrouping sets only in groupby clause later.
 

I invite any ideas, notes and help with documentation.

Regards
Pavel Stehule


pgsql-hackers by date:

Previous
From: "Marko Kreen"
Date:
Subject: Re: [Review] fix dblink security hole
Next
From: Simon Riggs
Date:
Subject: Subtransaction commits and Hot Standby