[HACKERS] PoC: Grouped base relation - Mailing list pgsql-hackers

Attached is a draft patch that lets partial aggregation happen at base
relation level. If the relations contain relatively small number of groups,
the number of input rows of the aggregation at the query level can be reduced
this way.  Also, if append relation and postgres_fdw planning is enhanced
accordingly, patch like this can let us aggregate individual tables on remote
servers (e.g. shard nodes) and thus reduce the amount of rows subject to the
final aggregation.

For example, consider query

SELECT b.j, sum(a.x) FROM a, b WHERE a.i = b.j GROUP BY b.j;

and tables "a"

 i | x
-------
 1 | 3
 1 | 4

and "b"

 j
---
 1
 1

The base relations grouped look like

 i | sum(a.x)| count(*)
-----------------------
 1 |       7 |       2

and

 j | count(*)
-------------
 1 |       2


A few "footnotes":

* Equivalence class {a.i, b.j} tells that "a" can be grouped by "i", besides
the grouping of "b" which is explicitly required by GROUP BY b.j clause.)

* To transfer the aggregate results to upper nodes, I introduced a concept of
"grouped variable". Base relation has special target which the planner uses to
generate "grouped paths". The grouped target contains one grouped variable per
aggregate that the relation computes. During final processing of the plan
(setrefs.c), the corresponding (partial) aggregate is restored in the query
target if needed - typically this happens to ensure that the final aggregate
references the output of the partial aggregate.

* So far the grouped variable is only used for aggregates, but it might be
useful for grouping expressions in the future as well. Currently the base
relation can only be grouped by a plain Var, but it might be worth grouping it
by generic grouping expressions of the GROUP BY clause, and using the grouped
var mechanism to propagate the expression value to the query target.


As for the example, the processing continues by joining the partially grouped
sets:

 i | sum(x)| count(i.*) | j | count(j.*)
----------------------------------------
 1 |     7 |       2    | 1 |       3


Before performing the final aggregation, we need to multiply sum(a.x) by
count(j.*) because w/o the aggregation at base relation level the input
of the query-level aggregation would look like

 a.i | a.x | b.j
----------------
 1   |   3 |  1
 1   |   4 |  1
 1   |   3 |  1
 1   |   4 |  1

In other words, grouping of the base relation "b" below the join prevents the
join from bringing per-group input set to the aggregate input multiple
times. To compensate for this effect, I've added a new field "aggtransmultifn"
to pg_aggregate catalog. It "multiplies" the aggregate state w/o repeated
processing of the same input set many times. sum() is an example of an
aggregate that needs such processing, avg() is one that does not.

The example query can eventually produce plans like

                      QUERY PLAN
------------------------------------------------------
 Finalize HashAggregate
   Group Key: b.j
   ->  Gather
         Workers Planned: 2
         ->  Hash Join
               Hash Cond: (a.i = b.j)
               ->  Partial HashAggregate
                     Group Key: a.i
                     ->  Parallel Seq Scan on a
               ->  Hash
                     ->  Partial HashAggregate
                           Group Key: b.j
                           ->  Parallel Seq Scan on b

or

                      QUERY PLAN
------------------------------------------------------
 Finalize HashAggregate
   Group Key: b.j
   ->  Hash Join
         Hash Cond: (a.i = b.j)
         ->  Gather
               Workers Planned: 2
               ->  Partial HashAggregate
                     Group Key: a.i
                     ->  Parallel Seq Scan on a
         ->  Hash
               ->  Gather
                     Workers Planned: 1
                     ->  Partial HashAggregate
                           Group Key: b.j
                           ->  Parallel Seq Scan on b


An obvious limitation is that neither grouping expression nor aggregate
argument can be below the nullable side of outer join. In such a case the
aggregate at the base relation level wouldn't receive the NULL values that it
does receive at the query level. Also, no aggregate can reference multiple
tables.

Does this concept seem worth to continue coding?


BTW, if anyone wants to play with the current version:

1. Don't forget to initialize a new cluster (initdb) first. I decided not to
bump CATALOG_VERSION_NO so far because it'd probably make the patch
incompatible with master branch quite soon.

2. Only hash aggregation is implemented so far at the base relation level.

3. As for sum() aggregate, only sum(float4) is supposed to work correctly so
far - this is related to the pg_aggregate changes mentioned above. avg()
should work in general, and I didn't care about the other ones so far.

4. As for joins, only hash join is able to process the grouped relations. I
didn't want to do too much coding until there's a consensus on the design.


--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: [HACKERS] RustgreSQL
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] UNDO and in-place update