[HACKERS] PoC: Grouped base relation - Mailing list pgsql-hackers
From | Antonin Houska |
---|---|
Subject | [HACKERS] PoC: Grouped base relation |
Date | |
Msg-id | 29111.1483984605@localhost Whole thread Raw |
Responses |
Re: [HACKERS] PoC: Grouped base relation
Re: [HACKERS] PoC: Grouped base relation Re: [HACKERS] PoC: Grouped base relation Re: [HACKERS] PoC: Grouped base relation |
List | pgsql-hackers |
Attached is a draft patch that lets partial aggregation happen at base relation level. If the relations contain relatively small number of groups, the number of input rows of the aggregation at the query level can be reduced this way. Also, if append relation and postgres_fdw planning is enhanced accordingly, patch like this can let us aggregate individual tables on remote servers (e.g. shard nodes) and thus reduce the amount of rows subject to the final aggregation. For example, consider query SELECT b.j, sum(a.x) FROM a, b WHERE a.i = b.j GROUP BY b.j; and tables "a" i | x ------- 1 | 3 1 | 4 and "b" j --- 1 1 The base relations grouped look like i | sum(a.x)| count(*) ----------------------- 1 | 7 | 2 and j | count(*) ------------- 1 | 2 A few "footnotes": * Equivalence class {a.i, b.j} tells that "a" can be grouped by "i", besides the grouping of "b" which is explicitly required by GROUP BY b.j clause.) * To transfer the aggregate results to upper nodes, I introduced a concept of "grouped variable". Base relation has special target which the planner uses to generate "grouped paths". The grouped target contains one grouped variable per aggregate that the relation computes. During final processing of the plan (setrefs.c), the corresponding (partial) aggregate is restored in the query target if needed - typically this happens to ensure that the final aggregate references the output of the partial aggregate. * So far the grouped variable is only used for aggregates, but it might be useful for grouping expressions in the future as well. Currently the base relation can only be grouped by a plain Var, but it might be worth grouping it by generic grouping expressions of the GROUP BY clause, and using the grouped var mechanism to propagate the expression value to the query target. As for the example, the processing continues by joining the partially grouped sets: i | sum(x)| count(i.*) | j | count(j.*) ---------------------------------------- 1 | 7 | 2 | 1 | 3 Before performing the final aggregation, we need to multiply sum(a.x) by count(j.*) because w/o the aggregation at base relation level the input of the query-level aggregation would look like a.i | a.x | b.j ---------------- 1 | 3 | 1 1 | 4 | 1 1 | 3 | 1 1 | 4 | 1 In other words, grouping of the base relation "b" below the join prevents the join from bringing per-group input set to the aggregate input multiple times. To compensate for this effect, I've added a new field "aggtransmultifn" to pg_aggregate catalog. It "multiplies" the aggregate state w/o repeated processing of the same input set many times. sum() is an example of an aggregate that needs such processing, avg() is one that does not. The example query can eventually produce plans like QUERY PLAN ------------------------------------------------------ Finalize HashAggregate Group Key: b.j -> Gather Workers Planned: 2 -> Hash Join Hash Cond: (a.i = b.j) -> Partial HashAggregate Group Key: a.i -> Parallel Seq Scan on a -> Hash -> Partial HashAggregate Group Key: b.j -> Parallel Seq Scan on b or QUERY PLAN ------------------------------------------------------ Finalize HashAggregate Group Key: b.j -> Hash Join Hash Cond: (a.i = b.j) -> Gather Workers Planned: 2 -> Partial HashAggregate Group Key: a.i -> Parallel Seq Scan on a -> Hash -> Gather Workers Planned: 1 -> Partial HashAggregate Group Key: b.j -> Parallel Seq Scan on b An obvious limitation is that neither grouping expression nor aggregate argument can be below the nullable side of outer join. In such a case the aggregate at the base relation level wouldn't receive the NULL values that it does receive at the query level. Also, no aggregate can reference multiple tables. Does this concept seem worth to continue coding? BTW, if anyone wants to play with the current version: 1. Don't forget to initialize a new cluster (initdb) first. I decided not to bump CATALOG_VERSION_NO so far because it'd probably make the patch incompatible with master branch quite soon. 2. Only hash aggregation is implemented so far at the base relation level. 3. As for sum() aggregate, only sum(float4) is supposed to work correctly so far - this is related to the pg_aggregate changes mentioned above. avg() should work in general, and I didn't care about the other ones so far. 4. As for joins, only hash join is able to process the grouped relations. I didn't want to do too much coding until there's a consensus on the design. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: