Re: [HACKERS] WIP: Aggregation push-down - Mailing list pgsql-hackers
From: Antonin Houska
Subject: Re: [HACKERS] WIP: Aggregation push-down
Msg-id: 10529.1547561178@localhost
In response to: Re: [HACKERS] WIP: Aggregation push-down (Antonin Houska <ah@cybertec.at>)
Responses: Re: [HACKERS] WIP: Aggregation push-down
List: pgsql-hackers
This is the next version. A few more comments below.

Antonin Houska <ah@cybertec.at> wrote:

> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Antonin Houska <ah@cybertec.at> writes:
>
> Thanks!
>
> > Conceptually, we'd like to be able to push aggregation down below joins
> > when that yields a win, which it could do by reducing the number of rows
> > that have to be processed at the join stage. Thus consider
> >
> >     SELECT agg(a.x) FROM a, b WHERE a.y = b.z;
> >
> > We can't simply aggregate during the scan of "a", because some values of
> > "x" might not appear at all in the input of the naively-computed aggregate
> > (if their rows have no join partner in "b"), or might appear multiple
> > times (if their rows have multiple join partners). So at first glance,
> > aggregating before the join is impossible. The key insight of the patch
> > is that we can make some progress by considering only grouped aggregation:
>
> Yes, the patch does not handle queries with no GROUP BY clause. Primarily
> because I don't know how the grouped "a" table in this example could emit
> the "a.y" column w/o using it as a grouping key.
>
> > if we can group "a" into sets of rows that must all have the same join
> > partners, then we can do preliminary aggregation within each such group,
> > and take care of the number-of-repetitions problem when we join. If the
> > groups are large then this reduces the number of rows processed by the
> > join, at the cost that we might spend time computing the aggregate for
> > some row sets that will just be discarded by the join. So now we consider
> >
> >     SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?;
> >
> > and ask what properties the grouping column(s) "?" must have to make this
> > work. I believe we can say that it's OK to push down through a join if
> > its join clauses are all of the form "a.y = b.z", where either a.y or b.z
> > is listed as a GROUP BY column *and* the join operator is equality for
> > the btree opfamily specified by the SortGroupClause.

I think the requirement for the join clauses to be of the form "a.y = b.z"
would only be valid if we wanted to avoid the 2-stage aggregation. For
example, in this query

    SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY b.z

uniqueness of "b.z" ensures that the groups resulting from the pushed-down
aggregation (SELECT a.y, agg(a.x) FROM a GROUP BY a.y) do not get duplicated
by the join. However, if there's a consensus that the 2-stage aggregation
will be used (because avoiding it would limit the usefulness of the feature
too much), I think we can accept generic join clauses. (Note that the patch
does not try to push the aggregate down below the nullable side of an outer
join - handling of the generated NULL values by join clauses would be another
problem.)

> > The concerns Robert had about equality semantics are certainly vital, but
> > I believe that the logic goes through correctly as long as the grouping
> > equality operator and the join equality operator belong to the same
> > opfamily.
>
> ok, generic approach like this is always better. I just could not devise it,
> nor can I prove that your theory is wrong.
>
> > Robert postulated cases like "citext = text", but that doesn't apply
> > here because no cross-type citext = text operator could be part of a
> > well-behaved opfamily.
>
> The reason I can see now is that the GROUP BY operator (opfamily)
> essentially has the grouping column on both sides, so it does not match the
> cross-type operator.
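To make the 2-stage variant concrete, here is a hand-written SQL equivalent
of what the pushed-down plan would compute for the query above. This is only
an illustration, not the planner's output; sum() stands in for agg() because
its partial states can be combined by another sum(), and the aliases a_grp
and psum are mine:

    -- illustration only; tables a(x, y) and b(z) as in the example above
    SELECT b.z, sum(a_grp.psum)        -- final aggregation above the join
    FROM (
        SELECT y, sum(x) AS psum       -- partial aggregation of "a",
        FROM a                         -- grouped by the join column a.y
        GROUP BY y
    ) AS a_grp
    JOIN b ON a_grp.y = b.z
    GROUP BY b.z;

If an "a" group joins to several "b" rows, its partial result is emitted once
per match - the same duplication the original join would have produced row by
row - and the final aggregation then combines the duplicates per b.z value.
If b.z happens to be unique, the final step merely relabels the groups.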
Even if my statement that the join clause can be of a kind other than
"<leftarg> = <rightarg>" was right, it might still be worth insisting on
"<leftarg> <op> <rightarg>" where <op> is of the same operator family as the
grouping operator?

> > Also, Tomas complained in the earlier thread that he didn't think
> > grouping on the join column was a very common use-case in the first
> > place. I share that concern, but I think we could extend the logic
> > to the case that Tomas posited as being useful:
> >
> >     SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z;
> >
> > where the join column b.id is unique. If we group on a.y (using semantics
> > compatible with the join operator and the uniqueness constraint), then all
> > "a" rows in a given group will join to exactly one "b" row that
> > necessarily has exactly one grouping value, so this group can safely be
> > aggregated together. We might need to combine it post-join with other "b"
> > rows that have equal "z" values, but we can do that as long as we're okay
> > with partial aggregation. I think this example shows why the idea is far
> > more powerful with partial aggregation than without.
>
> Good idea. I'll try to incorporate it in the next version.

The previous version of the patch (before I removed the partial aggregation)
could actually handle this case, so I restored it. It just does not check the
uniqueness of the b.id column, because the final aggregation eliminates the
duplicated groups anyway.

The patch views the problem from this perspective: "a.y" is needed by the
join clause, and the only way to retrieve it from the partially aggregated
relation "a" is to use it as a grouping column. Such an "artificial" grouping
expression can only be used if the plan contains the final aggregate node,
which uses only the original GROUP BY clause.

> > I also have no use for the stuff depending on bitwise equality rather
> > than the user-visible operators; that's just a hack, and an ugly one.
>
> The purpose of this was to avoid aggregation push-down for data types like
> numeric, see the example
>
>     SELECT one.a
>     FROM one, two
>     WHERE one.a = two.a AND scale(two.a) > 1
>     GROUP BY 1
>
> in [2]. The problem is that aggregation can make some information lost, which
> is needed above by WHERE/ON clauses. I appreciate any suggestions how to
> address this problem.

The new patch version does not contain this hack, but the problem is still
open. Alternatively, I'm thinking of a boolean flag to be added to the
pg_opfamily or pg_opclass catalog, saying whether equality also means
"bitwise equality". If the value is false (the default) for the GROUP BY
operator, no aggregate push-down can take place.
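To show the kind of information loss meant here (a plain psql illustration,
nothing from the patch): numeric values that are equal under the btree
opfamily are not bitwise equal, and expressions like scale() can tell them
apart:

    SELECT 1.0::numeric = 1.00::numeric;   -- true: equal for grouping purposes
    SELECT scale(1.0::numeric);            -- 1
    SELECT scale(1.00::numeric);           -- 2

So if rows of "two" with the values 1.0 and 1.00 got collapsed into a single
group below the join, a clause like scale(two.a) > 1 evaluated above that
aggregation would only see the group's representative value, not the
individual ones.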
> > I think the right representation is to create UPPERREL_GROUP_AGG
> > RelOptInfos whose relids show the set of baserels joined before
> > aggregating. Currently there's only one of those and its relids is equal
> > to all_baserels (or should be, anyway). This patch would add instances of
> > UPPERREL_GROUP_AGG RelOptInfos with singleton relids, and later we might
> > put in the ability to consider aggregation across other relid subsets, and
> > in any case we'd run join planning on those level-one rels and create new
> > UPPERREL_GROUP_AGG RelOptInfos that represent the intermediate join steps
> > leading up to the final join. The correct indexing structure for this
> > collection of RelOptInfos is certainly not simple_rel_array; it should
> > look more like the way we index the various regular join rels that we
> > consider.
> >
> > (Maybe an appropriate preliminary patch is to refactor the support code
> > associated with join_rel_list + join_rel_hash so that we can treat those
> > fields as one instance of a structure we'll replicate later.)
>
> Do you mean that, instead of a single RelOptInfo, join_rel_list would
> contain a structure that points to two RelOptInfo structures, where one is
> the "plain" one and the other is the "grouped" one?

Eventually I thought that you proposed a separate instance of join_rel_list
(e.g. join_rel_list_grouped) for the 2nd run of the joining process; however,
that would mean that some functions need to receive RelOptInfos as arguments
and *also* search in join_rel_list_grouped when joining a grouped relation to
a non-grouped one. So I decided to wrap both the grouped and the non-grouped
relation into a new structure (RelOptInfoSet). Thus all the relations to be
joined are passed as function arguments. make_join_rel() shows best how
instances of the new structure are used.

> [1] https://www.postgresql.org/message-id/113e9594-7c08-0f1f-ad15-41773b56a86b@iki.fi
>
> [2] https://www.postgresql.org/message-id/20239.1516971866%40localhost

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com