Thread: Eager aggregation, take 3
Hi All,
Eager aggregation is a query optimization technique that partially
pushes a group-by past a join, and finalizes it once all the relations
are joined. Eager aggregation reduces the number of input rows to the
join and thus may result in a better overall plan. This technique is
thoroughly described in the 'Eager Aggregation and Lazy Aggregation'
paper [1].
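For illustration, the effect is roughly equivalent to the following hand-written rewrite (the tables s and t here are hypothetical, purely for illustration):

-- original form: aggregate only after the join
SELECT s.a, sum(t.c)
FROM s JOIN t ON s.b = t.b
GROUP BY s.a;

-- eager-aggregation form: partially aggregate t before the join,
-- shrinking the join's input, then finalize the aggregate on top
SELECT s.a, sum(tp.psum)
FROM s JOIN (SELECT b, sum(c) AS psum FROM t GROUP BY b) tp
  ON s.b = tp.b
GROUP BY s.a;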
Back in 2017, a patch set was proposed by Antonin Houska to
implement eager aggregation in thread [2]. However, it was ultimately
withdrawn after entering the pattern of "please rebase thx" followed by
rebasing and getting no feedback until "please rebase again thx". A
second attempt in 2022 unfortunately fell into the same pattern about
one year ago and was eventually closed again [3].
That patch set included most of the necessary concepts to implement
eager aggregation. However, as far as I can see, it has several weak
points that we need to address. It introduces invasive changes to some
core planner functions, such as make_join_rel(). And with such changes,
join_is_legal() would be called three times for the same proposed
join, which is not great. Another weak point is that the complexity of
join searching increases dramatically with the growing number of
relations to be joined. This occurs because, when we generate partially
aggregated paths, each path of the input relation is considered as an
input path for the grouped paths. As a result, the number of grouped
paths we generate increases exponentially, leading to a significant
explosion in computational complexity. Other weak points include the
lack of support for outer joins and partitionwise joins. And during my
review of the code, I came across several bugs (planning errors or
crashes) that need to be addressed.
I'd like to give it another take to implement eager aggregation, while
borrowing lots of concepts and many chunks of code from the previous
patch set. Please see attached. I have chosen to use the term 'Eager
Aggregation' from the paper [1] instead of 'Aggregation Push-Down', to
differentiate it from the aggregation push-down technique in FDWs.
The patch has been split into small pieces to make the review easier.
0001 introduces the RelInfoList structure, which encapsulates both a
list and a hash table, so that we can leverage the hash table for faster
lookups not only for join relations but also for upper relations. With
eager aggregation, it is possible that we generate so many upper rels of
type UPPERREL_PARTIAL_GROUP_AGG that a hash table can help a lot with
lookups.
0002 introduces the RelAggInfo structure to store information needed to
create grouped paths for base and join rels. It also revises the
RelInfoList related structures and functions so that they can be used
with RelAggInfos.
0003 checks if eager aggregation is applicable, and if so, collects
suitable aggregate expressions and grouping expressions in the query,
and records them in root->agg_clause_list and root->group_expr_list
respectively.
0004 implements the functions that check if eager aggregation is
applicable for a given relation, and if so, create a RelAggInfo
structure for the relation, using the info about aggregate expressions
and grouping expressions we collected earlier. In this patch, when we
check whether a target expression can act as a grouping expression, we
need to check whether this expression can be known equal, due to ECs,
to other expressions that can act as grouping expressions. This patch
leverages the function exprs_known_equal() to achieve that, after
enhancing this function to consider opfamily if provided.
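As a hypothetical example of the EC case (tables invented for illustration): in

SELECT t1.b, sum(t2.c)
FROM t1 JOIN t2 ON t1.b = t2.b
GROUP BY t1.b;

the join clause places t1.b and t2.b in the same EC, so exprs_known_equal() lets t2.b serve as the grouping key for a partial aggregation pushed down to t2, even though the query itself groups by t1.b.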
0005 implements the functions that generate paths for grouped relations
by adding sorted and hashed partial aggregation paths on top of paths of
the plain base or join relations. For sorted partial aggregation paths,
we consider any suitably-sorted input paths as well as sorting the
cheapest-total path. For hashed partial aggregation paths, we only
consider the cheapest-total path as input. By not considering other
paths we can reduce the number of grouping paths as much as possible
while still achieving reasonable results.
0006 builds grouped relations for each base relation if possible, and
generates aggregation paths for the grouped base relations.
0007 builds grouped relations for each just-processed join relation if
possible, and generates aggregation paths for the grouped join
relations. The changes made to make_join_rel() are relatively minor,
with the addition of a new function make_grouped_join_rel(), which finds
or creates a grouped relation for the just-processed joinrel, and
generates grouped paths by joining a grouped input relation with a
non-grouped input relation.
The other way to generate grouped paths is by adding sorted and hashed
partial aggregation paths on top of paths of the joinrel. This occurs
in standard_join_search(), after we've run set_cheapest() for the
joinrel. The reason for performing this step after set_cheapest() is
that we need to know the joinrel's cheapest paths (see 0005).
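Expressed as hand-written SQL over hypothetical tables t1 and t2, the two ways of producing grouped paths for their join correspond very roughly to:

-- way 1: join a grouped input rel to a non-grouped one
-- (the partial aggregation was pushed down to t2 beforehand)
SELECT t1.a, t2p.psum
FROM t1 JOIN (SELECT b, sum(c) AS psum FROM t2 GROUP BY b) t2p
  ON t1.b = t2p.b;

-- way 2: add a sorted or hashed partial aggregation on top of the
-- finished joinrel's cheapest paths
SELECT t1.a, sum(t2.c)
FROM t1 JOIN t2 ON t1.b = t2.b
GROUP BY t1.a;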
This patch also makes the grouped relation for the topmost join rel act
as the upper rel representing the result of partial aggregation, so that
we can add the final aggregation on top of that. Additionally, this
patch extends the functionality of eager aggregation to work with
partitionwise join and geqo.
This patch also makes eager aggregation work with outer joins. With
outer join, the aggregate cannot be pushed down if any column referenced
by grouping expressions or aggregate functions is nullable by an outer
join above the relation to which we want to apply the partial
aggregation. Thanks to Tom's outer-join-aware-Var infrastructure, we
can easily identify such situations and subsequently refrain from
pushing down the aggregates.
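For example (hypothetical tables again), in

SELECT t2.a, count(t2.c)
FROM t1 LEFT JOIN t2 ON t1.b = t2.b
GROUP BY t2.a;

both the grouping expression and the aggregate reference columns of t2, which are nullable by the left join above it, so no partial aggregation may be pushed down to t2.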
Starting from this patch, you should be able to see plans with eager
aggregation.
0008 adds test cases for eager aggregation.
0009 adds a section in README that describes this feature (copied from
previous patch set, with minor tweaks).
Thoughts and comments are welcome.
Attachment
- v1-0005-Implement-functions-that-generate-paths-for-grouped-relations.patch
- v1-0001-Introduce-RelInfoList-structure.patch
- v1-0004-Implement-functions-that-create-RelAggInfos-if-applicable.patch
- v1-0002-Introduce-RelAggInfo-structure-to-store-info-for-grouped-paths.patch
- v1-0003-Set-up-for-eager-aggregation-by-collecting-needed-infos.patch
- v1-0006-Build-grouped-relations-out-of-base-relations.patch
- v1-0007-Build-grouped-relations-out-of-join-relations.patch
- v1-0009-Add-README.patch
- v1-0008-Add-test-cases.patch
Richard Guo <guofenglinux@gmail.com> writes:
> Hi All,
>
> Eager aggregation is a query optimization technique that partially
> pushes a group-by past a join, and finalizes it once all the relations
> are joined. Eager aggregation reduces the number of input rows to the
> join and thus may result in a better overall plan. This technique is
> thoroughly described in the 'Eager Aggregation and Lazy Aggregation'
> paper [1].

This is a really helpful and not easy task. Even though I am not sure when I can spend time to study this, I want to say "Thanks for working on this!" first, and I hope we can really make progress on this topic. Good luck!

--
Best Regards
Andy Fan
On Mon, Mar 4, 2024 at 7:49 PM Andy Fan <zhihuifan1213@163.com> wrote:
This is a really helpful and not easy task, even I am not sure when I
can spend time to study this, I want to say "Thanks for working on
this!" first and hope we can really progress on this topic. Good luck!
Thanks. I hope this take can go even further and ultimately find its
way to be committed.
This needs a rebase after dbbca2cf29. I also revised the commit message
for 0007 and fixed a typo in 0009.
Thanks
Richard
Attachment
- v2-0001-Introduce-RelInfoList-structure.patch
- v2-0004-Implement-functions-that-create-RelAggInfos-if-applicable.patch
- v2-0002-Introduce-RelAggInfo-structure-to-store-info-for-grouped-paths.patch
- v2-0005-Implement-functions-that-generate-paths-for-grouped-relations.patch
- v2-0003-Set-up-for-eager-aggregation-by-collecting-needed-infos.patch
- v2-0006-Build-grouped-relations-out-of-base-relations.patch
- v2-0009-Add-README.patch
- v2-0007-Build-grouped-relations-out-of-join-relations.patch
- v2-0008-Add-test-cases.patch
On Tue, Mar 5, 2024 at 2:47 PM Richard Guo <guofenglinux@gmail.com> wrote:
This needs a rebase after dbbca2cf29. I also revised the commit message
for 0007 and fixed a typo in 0009.
Here is another rebase, mainly to make the test cases more stable by
adding ORDER BY clauses to the test queries. Also fixed more typos in
passing.
Thanks
Richard
Attachment
- v3-0001-Introduce-RelInfoList-structure.patch
- v3-0002-Introduce-RelAggInfo-structure-to-store-info-for-grouped-paths.patch
- v3-0003-Set-up-for-eager-aggregation-by-collecting-needed-infos.patch
- v3-0004-Implement-functions-that-create-RelAggInfos-if-applicable.patch
- v3-0005-Implement-functions-that-generate-paths-for-grouped-relations.patch
- v3-0006-Build-grouped-relations-out-of-base-relations.patch
- v3-0007-Build-grouped-relations-out-of-join-relations.patch
- v3-0008-Add-test-cases.patch
- v3-0009-Add-README.patch
On Tue, Mar 5, 2024 at 7:19 PM Richard Guo <guofenglinux@gmail.com> wrote:
Here is another rebase, mainly to make the test cases more stable by
adding ORDER BY clauses to the test queries. Also fixed more typos in
passing.
This needs another rebase after 97d85be365. I also addressed several
issues that I identified during self-review, which include:
* In some cases GroupPathExtraData.agg_final_costs, which is the cost of
final aggregation, fails to be calculated. This can lead to bogus cost
estimation and end up with an unexpected plan.
* If the cheapest partially grouped path is generated through eager
aggregation, the number of groups estimated for the final phase will be
different from the number of groups estimated for non-split aggregation.
That is to say, we should not use 'dNumGroups' for the final aggregation
in add_paths_to_grouping_rel().
* It is possible that we may generate dummy grouped join relations, and
that would trigger the Assert in make_grouped_join_rel().
* More typo fixes.
Thanks
Richard
Attachment
- v4-0001-Introduce-RelInfoList-structure.patch
- v4-0002-Introduce-RelAggInfo-structure-to-store-info-for-grouped-paths.patch
- v4-0003-Set-up-for-eager-aggregation-by-collecting-needed-infos.patch
- v4-0004-Implement-functions-that-create-RelAggInfos-if-applicable.patch
- v4-0005-Implement-functions-that-generate-paths-for-grouped-relations.patch
- v4-0006-Build-grouped-relations-out-of-base-relations.patch
- v4-0007-Build-grouped-relations-out-of-join-relations.patch
- v4-0008-Add-test-cases.patch
- v4-0009-Add-README.patch
There is a conflict in the parallel_schedule file. So here is another
rebase. Nothing else has changed.
Thanks
Richard
Attachment
- v5-0001-Introduce-RelInfoList-structure.patch
- v5-0002-Introduce-RelAggInfo-structure-to-store-info-for-grouped-paths.patch
- v5-0003-Set-up-for-eager-aggregation-by-collecting-needed-infos.patch
- v5-0004-Implement-functions-that-create-RelAggInfos-if-applicable.patch
- v5-0005-Implement-functions-that-generate-paths-for-grouped-relations.patch
- v5-0006-Build-grouped-relations-out-of-base-relations.patch
- v5-0007-Build-grouped-relations-out-of-join-relations.patch
- v5-0008-Add-test-cases.patch
- v5-0009-Add-README.patch
Here is an update of the patchset with the following changes:
* Fix an 'Aggref found where not expected' error caused by the PVC call
in is_var_in_aggref_only. This would happen if we have Aggrefs
contained in other expressions.
* Use joinrel's relids rather than the union of the relids of its outer
and inner to search for its grouped rel. This is more correct as we
need to include OJs into consideration.
* Remove RelAggInfo.agg_exprs as it is not used anymore.
Thanks
Richard
Attachment
- v6-0001-Introduce-RelInfoList-structure.patch
- v6-0002-Introduce-RelAggInfo-structure-to-store-info-for-grouped-paths.patch
- v6-0003-Set-up-for-eager-aggregation-by-collecting-needed-infos.patch
- v6-0004-Implement-functions-that-create-RelAggInfos-if-applicable.patch
- v6-0005-Implement-functions-that-generate-paths-for-grouped-relations.patch
- v6-0006-Build-grouped-relations-out-of-base-relations.patch
- v6-0007-Build-grouped-relations-out-of-join-relations.patch
- v6-0008-Add-test-cases.patch
- v6-0009-Add-README.patch
Another rebase is needed after d1d286d83c. Also I realized that the
partially_grouped_rel generated by eager aggregation might be dummy,
such as in query:
select count(t2.c) from t t1 join t t2 on t1.b = t2.b where false group by t1.a;
If somehow we choose this dummy path with a Finalize Agg Path on top of
it as the final cheapest path (a very rare case), we would encounter the
"Aggref found in non-Agg plan node" error. The v7 patch fixes this
issue.
Thanks
Richard
Attachment
- v7-0001-Introduce-RelInfoList-structure.patch
- v7-0002-Introduce-RelAggInfo-structure-to-store-info-for-grouped-paths.patch
- v7-0003-Set-up-for-eager-aggregation-by-collecting-needed-infos.patch
- v7-0004-Implement-functions-that-create-RelAggInfos-if-applicable.patch
- v7-0005-Implement-functions-that-generate-paths-for-grouped-relations.patch
- v7-0006-Build-grouped-relations-out-of-base-relations.patch
- v7-0007-Build-grouped-relations-out-of-join-relations.patch
- v7-0008-Add-test-cases.patch
- v7-0009-Add-README.patch
On Thu, Jun 13, 2024 at 4:07 PM Richard Guo <guofenglinux@gmail.com> wrote:
> I spent some time testing this patchset and found a few more issues.
> ...
> Hence here is the v8 patchset, with fixes for all the above issues.

I found an 'ORDER/GROUP BY expression not found in targetlist' error with this patchset, with the query below:

create table t (a boolean);

set enable_eager_aggregate to on;

explain (costs off)
select min(1) from t t1 left join t t2 on t1.a
group by (not (not t1.a)), t1.a order by t1.a;
ERROR:  ORDER/GROUP BY expression not found in targetlist

This happens because the two grouping items are actually the same and standard_qp_callback would remove one of them. The fully-processed groupClause is kept in root->processed_groupClause. However, when collecting grouping expressions in create_grouping_expr_infos, we are checking parse->groupClause, which is incorrect.

The fix is straightforward: check root->processed_groupClause instead.

Here is a new rebase with this fix.

Thanks
Richard
Attachment
- v9-0001-Introduce-RelInfoList-structure.patch
- v9-0002-Introduce-RelAggInfo-structure-to-store-info-for-grouped-paths.patch
- v9-0003-Set-up-for-eager-aggregation-by-collecting-needed-infos.patch
- v9-0004-Implement-functions-that-create-RelAggInfos-if-applicable.patch
- v9-0005-Implement-functions-that-generate-paths-for-grouped-relations.patch
- v9-0006-Build-grouped-relations-out-of-base-relations.patch
- v9-0007-Build-grouped-relations-out-of-join-relations.patch
- v9-0008-Add-test-cases.patch
- v9-0009-Add-README.patch
- v9-0010-Run-pgindent.patch
Richard:
Thanks for reviving this patch and for all of your work on it! Eager aggregation pushdown will be beneficial for my work and I'm hoping to see it land.
I was playing around with v9 of the patches and was specifically curious about this previous statement...
>This patch also makes eager aggregation work with outer joins. With
>outer join, the aggregate cannot be pushed down if any column referenced
>by grouping expressions or aggregate functions is nullable by an outer
>join above the relation to which we want to apply the partiall
>aggregation. Thanks to Tom's outer-join-aware-Var infrastructure, we
>can easily identify such situations and subsequently refrain from
>pushing down the aggregates.
...and this related comment in eager_aggregate.out:
>-- Ensure aggregation cannot be pushed down to the nullable side
While I'm new to this work and its subtleties, I'm wondering if this is too broad a condition.
I modified the first test query in eager_aggregate.sql to make it a LEFT JOIN and eager aggregation indeed did not happen, which is expected based on the comments upthread.
query:
SET enable_eager_aggregate=ON;
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.a, sum(t2.c) FROM eager_agg_t1 t1 LEFT JOIN eager_agg_t2 t2 ON t1.b = t2.b GROUP BY t1.a ORDER BY t1.a;
plan:
QUERY PLAN
------------------------------------------------------------
GroupAggregate
Output: t1.a, sum(t2.c)
Group Key: t1.a
-> Sort
Output: t1.a, t2.c
Sort Key: t1.a
-> Hash Right Join
Output: t1.a, t2.c
Hash Cond: (t2.b = t1.b)
-> Seq Scan on public.eager_agg_t2 t2
Output: t2.a, t2.b, t2.c
-> Hash
Output: t1.a, t1.b
-> Seq Scan on public.eager_agg_t1 t1
Output: t1.a, t1.b
(15 rows)
(NOTE: I changed the aggregate from avg(...) to sum(...) for simplicity)
But, it seems that eager aggregation for the query above can be "replicated" as:
query:
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.a, sum(t2.c)
FROM eager_agg_t1 t1
LEFT JOIN (
SELECT b, sum(c) c
FROM eager_agg_t2 t2p
GROUP BY b
) t2 ON t1.b = t2.b
GROUP BY t1.a
ORDER BY t1.a;
The output of both the original query and this one match (and the plans with eager aggregation and the subquery are nearly identical if you restore the LEFT JOIN to a JOIN). I admittedly may be missing a subtlety, but does this mean that there are conditions under which eager aggregation can be pushed down to the nullable side?
-Paul-
On Sun, Jul 7, 2024 at 10:45 AM Paul George <p.a.george19@gmail.com> wrote:
> Thanks for reviving this patch and for all of your work on it! Eager aggregation pushdown will be beneficial for my work and I'm hoping to see it land.

Thanks for looking at this patch!

> The output of both the original query and this one match (and the plans with eager aggregation and the subquery are nearly identical if you restore the LEFT JOIN to a JOIN). I admittedly may be missing a subtlety, but does this mean that there are conditions under which eager aggregation can be pushed down to the nullable side?

I think it's a very risky thing to push a partial aggregation down to the nullable side of an outer join, because the NULL-extended rows produced by the outer join would not be available when we perform the partial aggregation, while with a non-eager-aggregation plan these rows are available for the top-level aggregation. This may put the rows into groups in a different way than expected, or get wrong values from the aggregate functions. I've managed to compose an example:

create table t (a int, b int);
insert into t select 1, 1;

select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by t2.a having t2.a is null;
 a | count
---+-------
   |     1
(1 row)

This is the expected result, because after the outer join we have got a NULL-extended row.

But if we somehow push down the partial aggregation to the nullable side of this outer join, we would get a wrong result.

explain (costs off)
select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by t2.a having t2.a is null;
                 QUERY PLAN
-------------------------------------------
 Finalize HashAggregate
   Group Key: t2.a
   ->  Nested Loop Left Join
         Filter: (t2.a IS NULL)
         ->  Seq Scan on t t1
         ->  Materialize
               ->  Partial HashAggregate
                     Group Key: t2.a
                     ->  Seq Scan on t t2
                           Filter: (b > 1)
(10 rows)

select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by t2.a having t2.a is null;
 a | count
---+-------
   |     0
(1 row)

I believe there are cases where pushing a partial aggregation down to the nullable side of an outer join can be safe, but I doubt that there is an easy way to identify these cases and do the push-down for them. So for now I think we'd better refrain from doing that.

Thanks
Richard
Hey Richard,
Looking more closely at this example, I wonder if the inability to exploit eager aggregation is more based on the fact that COUNT(*) cannot be decomposed into an aggregation of PARTIAL COUNT(*)s (apologies if my terminology is off/made up...I'm new to the codebase). In other words, is it the case that a given aggregate function already has built-in protection against the error case you correctly pointed out?
To highlight this, in the simple example below we don't see aggregate pushdown even with an INNER JOIN when the agg function is COUNT(*) but we do when it's COUNT(t2.*):
-- same setup
drop table if exists t;
create table t(a int, b int, c int);
insert into t select i % 100, i % 10, i from generate_series(1, 1000) i;
analyze t;
-- query 1: COUNT(*) --> no pushdown
set enable_eager_aggregate=on;
explain (verbose, costs off) select t1.a, count(*) from t t1 join t t2 on t1.a=t2.a group by t1.a;
QUERY PLAN
-------------------------------------------
HashAggregate
Output: t1.a, count(*)
Group Key: t1.a
-> Hash Join
Output: t1.a
Hash Cond: (t1.a = t2.a)
-> Seq Scan on public.t t1
Output: t1.a, t1.b, t1.c
-> Hash
Output: t2.a
-> Seq Scan on public.t t2
Output: t2.a
(12 rows)
-- query 2: COUNT(t2.*) --> agg pushdown
set enable_eager_aggregate=on;
explain (verbose, costs off) select t1.a, count(t2.*) from t t1 join t t2 on t1.a=t2.a group by t1.a;
QUERY PLAN
-------------------------------------------------------
Finalize HashAggregate
Output: t1.a, count(t2.*)
Group Key: t1.a
-> Hash Join
Output: t1.a, (PARTIAL count(t2.*))
Hash Cond: (t1.a = t2.a)
-> Seq Scan on public.t t1
Output: t1.a, t1.b, t1.c
-> Hash
Output: t2.a, (PARTIAL count(t2.*))
-> Partial HashAggregate
Output: t2.a, PARTIAL count(t2.*)
Group Key: t2.a
-> Seq Scan on public.t t2
Output: t2.*, t2.a
(15 rows)
...while it might be true that COUNT(*) ... INNER JOIN should allow eager agg pushdown (I haven't thought deeply about it, TBH), I did find this result pretty interesting.
-Paul
On Wed, Aug 21, 2024 at 3:11 AM Richard Guo <guofenglinux@gmail.com> wrote:
> Attached is the updated version of the patchset that fixes this bug
> and includes further code refactoring.

Here are some initial, high-level thoughts about this patch set.

1. As far as I can see, there's no real performance testing on this thread. I expect that it's possible to show an arbitrarily large gain for the patch by finding a case where partial aggregation is way better than anything we currently know, but that's not very interesting. What I think would be useful to do is find a corpus of existing queries on an existing data set and try them with and without the patch and see which query plans change and whether they're actually better. For example, maybe TPC-H or the subset of TPC-DS that we can actually run would be a useful starting point. One could then also measure how much the planning time increases with the patch to get a sense of what the overhead of enabling this feature would be. Even if it's disabled by default, people aren't going to want to enable it if it causes planning times to become much longer on many queries for which there is no benefit.

2. I think there might be techniques we could use to limit planning effort at an earlier stage when the approach doesn't appear promising. For example, if the proposed grouping column is already unique, the exercise is pointless (I think). Ideally we'd like to detect that without even creating the grouped_rel. But the proposed grouping column might also be *mostly* unique. For example, consider a table with a million rows and a column with 500,000 distinct values. I suspect it will be difficult for partial aggregation to work out to a win in a case like this, because I think that the cost of performing the partial aggregation will not reduce the cost either of the final aggregation or of the intervening join steps by enough to compensate. It would be best to find a way to avoid generating a lot of rels and paths in cases where there's really not much hope of a win.

One could, perhaps, imagine going further with this by postponing eager aggregation planning until after regular paths have been built, so that we have good cardinality estimates. Suppose the query joins a single fact table to a series of dimension tables. The final plan thus uses the fact table as the driving table and joins to the dimension tables one by one. Do we really need to consider partial aggregation at every level? Perhaps just where there's been a significant row count reduction since the last time we tried it, but at the next level the row count will increase again? Maybe there are other heuristics we could use in addition or instead.

3. In general, we are quite bad at estimating what will happen to the row count after an aggregation, and we have no real idea what the distribution of values will be. That might be a problem for this patch, because it seems like the decisions we will make about where to perform the partial aggregation might end up being quite random. At the top of the join tree, I'll need to compare directly aggregating the best join path with various paths that involve a finalize aggregation step at the top and a partial aggregation step further down. But my cost estimates and row counts for the partial aggregate steps seem like they will often be quite poor, which means that the plans that use those partial aggregate steps might also be quite poor. Even if they're not, I fear that comparing the cost of those PartialAggregate-Join(s)-FinalizeAggregate paths to the direct Aggregate path will look too much like comparing random numbers. We need to know whether the combination of the FinalizeAggregate step and the PartialAggregate step will be more or less expensive than a plain old Aggregate, but how can we tell that if we don't have accurate cardinality estimates?

Thanks for working on this.

--
Robert Haas
EDB: http://www.enterprisedb.com
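To make the "mostly unique" case in point 2 concrete, here is a hypothetical setup (table names and numbers invented for illustration):

CREATE TABLE fact (b int, c int);
CREATE TABLE dim (b int, name text);
-- one million rows, about 500,000 distinct values of b
INSERT INTO fact SELECT i % 500000, i FROM generate_series(1, 1000000) i;
ANALYZE fact;

-- a partial aggregate pushed down to fact, grouped by b, would emit
-- ~500,000 groups out of 1,000,000 rows: the join's input only halves,
-- which is unlikely to repay the cost of the extra aggregation step
SELECT d.name, sum(f.c)
FROM fact f JOIN dim d ON f.b = d.b
GROUP BY d.name;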
On Wed, Aug 21, 2024 at 15:11 Richard Guo <guofenglinux@gmail.com> wrote:
On Fri, Aug 16, 2024 at 4:14 PM Richard Guo <guofenglinux@gmail.com> wrote:
> I had a self-review of this patchset and made some refactoring,
> especially to the function that creates the RelAggInfo structure for a
> given relation. While there were no major changes, the code should
> now be simpler.
I found a bug in v10 patchset: when we generate the GROUP BY clauses
for the partial aggregation that is pushed down to a non-aggregated
relation, we may produce a clause with a tleSortGroupRef that
duplicates one already present in the query's groupClause, which would
cause problems.
Attached is the updated version of the patchset that fixes this bug
and includes further code refactoring.
Recently, I did some benchmark tests, mainly on TPC-H and TPC-DS.
The TPC-H tests show no plan diffs, so I did not continue testing on TPC-H.
The TPC-DS (10GB) tests show plan diffs in the 22 queries below:
4.sql, 5.sql, 8.sql, 11.sql, 19.sql, 23.sql, 31.sql, 33.sql, 39.sql, 45.sql, 46.sql, 47.sql, 53.sql,
56.sql, 57.sql, 60.sql, 63.sql, 68.sql, 74.sql, 77.sql, 80.sql, 89.sql
I haven't looked at all of them; I just picked a few queries with simple plans (e.g. 19.sql, 45.sql).
For example, on 19.sql, eager aggregation pushdown doesn't produce a large gain, but rather a small performance regression.
I will continue to do benchmarks on this feature.

[1] https://github.com/tenderwg/eager_agg
Tender Wang
On Tue, Aug 27, 2024 at 11:57 PM Tender Wang <tndrwang@gmail.com> wrote:
> Recently, I did some benchmark tests, mainly on TPC-H and TPC-DS.
> The TPC-H tests show no plan diffs, so I did not continue testing on TPC-H.

Interesting to know.

> The TPC-DS (10GB) tests show plan diffs in the 22 queries below:
> 4.sql, 5.sql, 8.sql, 11.sql, 19.sql, 23.sql, 31.sql, 33.sql, 39.sql, 45.sql, 46.sql, 47.sql, 53.sql,
> 56.sql, 57.sql, 60.sql, 63.sql, 68.sql, 74.sql, 77.sql, 80.sql, 89.sql

OK.

> I haven't looked at all of them; I just picked a few queries with simple plans (e.g. 19.sql, 45.sql).
> For example, on 19.sql, eager aggregation pushdown doesn't produce a large gain, but rather a small performance regression.

Yeah, this is one of the things I was worried about in my previous reply to Richard. It would be worth Richard, or someone, probing into exactly why that's happening. My fear is that we just don't have good enough estimates to make good decisions, but there might well be another explanation.

> I will continue to do benchmarks on this feature.
>
> [1] https://github.com/tenderwg/eager_agg

Thanks!

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Aug 23, 2024 at 11:59 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Here are some initial, high-level thoughts about this patch set.

Thank you for your review and feedback! It helps a lot in moving this work forward.

> 1. As far as I can see, there's no real performance testing on this
> thread.
> ...

Right. I haven't had time to run any benchmarks yet, but that is something I need to do.

> 2. I think there might be techniques we could use to limit planning
> effort at an earlier stage when the approach doesn't appear promising.
> ...
> Maybe there are other heuristics we could use in addition or instead.

Yeah, one of my concerns with this work is that it can use significantly more CPU time and memory during planning once enabled. It would be great if we have some efficient heuristics to limit the effort. I'll work on that next and see what happens.

> 3. In general, we are quite bad at estimating what will happen to the
> row count after an aggregation, and we have no real idea what the
> distribution of values will be.
> ...
> how can we tell that if we don't have accurate cardinality estimates?

Yeah, I'm concerned about this too. In addition to the inaccuracies in aggregation estimates, our estimates for joins are sometimes not very accurate either. All this is likely to result in regressions with eager aggregation in some cases. Currently I don't have a good answer to this problem. Maybe we can run some benchmarks first and investigate the regressions discovered on a case-by-case basis to better understand the specific issues.

Thanks
Richard
On Wed, Aug 28, 2024 at 11:57 AM Tender Wang <tndrwang@gmail.com> wrote:
> Recently, I did some benchmark tests, mainly on TPC-H and TPC-DS.
> The TPC-H tests show no plan diffs, so I did not continue testing on TPC-H.
> ...
> I will continue to do benchmarks on this feature.

Thank you for running the benchmarks. That really helps a lot.

Thanks
Richard
On Wed, Aug 28, 2024 at 9:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 27, 2024 at 11:57 PM Tender Wang <tndrwang@gmail.com> wrote:
> > I haven't looked at all of them; I just picked a few queries with simple plans (e.g. 19.sql, 45.sql).
> > For example, on 19.sql, eager aggregation pushdown doesn't produce a large gain, but rather a small performance regression.
>
> Yeah, this is one of the things I was worried about in my previous
> reply to Richard. It would be worth Richard, or someone, probing into
> exactly why that's happening. My fear is that we just don't have good
> enough estimates to make good decisions, but there might well be
> another explanation.

It's great that we have a query to probe into. Your guess is likely correct: it may be caused by poor estimates.

Tender, would you please help provide the outputs of

EXPLAIN (COSTS ON, ANALYZE)

on 19.sql with and without eager aggregation?

> > I will continue to do benchmarks on this feature.

Thanks again for running the benchmarks.

Thanks
Richard
On Thu, Aug 29, 2024 at 10:46 Richard Guo <guofenglinux@gmail.com> wrote:
On Wed, Aug 28, 2024 at 9:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 27, 2024 at 11:57 PM Tender Wang <tndrwang@gmail.com> wrote:
> > I haven't looked at all of them; I just picked a few queries with simple plans (e.g. 19.sql, 45.sql).
> > For example, on 19.sql, eager aggregation pushdown doesn't produce a large gain, but rather a small
> > performance regression.
>
> Yeah, this is one of the things I was worried about in my previous
> reply to Richard. It would be worth Richard, or someone, probing into
> exactly why that's happening. My fear is that we just don't have good
> enough estimates to make good decisions, but there might well be
> another explanation.
It's great that we have a query to probe into. Your guess is likely
correct: it may be caused by poor estimates.
Tender, would you please help provide the outputs of
EXPLAIN (COSTS ON, ANALYZE)
on 19.sql with and without eager aggregation?
Yeah, in [1], 19_off.out and 19_on.out are the outputs of EXPLAIN (COSTS OFF, ANALYZE).
I will do EXPLAIN (COSTS ON, ANALYZE) tests and upload them later today.
Tender Wang
On Thu, Aug 29, 2024 at 10:46 Richard Guo <guofenglinux@gmail.com> wrote:
On Wed, Aug 28, 2024 at 9:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 27, 2024 at 11:57 PM Tender Wang <tndrwang@gmail.com> wrote:
> > I haven't looked at all of them; I just picked a few queries with simple plans (e.g. 19.sql, 45.sql).
> > For example, on 19.sql, eager aggregation pushdown doesn't produce a large gain, but rather a small
> > performance regression.
>
> Yeah, this is one of the things I was worried about in my previous
> reply to Richard. It would be worth Richard, or someone, probing into
> exactly why that's happening. My fear is that we just don't have good
> enough estimates to make good decisions, but there might well be
> another explanation.
It's great that we have a query to probe into. Your guess is likely
correct: it may be caused by poor estimates.
Tender, would you please help provide the outputs of
EXPLAIN (COSTS ON, ANALYZE)
on 19.sql with and without eager aggregation?
I uploaded the EXPLAIN (COSTS ON, ANALYZE) results to [1].
I ran the same query three times, and chose the result of the third run.
You can check 19_off_explain.out and 19_on_explain.out.
Tender Wang
On Wed, Aug 28, 2024 at 10:26 PM Richard Guo <guofenglinux@gmail.com> wrote:
> Yeah, I'm concerned about this too. In addition to the inaccuracies
> in aggregation estimates, our estimates for joins are sometimes not
> very accurate either. All this is likely to result in regressions
> with eager aggregation in some cases. Currently I don't have a good
> answer to this problem. Maybe we can run some benchmarks first and
> investigate the regressions discovered on a case-by-case basis to
> better understand the specific issues.

While it's true that we can make mistakes during join estimation, I believe aggregate estimation tends to be far worse.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Aug 28, 2024 at 11:38 PM Tender Wang <tndrwang@gmail.com> wrote:
> I uploaded the EXPLAIN (COSTS ON, ANALYZE) results to [1].
> I ran the same query three times, and chose the result of the third run.
> You can check 19_off_explain.out and 19_on_explain.out.

So, in 19_off_explain.out, we got this:

-> Finalize GroupAggregate (cost=666986.48..667015.35 rows=187 width=142) (actual time=272.649..334.318 rows=900 loops=1)
   -> Gather Merge (cost=666986.48..667010.21 rows=187 width=142) (actual time=272.644..333.847 rows=901 loops=1)
      -> Partial GroupAggregate (cost=665986.46..665988.60 rows=78 width=142) (actual time=266.379..267.476 rows=300 loops=3)
         -> Sort (cost=665986.46..665986.65 rows=78 width=116) (actual time=266.367..266.583 rows=5081 loops=3)

And in 19_on_explain.out, we got this:

-> Finalize GroupAggregate (cost=666987.03..666989.77 rows=19 width=142) (actual time=285.018..357.374 rows=900 loops=1)
   -> Gather Merge (cost=666987.03..666989.25 rows=19 width=142) (actual time=285.000..352.793 rows=15242 loops=1)
      -> Sort (cost=665987.01..665987.03 rows=8 width=142) (actual time=273.391..273.580 rows=5081 loops=3)
         -> Nested Loop (cost=665918.00..665986.89 rows=8 width=142) (actual time=252.667..269.719 rows=5081 loops=3)
            -> Nested Loop (cost=665917.85..665985.43 rows=8 width=157) (actual time=252.656..264.755 rows=5413 loops=3)
               -> Partial GroupAggregate (cost=665917.43..665920.10 rows=82 width=150) (actual time=252.643..255.627 rows=5413 loops=3)
                  -> Sort (cost=665917.43..665917.64 rows=82 width=124) (actual time=252.636..252.927 rows=5413 loops=3)

So, the patch was expected to cause the number of rows passing through the Gather Merge to decrease from 187 to 19, but actually caused the number of rows passing through the Gather Merge to increase from 901 to 15242. When the PartialAggregate was positioned at the top of the join tree, it reduced the number of rows from 5081 to 300; but when it was pushed down below two joins, it didn't reduce the row count at all, and the subsequent two joins reduced it by less than 10%.

Now, you could complain about the fact that the Parallel Hash Join isn't well-estimated here, but my question is: why does the planner think that the PartialAggregate should go specifically here? In both plans, the PartialAggregate isn't expected to change the row count. And if that is true, then it's going to be cheapest to do it at the point where the joins have reduced the row count to the minimum value. Here, that would be at the top of the plan tree, where we have only 5081 estimated rows, but instead, the patch chooses to do it as soon as we have all of the grouping columns, when we still have 5413 rows. I don't understand why that path wins on cost, unless it's just that the paths compare fuzzily the same, in which case it kind of goes to my earlier point about not really having the statistics to know which way is actually going to be better.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Aug 21, 2024 at 15:11 Richard Guo <guofenglinux@gmail.com> wrote:
On Fri, Aug 16, 2024 at 4:14 PM Richard Guo <guofenglinux@gmail.com> wrote:
> I had a self-review of this patchset and made some refactoring,
> especially to the function that creates the RelAggInfo structure for a
> given relation. While there were no major changes, the code should
> now be simpler.
I found a bug in v10 patchset: when we generate the GROUP BY clauses
for the partial aggregation that is pushed down to a non-aggregated
relation, we may produce a clause with a tleSortGroupRef that
duplicates one already present in the query's groupClause, which would
cause problems.
Attached is the updated version of the patchset that fixes this bug
and includes further code refactoring.
Applying v11-0002 with git am fails on HEAD (6c2b5edecc):
Applying: Implement Eager Aggregation
error: patch failed: src/test/regress/parallel_schedule:119
error: src/test/regress/parallel_schedule: patch does not apply
Patch failed at 0001 Implement Eager Aggregation
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Thanks,
Tender Wang
On Wed, Aug 21, 2024 at 15:11 Richard Guo <guofenglinux@gmail.com> wrote:
On Fri, Aug 16, 2024 at 4:14 PM Richard Guo <guofenglinux@gmail.com> wrote:
> I had a self-review of this patchset and made some refactoring,
> especially to the function that creates the RelAggInfo structure for a
> given relation. While there were no major changes, the code should
> now be simpler.
I found a bug in v10 patchset: when we generate the GROUP BY clauses
for the partial aggregation that is pushed down to a non-aggregated
relation, we may produce a clause with a tleSortGroupRef that
duplicates one already present in the query's groupClause, which would
cause problems.
Attached is the updated version of the patchset that fixes this bug
and includes further code refactoring.
I reviewed the v11 patch set, and here are a few of my thoughts:
1. In setup_eager_aggregation(), before calling create_agg_clause_infos(), it does
some checks to see whether eager aggregation is applicable. Can we move those checks into a function,
for example can_eager_agg(), like can_partial_agg() does?
2. I found that outside of joinrel.c we always use IS_DUMMY_REL, but in joinrel.c, Tom always uses
is_dummy_rel(). Other committers use IS_DUMMY_REL.
3. The attached patch does not consider FDWs when creating a path for grouped_rel or grouped_join.
Do we need to think about FDWs?
I haven't finished reviewing the patch set. I will continue to learn this feature.
Thanks,
Tender Wang
Richard Guo <guofenglinux@gmail.com> wrote on Wed, Aug 21, 2024 at 15:11:
> On Fri, Aug 16, 2024 at 4:14 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > I had a self-review of this patchset and made some refactoring,
> > especially to the function that creates the RelAggInfo structure for a
> > given relation. While there were no major changes, the code should
> > now be simpler.
>
> I found a bug in v10 patchset: when we generate the GROUP BY clauses
> for the partial aggregation that is pushed down to a non-aggregated
> relation, we may produce a clause with a tleSortGroupRef that
> duplicates one already present in the query's groupClause, which would
> cause problems.
>
> Attached is the updated version of the patchset that fixes this bug
> and includes further code refactoring.

1. In make_one_rel(), we have the below code:

    /*
     * Build grouped base relations for each base rel if possible.
     */
    setup_base_grouped_rels(root);

As far as I know, each base rel only has one grouped base relation, if
possible. The comment may be changed to "Build a grouped base relation
for each base rel if possible."

2. According to the comments of generate_grouped_paths(), we may
generate paths for a grouped relation on top of paths of a join
relation. So the "rel_plain" argument in generate_grouped_paths() may
be confusing: "plain" usually means "base rel". How about renaming
rel_plain to input_rel?

3. In create_partial_grouping_paths(), the partially_grouped_rel could
have already been created due to eager aggregation. If
partially_grouped_rel exists, its reltarget has already been created.
So do we need the logic below?

    /*
     * Build target list for partial aggregate paths. These paths cannot just
     * emit the same tlist as regular aggregate paths, because (1) we must
     * include Vars and Aggrefs needed in HAVING, which might not appear in
     * the result tlist, and (2) the Aggrefs must be set in partial mode.
     */
    partially_grouped_rel->reltarget =
        make_partial_grouping_target(root, grouped_rel->reltarget,
                                     extra->havingQual);

--
Thanks,
Tender Wang
Tender Wang <tndrwang@gmail.com> wrote on Wed, Sep 4, 2024 at 11:48:
> Richard Guo <guofenglinux@gmail.com> wrote on Wed, Aug 21, 2024 at 15:11:
> > On Fri, Aug 16, 2024 at 4:14 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > > I had a self-review of this patchset and made some refactoring,
> > > especially to the function that creates the RelAggInfo structure for a
> > > given relation. While there were no major changes, the code should
> > > now be simpler.
> >
> > I found a bug in v10 patchset: when we generate the GROUP BY clauses
> > for the partial aggregation that is pushed down to a non-aggregated
> > relation, we may produce a clause with a tleSortGroupRef that
> > duplicates one already present in the query's groupClause, which would
> > cause problems.
> >
> > Attached is the updated version of the patchset that fixes this bug
> > and includes further code refactoring.
>
> The v11-0002 git am failed on HEAD (6c2b5edecc).
>
> tender@iZ2ze6la2dizi7df9q3xheZ:/workspace/postgres$ git am v11-0002-Implement-Eager-Aggregation.patch
> Applying: Implement Eager Aggregation
> error: patch failed: src/test/regress/parallel_schedule:119
> error: src/test/regress/parallel_schedule: patch does not apply
> Patch failed at 0001 Implement Eager Aggregation
> hint: Use 'git am --show-current-patch=diff' to see the failed patch
> When you have resolved this problem, run "git am --continue".
> If you prefer to skip this patch, run "git am --skip" instead.
> To restore the original branch and stop patching, run "git am --abort".

Since MERGE/SPLIT partition has been reverted, the tests
*partition_merge* and *partition_split* should be removed from
parallel_schedule. After doing the above, the 0002 patch can be
applied.

Thanks,
Tender Wang
On Wed, Aug 28, 2024 at 9:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 27, 2024 at 11:57 PM Tender Wang <tndrwang@gmail.com> wrote:
> > I haven't looked at all of them. I just picked a few simple plan tests
> > (e.g. 19.sql, 45.sql). For example, in 19.sql, eager agg pushdown
> > doesn't get a large gain, but a little performance regression.
>
> Yeah, this is one of the things I was worried about in my previous
> reply to Richard. It would be worth Richard, or someone, probing into
> exactly why that's happening. My fear is that we just don't have good
> enough estimates to make good decisions, but there might well be
> another explanation.

Sorry it took some time to switch back to this thread.

I revisited the part about cost estimates for grouped paths in this
patch, and I found a big issue: the row estimate for a join path could
be significantly inaccurate if there is a grouped join path beneath
it.

The reason is that it is very tricky to set the size estimates for a
grouped join relation. For a non-grouped join relation, we know that
all its paths have the same rowcount estimate (well, in theory). But
this is not true for a grouped join relation. Suppose we have a
grouped join relation for t1/t2 join. There might be two paths for
it:

    Aggregate
      -> Join
           -> Scan on t1
           -> Scan on t2

Or

    Join
      -> Scan on t1
      -> Aggregate
           -> Scan on t2

These two paths can have very different rowcount estimates, and we
have no way of knowing which one to set for this grouped join
relation, because we do not know which path would be picked in the
final plan.

This issue can be illustrated with the query below.

    create table t (a int, b int, c int);
    insert into t select i%10, i%10, i%10 from generate_series(1,1000)i;
    analyze t;

    set enable_eager_aggregate to on;

    explain (costs on)
    select sum(t2.c) from t t1 join t t2 on t1.a = t2.a
      join t t3 on t2.b = t3.b group by t3.a;

                                        QUERY PLAN
    ---------------------------------------------------------------------------------------
     Finalize HashAggregate (cost=6840.60..6840.70 rows=10 width=12)
       Group Key: t3.a
       -> Nested Loop (cost=1672.00..1840.60 rows=1000000 width=12)
             Join Filter: (t2.b = t3.b)
             -> Partial HashAggregate (cost=1672.00..1672.10 rows=10 width=12)
                   Group Key: t2.b
                   -> Hash Join (cost=28.50..1172.00 rows=100000 width=8)
                         Hash Cond: (t1.a = t2.a)
                         -> Seq Scan on t t1 (cost=0.00..16.00 rows=1000 width=4)
                         -> Hash (cost=16.00..16.00 rows=1000 width=12)
                               -> Seq Scan on t t2 (cost=0.00..16.00 rows=1000 width=12)
             -> Materialize (cost=0.00..21.00 rows=1000 width=8)
                   -> Seq Scan on t t3 (cost=0.00..16.00 rows=1000 width=8)
    (13 rows)

Look at the Nested Loop node:

    -> Nested Loop (cost=1672.00..1840.60 rows=1000000 width=12)

How can a 10-row outer path joining a 1000-row inner path generate
1000000 rows? This is because we are using the plan of the first path
described above, and the rowcount estimate of the second path. What a
kluge!

To address this issue, one solution I'm considering is to recalculate
the row count estimate for a grouped join path using its outer and
inner paths. While this may seem expensive, it might not be that bad
since we will cache the results of the selectivity calculation. In
fact, this is already the approach we take for parameterized join
paths (see get_parameterized_joinrel_size).

Any thoughts on this?

Thanks
Richard
On Thu, Sep 5, 2024 at 9:40 AM Tender Wang <tndrwang@gmail.com> wrote:
> 1. In setup_eager_aggregation(), before calling create_agg_clause_infos(),
> it does some checks to see if eager aggregation is available. Can we move
> those checks into a function, for example can_eager_agg(), like
> can_partial_agg() does?

We can do this, but I'm not sure this would be better.

> 2. I found that outside of joinrel.c everyone uses IS_DUMMY_REL, but in
> joinrel.c, Tom always uses is_dummy_rel() while other committers use
> IS_DUMMY_REL.

They are essentially the same: IS_DUMMY_REL() is a macro that wraps
is_dummy_rel(). I think they are interchangeable, and I don't have a
preference for which one is better.

> 3. The attached patch does not consider FDWs when creating a path for
> grouped_rel or grouped_join. Do we need to think about FDWs?

We may add support for foreign relations in the future, but for now, I
think we'd better not expand the scope too much until we ensure that
everything is working correctly.

Thanks
Richard
On Wed, Sep 11, 2024 at 10:52 AM Tender Wang <tndrwang@gmail.com> wrote:
> 1. In make_one_rel(), we have the below code:
>
>     /*
>      * Build grouped base relations for each base rel if possible.
>      */
>     setup_base_grouped_rels(root);
>
> As far as I know, each base rel only has one grouped base relation, if
> possible. The comment may be changed to "Build a grouped base relation
> for each base rel if possible."

Yeah, each base rel has only one grouped rel. However, there is a
comment nearby stating 'consider_parallel flags for each base rel',
which confuses me about whether it should be singular or plural in
this context. Perhaps someone more proficient in English could
clarify this.

> 2. According to the comments of generate_grouped_paths(), we may
> generate paths for a grouped relation on top of paths of a join
> relation. So the "rel_plain" argument in generate_grouped_paths() may
> be confusing: "plain" usually means "base rel". How about renaming
> rel_plain to input_rel?

I don't think 'plain relation' necessarily means 'base relation'. In
this context I think it can mean 'non-grouped relation'. But maybe
I'm wrong.

> 3. In create_partial_grouping_paths(), the partially_grouped_rel could
> have already been created due to eager aggregation. If
> partially_grouped_rel exists, its reltarget has already been created.
> So do we need the logic below?
>
>     partially_grouped_rel->reltarget =
>         make_partial_grouping_target(root, grouped_rel->reltarget,
>                                      extra->havingQual);

Yeah, maybe we can avoid building the target list here for a
partially_grouped_rel that is generated by eager aggregation.

Thanks
Richard
On Fri, Sep 13, 2024 at 3:48 PM Tender Wang <tndrwang@gmail.com> wrote:
> Since MERGE/SPLIT partition has been reverted, the tests
> *partition_merge* and *partition_split* should be removed from
> parallel_schedule. After doing the above, the 0002 patch can be applied.

Yeah, that's what I need to do.

Thanks
Richard
On Sat, Oct 5, 2024 at 6:23 PM Richard Guo <guofenglinux@gmail.com> wrote:
> On Fri, Sep 27, 2024 at 11:53 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > Here is an updated version of this patch that fixes the rowcount
> > estimate issue along this routine. (see set_joinpath_size.)
>
> I have worked on inventing some heuristics to limit the planning
> effort of eager aggregation. One simple yet effective approach I'm
> thinking of is to consider a grouped path as NOT useful if its row
> reduction ratio falls below a predefined minimum threshold. Currently
> I'm using 0.5 as the threshold, but I'm open to other values.

I ran the TPC-DS benchmark at scale 10 and observed eager aggregation
applied in several queries, including q4, q8, q11, q23, q31, q33, and
q77. Notably, the regression in q19 that Tender identified with v11
has disappeared in v13.

Here's a comparison of Execution Time and Planning Time for the seven
queries with eager aggregation disabled versus enabled (best of 3).

Execution Time:

        EAGER-AGG-OFF    EAGER-AGG-ON

q4      105787.963 ms    34807.938 ms
q8        1407.454 ms     1654.923 ms
q11      67899.213 ms    18670.086 ms
q23      45945.849 ms    42990.652 ms
q31      10463.536 ms    10244.175 ms
q33       2186.928 ms     2217.228 ms
q77       2360.565 ms     2416.674 ms

Planning Time:

        EAGER-AGG-OFF    EAGER-AGG-ON

q4      2.334 ms         2.602 ms
q8      0.685 ms         0.647 ms
q11     0.935 ms         1.094 ms
q23     2.666 ms         2.582 ms
q31     1.051 ms         1.206 ms
q33     1.248 ms         1.796 ms
q77     0.967 ms         0.962 ms

There are good performance improvements in q4 and q11 (3~4 times). For
the other queries, execution times remain largely unchanged, falling
within the margin of error, with no notable regressions observed. For
the planning time, I do not see notable regressions for any of the
seven queries.

It seems that the new cost estimates and the new heuristic are working
pretty well.

Thanks
Richard
On Sat, Oct 5, 2024 at 6:23 PM Richard Guo <guofenglinux@gmail.com> wrote:
> On Fri, Sep 27, 2024 at 11:53 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > Here is an updated version of this patch that fixes the rowcount
> > estimate issue along this routine. (see set_joinpath_size.)

In the function setup_eager_aggregation(), can we be more conservative
about the cases where eager aggregation can be applied? I see the
following cases where eager aggregation is not OK. We can return
earlier, so we won't call create_grouping_expr_infos(root) and
create_agg_clause_infos(root), and we can avoid unintended
consequences.

1. root->parse->resultRelation > 0. Just to be 100% sure we are only
dealing with SELECT; or we can add an Assert at the end of
setup_eager_aggregation().
2. The join type is FULL JOIN (I am not sure about semijoin and
anti-semijoin types).
3. root->parse->windowClause != NIL.

I am not sure whether enable_eager_aggregate can be useful when a
LIMIT clause is present; the code comments do not mention it. I am
also not sure about the locking clause, since the comments do not
mention it either.

    EXPLAIN (COSTS OFF, settings, verbose)
    SELECT avg(t2.c)
    FROM (select * from eager_agg_t1 for update) t1
    JOIN (select * from eager_agg_t2 for update) t2 ON t1.b = t2.b
    GROUP BY t1.a;

Can eager aggregation apply to the above query?

In struct PlannerInfo:

    /* list of AggClauseInfos */
    List *agg_clause_list;
    /* list of GroupExprInfos */
    List *group_expr_list;
    /* list of plain Vars contained in targetlist and havingQual */
    List *tlist_vars;

We can add comments noting that agg_clause_list and tlist_vars are
unique (contain no duplicates).

There is no doc entry in doc/src/sgml/config.sgml. We can put one
after the varlistentry for enable_bitmapscan, and at least mention
that the default value of enable_eager_aggregate is
<literal>off</literal>.

There are no tests related to aggregates with FILTER clauses; the
patch currently seems to support them (see the sketch after this
message).

Some of the "foreach" loops can be rewritten using foreach_node; see
https://git.postgresql.org/cgit/postgresql.git/commit/?id=14dd0f27d7cd56ffae9ecdbe324965073d01a9ff
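[Following up on the FILTER point above, a hedged sketch of the kind of
query a regression test might cover, assuming the eager_agg_t1 and
eager_agg_t2 tables used earlier in this message:

    SET enable_eager_aggregate TO on;
    EXPLAIN (COSTS OFF)
    SELECT t1.a, sum(t2.c) FILTER (WHERE t2.c > 0)
    FROM eager_agg_t1 t1
    JOIN eager_agg_t2 t2 ON t1.b = t2.b
    GROUP BY t1.a;
    -- if a Partial Aggregate appears below the join, the FILTER qual
    -- must be applied during the partial transition phase; a test like
    -- this would pin that behavior down]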
The patch contains:

    /*
     * Eager aggregation is only possible if equality of grouping keys, as
     * defined by the equality operator, implies bitwise equality.
     * Otherwise, if we put keys with different byte images into the same
     * group, we may lose some information that could be needed to
     * evaluate upper qual clauses.
     *
     * For example, the NUMERIC data type is not supported because values
     * that fall into the same group according to the equality operator
     * (e.g. 0 and 0.0) can have different scale.
     */
    tce = lookup_type_cache(exprType((Node *) tle->expr),
                            TYPECACHE_BTREE_OPFAMILY);
    if (!OidIsValid(tce->btree_opf) ||
        !OidIsValid(tce->btree_opintype))
        return;

    equalimageproc = get_opfamily_proc(tce->btree_opf,
                                       tce->btree_opintype,
                                       tce->btree_opintype,
                                       BTEQUALIMAGE_PROC);
    if (!OidIsValid(equalimageproc) ||
        !DatumGetBool(OidFunctionCall1Coll(equalimageproc,
                                           tce->typcollation,
                                           ObjectIdGetDatum(tce->btree_opintype))))
        return;

I am confused by BTEQUALIMAGE_PROC.

    * To facilitate B-Tree deduplication, an operator class may choose to
    * offer a fourth amproc procedure (BTEQUALIMAGE_PROC). For full details,
    * see doc/src/sgml/btree.sgml.

The above is the comment about BTEQUALIMAGE_PROC in
src/include/access/nbtree.h.

    equalimage
    Optionally, a btree operator family may provide equalimage ("equality
    implies image equality") support functions, registered under support
    function number 4. These functions allow the core code to determine
    when it is safe to apply the btree deduplication optimization.
    Currently, equalimage functions are only called when building or
    rebuilding an index.

The above is the description of BTEQUALIMAGE_PROC at
https://www.postgresql.org/docs/current/btree.html#BTREE-SUPPORT-FUNCS

Integers support eager aggregation:

    select amproc.*, amproclefttype::regtype
    from pg_amproc amproc
    join pg_opfamily opf on amproc.amprocfamily = opf.oid
    where amproc.amprocnum = 4
      and amproc.amproclefttype = amproc.amprocrighttype
      and opf.opfmethod = 403
      and amproc.amprocrighttype = 'int'::regtype;

returns

      oid  | amprocfamily | amproclefttype | amprocrighttype | amprocnum |    amproc    | amproclefttype
    -------+--------------+----------------+-----------------+-----------+--------------+----------------
     10052 |         1976 |             23 |              23 |         4 | btequalimage | integer

But btequalimage returns true unconditionally. So overall I doubt the
BTEQUALIMAGE_PROC flag usage here is correct.
On Fri, Oct 18, 2024 at 12:44 PM jian he <jian.universality@gmail.com> wrote:
> 1. root->parse->resultRelation > 0. Just to be 100% sure we are only
> dealing with SELECT; or we can add an Assert at the end of
> setup_eager_aggregation().

Can GROUP BY clauses be used in INSERT/UPDATE/DELETE/MERGE statements?
If not, I think there is no need to check 'resultRelation > 0', as
setup_eager_aggregation already checks for GROUP BY clauses.

> 2. The join type is FULL JOIN (I am not sure about semijoin and
> anti-semijoin types).

The presence of a FULL JOIN does not preclude the use of eager
aggregation. We still can push a partial aggregation down to a level
that is above the FULL JOIN.

> 3. root->parse->windowClause != NIL.

Why does the presence of windowClause prevent the use of eager
aggregation?

> There is no doc entry in doc/src/sgml/config.sgml. We can put one
> after the varlistentry for enable_bitmapscan, and at least mention
> that the default value of enable_eager_aggregate is
> <literal>off</literal>.

Yeah, that's what I need to do.

Thanks
Richard
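[For what it's worth, a hedged illustration of the resultRelation
question above (hypothetical tables summary/detail, not from the
thread): an INSERT ... SELECT with GROUP BY does parse, but the parser
turns the SELECT into a subquery RTE, so the grouping is planned at the
subquery level rather than in the query whose resultRelation is set.

    INSERT INTO summary
    SELECT a, count(*)
    FROM detail
    GROUP BY a;
    -- the GROUP BY belongs to the inner "*SELECT*" subquery, not to
    -- the INSERT's own query level, which has no groupClause]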
On Fri, Oct 18, 2024 at 10:22 PM jian he <jian.universality@gmail.com> wrote:
> So overall I doubt the BTEQUALIMAGE_PROC flag usage here is correct.

The BTEQUALIMAGE_PROC flag is used to prevent eager aggregation for
types whose equality operators do not imply bitwise equality, such as
NUMERIC.

After a second thought, I think it should be OK to just check the
equality operator specified by the SortGroupClause for btree equality.
I'm not very sure about this point, though, and would appreciate any
inputs.

Thanks
Richard
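[As a runnable illustration of the NUMERIC hazard under discussion (not
code from the patch): btree equality on numeric treats 0 and 0.0 as
equal even though they carry different display scales, so grouping
early can lose scale information needed later.

    SELECT 0::numeric = 0.0::numeric;   -- true: btree equality ignores scale

    SELECT n::text
    FROM (VALUES (0::numeric), (0.0::numeric)) AS t(n)
    GROUP BY n;
    -- one row out; whether it reads '0' or '0.0' depends on which
    -- input value the grouping happened to keep as the representative]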
On Wed, Sep 25, 2024 at 3:03 AM Richard Guo <guofenglinux@gmail.com> wrote:
> On Wed, Sep 11, 2024 at 10:52 AM Tender Wang <tndrwang@gmail.com> wrote:
> > 1. In make_one_rel(), we have the below code:
> >
> >     /*
> >      * Build grouped base relations for each base rel if possible.
> >      */
> >     setup_base_grouped_rels(root);
> >
> > As far as I know, each base rel only has one grouped base relation,
> > if possible. The comment may be changed to "Build a grouped base
> > relation for each base rel if possible."
>
> Yeah, each base rel has only one grouped rel. However, there is a
> comment nearby stating 'consider_parallel flags for each base rel',
> which confuses me about whether it should be singular or plural in
> this context. Perhaps someone more proficient in English could
> clarify this.

It's not confusing the way you have it, but I think an English teacher
wouldn't like it, because part of the sentence is singular ("each base
rel") and the other part is plural ("grouped base relations").
Tender's proposed rewrite fixes that. Another way to fix it is to
write "Build grouped relations for base rels where possible".

> > 2. According to the comments of generate_grouped_paths(), we may
> > generate paths for a grouped relation on top of paths of a join
> > relation. So the "rel_plain" argument in generate_grouped_paths()
> > may be confusing: "plain" usually means "base rel". How about
> > renaming rel_plain to input_rel?
>
> I don't think 'plain relation' necessarily means 'base relation'. In
> this context I think it can mean 'non-grouped relation'. But maybe
> I'm wrong.

We use the term "plain relation" in several different ways. In the
header comments for addFkRecurseReferenced, it means a non-partitioned
relation. In the struct comments for RangeTblEntry, it means any sort
of named thing in pg_class that you can scan, so either a partitioned
or unpartitioned table but not a join or a table function or
something. AFAICT, the most common meaning of "plain relation" is a
pg_class entry where relkind==RELKIND_RELATION.

--
Robert Haas
EDB: http://www.enterprisedb.com
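[For reference, that last sense is easy to inspect from the catalogs;
an illustrative query, not from the thread:

    SELECT relname
    FROM pg_class
    WHERE relkind = 'r';   -- 'r' is RELKIND_RELATION, the most common
                           -- sense of "plain relation"]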
On Tue, Sep 24, 2024 at 11:20 PM Richard Guo <guofenglinux@gmail.com> wrote:
> The reason is that it is very tricky to set the size estimates for a
> grouped join relation. For a non-grouped join relation, we know that
> all its paths have the same rowcount estimate (well, in theory). But
> this is not true for a grouped join relation. Suppose we have a
> grouped join relation for t1/t2 join. There might be two paths for
> it:

What exactly do you mean by "well, in theory" here? My understanding
of how things work today is that every relation is supposed to produce
a specific set of rows and every unparameterized path must produce
that set of rows. The order of the rows may vary but the set of rows
may not. With your proposed design here, that's no longer true.
Instead, what's promised is that the row sets will become equivalent
after a later FinalizeAggregate step. In a sense, this is like
parameterization or partial paths. Suppose I have:

    SELECT * FROM foo, bar WHERE foo.x = bar.x;

While every unparameterized path for bar has the same row count,
there's also the possibility of performing an index scan on bar.x
parameterized by foo.x, and that path will have a far lower row count
than the unparameterized paths. Instead of producing all the same rows
as every other path, the parameterized path promises only that if run
repeatedly, with all relevant values of foo.x, you'll eventually get
all the same rows you would have gotten from the unparameterized path.
Because of this difference, parameterized paths need special handling
in many different parts of the code.

And the same thing is true of partial paths. They also do not promise
to generate all the same rows -- instead, they promise that when run
simultaneously across multiple workers, the total set of rows returned
across all invocations will be equal to what a normal path would have
produced. Here again, there's a need for special handling because
these paths behave differently than standard paths.

I think what you're doing here is roughly equivalent to either of
these two cases. It's more like the parameterized path case. Instead
of having a path for a relation which is parameterized by some input
parameter, you have a path for a relation, say bar, which is partially
aggregated by some grouping column. But there's no guarantee of how
much partial aggregation has been done. In your example, PartialAgg(t1
JOIN t2) is "more aggregated" than t1 JOIN PartialAgg(t2), so the row
counts are different. This makes me quite nervous. You can't compare a
parameterized path to an unparameterized path, but you can compare it
to another parameterized path if the parameterizations are the same.
You can't compare a partial path to a non-partial path, but you can
compare partial paths to each other. But with this work,
unparameterized, non-partial paths in the same RelOptInfo don't seem
like they are truly comparable. Maybe that's OK, but I'm not sure that
it isn't going to break other things.

You might for example imagine a design where PartialAgg(t1 JOIN t2)
and t1 JOIN PartialAgg(t2) get separate RelOptInfos. After all, there
are probably multiple ways to generate paths for each of those things,
and paths in each category can be compared to each other apples to
apples. What's less clear is whether it's fair to compare across the
two categories, and how many assumptions will be broken by doing so.
I'm not sure that it's right to have separate RelOptInfos; we
definitely don't want to create more RelOptInfos than necessary. At
the same time, if we mix together all of those paths into a single
RelOptInfo, we need to be confident that we're neither going to break
anything nor introduce too many special cases into hot code paths. For
instance, set_joinpath_size() represents an unwelcome complexity
increase that could impact performance generally, even apart from the
cases this patch intends to handle.

It's tempting to wonder if there's some way that we can avoid
generating paths for both PartialAgg(t1 JOIN t2) and t1 JOIN
PartialAgg(t2). Either the former has lower cardinality, or the latter
does. It seems likely that the lower-cardinality set is the winning
strategy. Even if the path has higher cost to generate, we save work
at every subsequent join level and at the final aggregation step. Are
there counterexamples where it's better to use a path from the
higher-cardinality set?

By the way, the work of figuring out what target list should be
produced by partial grouping is done by init_grouping_targets(), but
the comments seem to take it for granted that I know what result we're
trying to produce, and I don't. I think some more high-level
explanation of the goals of this code would be useful. It seems to me
that if I'm looking at a path for an ungrouped relation and it
produces a certain target list, then every column of that target list
is needed somewhere. If those columns are group keys, cool: we pass
those through. If they're inputs to the aggregates, cool: we feed them
to the aggregates. But if they are neither, then what? In the patch,
you either group on those columns or add them to the
possibly_dependent list depending on the result of
is_var_needed_by_join(). I can believe that there are some cases where
we can group on such columns and others where we can't, but find it
difficult to believe that this test reliably distinguishes between
those two cases. If it does, I don't understand why it does. Don't I
need to know something about how those columns are used in the upper
joins? Like, if those columns are connected by a chain of
binary-equality operators back to the user's choice of grouping
columns, that sounds good, but this test doesn't distinguish between
that case and an upper join on the < operator.
create_grouping_expr_infos() does reason based on whether there's an
equal-image operator available, but AIUI that's only reasoning about
the group columns the user mentioned, not the sort of implicit
grouping columns that init_grouping_targets() is creating.

I spent a lot of time thinking today about what makes it safe to push
down grouping or not. I'm not sure that I have a solid answer to that
question even yet. But it seems to me that there are at least two
cases that clearly won't fly. One is the case where the partial
aggregation would merge rows that need to be included in the final
aggregation with rows that should be filtered out. If the
partially-grouped relation simply has a filter qual, that's fine,
because it will be evaluated before the aggregation. But there might
be a qual that has to be evaluated later, either because (a) it
involves another rel, like this_rel.x + that_rel.y > 10 or (b) it
appears in the ON clause of an outer join and thus needs to be
deferred to the level of the OJ, e.g. A LEFT JOIN B ON a.x = b.x AND
b.y = 42. I wonder if you can comment on how these cases are handled.
Perhaps this coding around functional dependencies has something to do
with it, but it isn't clear to me.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com
On Sat, Oct 5, 2024 at 11:30 PM Richard Guo <guofenglinux@gmail.com> wrote:
> Here's a comparison of Execution Time and Planning Time for the seven
> queries with eager aggregation disabled versus enabled (best of 3).
>
> Execution Time:
>
>         EAGER-AGG-OFF    EAGER-AGG-ON
>
> q4      105787.963 ms    34807.938 ms
> q8        1407.454 ms     1654.923 ms
> q11      67899.213 ms    18670.086 ms
> q23      45945.849 ms    42990.652 ms
> q31      10463.536 ms    10244.175 ms
> q33       2186.928 ms     2217.228 ms
> q77       2360.565 ms     2416.674 ms

Could you attach the EXPLAIN ANALYZE output for these queries, with
and without the patch?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Oct 29, 2024 at 3:57 AM Richard Guo <guofenglinux@gmail.com> wrote:
> On Fri, Oct 18, 2024 at 10:22 PM jian he <jian.universality@gmail.com> wrote:
> > So overall I doubt the BTEQUALIMAGE_PROC flag usage here is correct.
>
> The BTEQUALIMAGE_PROC flag is used to prevent eager aggregation for
> types whose equality operators do not imply bitwise equality, such as
> NUMERIC.
>
> After a second thought, I think it should be OK to just check the
> equality operator specified by the SortGroupClause for btree equality.
> I'm not very sure about this point, though, and would appreciate any
> inputs.

Well, the key thing here is the reasoning, which you don't really
spell out either here or in the patch. The patch's justification for
the use of BTEQUALIMAGE_PROC is that, if we use non-equal-image
operators, "we may lose some information that could be needed to
evaluate upper qual clauses." But there's no explanation of why that
would happen, and I think that makes it difficult to have a good
discussion.

Here's an example from the optimizer README:

# 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc)
# = A leftjoin (B leftjoin C on (Pbc)) on (Pab)
#
# Identity 3 only holds if predicate Pbc must fail for all-null B rows
# (that is, Pbc is strict for at least one column of B). If Pbc is not
# strict, the first form might produce some rows with nonnull C columns
# where the second form would make those entries null.

This isn't the clearest justification for a rule that I've ever read,
but it's something. Someone reading this can think about whether the
two sentences of justification are a correct argument that Pbc must be
strict in order for the identity to hold.

I think you should be trying to offer similar (or better)
justifications, and perhaps similar identities as well. It's a little
bit hard to think about how to write the identities for this patch,
although you could start with aggregate(X) =
finalizeagg(partialagg(X)) and then explain how the partialagg can be
commuted with certain joins and what is required for it to do so. The
argument needs to be detailed enough that we can evaluate whether it's
correct or not.

Personally, I'm still stuck on the point I wrote about yesterday. When
you partially aggregate a set of rows, the resulting row serves as a
proxy for the original set of rows. So, all of those rows must have
the same destiny. As I mentioned then, if you ever partially aggregate
a row that should ultimately be filtered out with one that shouldn't,
that breaks everything. But also, I need all the rows that feed into
each partial group to be part of the same set of output rows. For
instance, suppose I run a report like this:

SELECT o.order_number, SUM(l.quantity)
FROM orders o, order_lines l
WHERE o.order_id = l.order_id
GROUP BY 1;

If I'm thinking of executing this as FinalizeAgg(orders JOIN
PartialAgg(order_lines)), what must be true in order for that to be a
valid execution plan? I certainly can't aggregate on some unrelated
column, say part_number. If I do, then first of all I don't know how
to get an order_id column out so that I can still do the join. But
even if that were no issue, you might have two rows with different
values of the order_id column and the same value for part_number, and
then the partial groups that you've created on the order_lines table
don't line up properly with the groups that you need to create on the
orders table. Some particular order_id might need some of the rows
that went into a certain part_number group and not others; that's no
good.

After some thought, I think the process of deduction goes something
like this. We ultimately need to aggregate on a column in the orders
table, but we want to partially aggregate the order_lines table. So we
need a set of columns in the order_lines table that can act as a proxy
for o.order_number. Our proxy should be all the columns that appear in
the join clauses between orders and order_lines. That way, all the
rows in a single partially aggregated row will have the "same"
order_id according to the operator class implementing the = operator,
so for a given row in the "orders" table, either every row in the
group has o.order_id = l.order_id or none of them do; hence they're
all part of the same set of output rows.

It doesn't matter, for example, whether o.order_number is unique. If
it isn't, then we'll flatten together some different rows from the
orders table -- but each of those rows will match all the rows in
partial groupings where o.order_id = partially_grouped_l.order_id and
none of the rows where that isn't the case, so the optimization is
still valid. Likewise it doesn't matter whether o.order_id is unique:
if order_id = 1 occurs twice, then both of the rows will match the
partially aggregated group with order_id = 1, but that's fine. The
only thing that's a problem is if the same row from orders was going
to match some but not all of the partially aggregated rows from some
order_lines group, and that won't happen here.

Now consider this example:

SELECT o.order_number, SUM(l.quantity)
FROM orders o, order_lines l
WHERE o.order_id = l.order_id AND o.something < l.something
GROUP BY 1;

Here, we cannot partially group order_lines on just l.order_id,
because we might have a row in the orders table with order_id = 1 and
something = 1 and two rows in the order_lines table that both have
order_id = 1 but one has something = 0 and the other has something =
2. The orders row needs to join to one of those but not the other, so
we can't put them in the same partial group. However, it seems like it
would be legal to group order_lines on (order_id, something), provided
that the operator classes we use for the grouping operation match the
operator classes of the operators in the join clause - i.e. we group
on order_id using the operator class of = and on something using the
operator class of <. If the operator classes don't match in this way,
then it could happen that the grouping operator thinks the values are
the same but the join operator thinks they're different. (Everything
is also OK if the grouping operator tests bitwise-equality, because
then nothing can ever get merged that shouldn't; but other notions of
equality are also fine as long as they're compatible with the operator
actually used.)

Now let's consider this example, using an imaginary user-defined
operator:

SELECT o.order_number, SUM(l.quantity)
FROM orders o, order_lines l
WHERE o.order_id = l.order_id AND o.something ?!?! l.something
GROUP BY 1;

Here, we can partially aggregate order_lines by (order_id, something)
as long as order_id is grouped using bitwise equality OR the same
operator class as the = operator used in the query, and something has
to use bitwise equality.

What about this:

SELECT o.order_number, SUM(l.quantity)
FROM orders o LEFT JOIN order_lines l
  ON o.order_id = l.order_id AND l.something = 1
GROUP BY 1;

It's really important that we don't try to aggregate on just
l.order_id here. Some rows in the group might have l.something = 1 and
others not. It would be legal (but probably not efficient) to
aggregate order_lines on (order_id, something).

So it seems to me that the general rule here is that when the table
for which we need a surrogate key is directly joined to the table
where the original grouping column is located, we need to group on all
columns involved in the join clause, using either compatible operators
or bitwise equality operators. As a practical matter, we probably only
want to do the optimization when all of the join clauses are
equijoins. Then we don't need to worry about bitwise equality at all,
AFAICS; we just group using the operator class that contains the
operator specified by the user.

What if there are more than two tables involved, like this:

SELECT o.order_number, MAX(n.creation_time)
FROM orders o, order_lines l, order_line_notes n
WHERE o.order_id = l.order_id AND l.note_id = n.note_id
GROUP BY 1;

Here, there's no direct connection between the table with the original
grouping column and the table we want to aggregate. Using the rule
mentioned above, we can deduce that l.order_id is a valid grouping
column for order_lines. Applying the same rule again, we can then
deduce that n.note_id is a valid grouping column for order_line_notes.
We could either group order_line_notes on n.note_id and then perform
the remaining joins; or we could join order_lines and order_line_notes
and then partially aggregate the result on l.order_id.

What if there are multiple grouping columns, like this:

SELECT foo.x, bar.y, SUM(baz.z)
FROM foo, bar, baz
WHERE foo.a = baz.a AND bar.b = baz.b
GROUP BY 1, 2;

In this case, we need to find a surrogate grouping column for foo.x
and also a surrogate grouping column for bar.y. We can group baz on
(a, b); but not just on a or just on b.

Finally, I think this example is interesting:

SELECT p.title, SUM(c.stuff)
FROM parent p, link l, child c
WHERE p.x = l.x AND l.y = c.y AND p.z = c.z
GROUP BY 1;

Based on the rule that I articulated earlier, you might think we can
just look at the join clauses between parent and child and group the
child on c.z. However, that's not correct -- we'd have to group on
both c.y and c.z. I'm not sure how to adjust the rule so that it
produces the right answer here.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Oct 29, 2024 at 3:51 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > 2. The join type is FULL JOIN (I am not sure about semijoin and
> > anti-semijoin types).
>
> The presence of a FULL JOIN does not preclude the use of eager
> aggregation. We still can push a partial aggregation down to a level
> that is above the FULL JOIN.

I think you can also push a partial aggregation step through a FULL
JOIN. Consider this:

SELECT p.name, string_agg(c.name, ', ')
FROM parents p FULL JOIN children c ON p.id = c.parent_id
GROUP BY 1;

I don't see why it matters here that this is a FULL JOIN. If it were
an inner join, we'd have one group for every parent that has at least
one child. Since it's a full join, we'll also get one group for every
parent with no children, and also one group for the null parent. But
that should work fine with a partially aggregated c.

--
Robert Haas
EDB: http://www.enterprisedb.com
hi. Still trying to understand v13. I found a bug.

Minimum test:

    drop table if exists t1, t2;
    CREATE TABLE t1 (a int, b int, c int);
    CREATE TABLE t2 (a int, b int, c int);

    SET enable_eager_aggregate TO on;
    explain(costs off, settings)
    SELECT avg(t2.a), t1.c
    FROM t1 JOIN t2 ON t1.b = t2.b
    GROUP BY t1.c
    HAVING grouping(t1.c) > 0;

In create_agg_clause_infos:

    foreach(lc, tlist_exprs)
    {
        Expr *expr = (Expr *) lfirst(lc);

        if (IsA(expr, GroupingFunc))
            return;
    }

    if (root->parse->havingQual != NULL)
    {
        List *having_exprs;

        having_exprs = pull_var_clause((Node *) root->parse->havingQual,
                                       PVC_INCLUDE_AGGREGATES |
                                       PVC_RECURSE_PLACEHOLDERS);
        if (having_exprs != NIL)
        {
            tlist_exprs = list_concat(tlist_exprs, having_exprs);
            list_free(having_exprs);
        }
    }

havingQual can have a GroupingFunc. If that happens, there is a
segmentation fault.
On Tue, Oct 29, 2024 at 9:59 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 25, 2024 at 3:03 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > Yeah, each base rel has only one grouped rel. However, there is a
> > comment nearby stating 'consider_parallel flags for each base rel',
> > which confuses me about whether it should be singular or plural in
> > this context. Perhaps someone more proficient in English could
> > clarify this.
>
> It's not confusing the way you have it, but I think an English teacher
> wouldn't like it, because part of the sentence is singular ("each base
> rel") and the other part is plural ("grouped base relations").
> Tender's proposed rewrite fixes that. Another way to fix it is to
> write "Build grouped relations for base rels where possible".

Thank you for the suggestion. The new wording looks much better
grammatically. It seems to me that we should address the nearby
comment too, which goes like "consider_parallel flags for each base
rel", as each rel has only one consider_parallel flag.

> > I don't think 'plain relation' necessarily means 'base relation'. In
> > this context I think it can mean 'non-grouped relation'. But maybe
> > I'm wrong.
>
> We use the term "plain relation" in several different ways. In the
> header comments for addFkRecurseReferenced, it means a non-partitioned
> relation. In the struct comments for RangeTblEntry, it means any sort
> of named thing in pg_class that you can scan, so either a partitioned
> or unpartitioned table but not a join or a table function or
> something. AFAICT, the most common meaning of "plain relation" is a
> pg_class entry where relkind==RELKIND_RELATION.

Agreed.

Thanks
Richard
On Wed, Oct 30, 2024 at 5:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 24, 2024 at 11:20 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > The reason is that it is very tricky to set the size estimates for a
> > grouped join relation. For a non-grouped join relation, we know that
> > all its paths have the same rowcount estimate (well, in theory). But
> > this is not true for a grouped join relation. Suppose we have a
> > grouped join relation for t1/t2 join. There might be two paths for
> > it:
>
> What exactly do you mean by "well, in theory" here? My understanding
> of how things work today is that every relation is supposed to produce
> a specific set of rows and every unparameterized path must produce
> that set of rows. The order of the rows may vary but the set of rows
> may not. With your proposed design here, that's no longer true.
> Instead, what's promised is that the row sets will become equivalent
> after a later FinalizeAggregate step. In a sense, this is like
> parameterization or partial paths.

Yeah, you're correct that currently each relation is expected to
produce the same specific set of rows. When I say "well, in theory" I
mean that for a join relation, all its unparameterized paths should
theoretically have the same row count estimate. However, in practice,
because there is more than one way to make a joinrel for more than two
base relations, and the selectivity estimation routines don't handle
all cases equally well, we might get different row count estimates
depending on the pair provided.

And yes, with the grouped relations proposed in this patch, the
situation changes. For a grouped join relation, different paths can
produce very different row sets, depending on where the partial
aggregation is placed in the path tree. This is also why we need to
recalculate the row count estimate for a grouped join path using the
outer and inner paths provided, rather than using path->parent->rows
directly. This is very like the parameterized path case.

> I think what you're doing here is roughly equivalent to either of
> these two cases. It's more like the parameterized path case. Instead
> of having a path for a relation which is parameterized by some input
> parameter, you have a path for a relation, say bar, which is partially
> aggregated by some grouping column. But there's no guarantee of how
> much partial aggregation has been done. In your example, PartialAgg(t1
> JOIN t2) is "more aggregated" than t1 JOIN PartialAgg(t2), so the row
> counts are different. This makes me quite nervous. You can't compare a
> parameterized path to an unparameterized path, but you can compare it
> to another parameterized path if the parameterizations are the same.
> You can't compare a partial path to a non-partial path, but you can
> compare partial paths to each other. But with this work,
> unparameterized, non-partial paths in the same RelOptInfo don't seem
> like they are truly comparable. Maybe that's OK, but I'm not sure that
> it isn't going to break other things.

Perhaps we could introduce a GroupPathInfo into the Path structure to
store common information for a grouped path, such as the location of
the partial aggregation (which could be the relids of the relation on
top of which we place the partial aggregation) and the estimated
rowcount for this grouped path, similar to how ParamPathInfo functions
for parameterized paths. Then we should be able to compare the grouped
paths of the same location apples to apples. I haven't thought this
through in detail yet, though.

> It's tempting to wonder if there's some way that we can avoid
> generating paths for both PartialAgg(t1 JOIN t2) and t1 JOIN
> PartialAgg(t2). Either the former has lower cardinality, or the latter
> does. It seems likely that the lower-cardinality set is the winning
> strategy. Even if the path has higher cost to generate, we save work
> at every subsequent join level and at the final aggregation step. Are
> there counterexamples where it's better to use a path from the
> higher-cardinality set?

This is very appealing if we can retain only the lower-cardinality
path, as it would greatly simplify the overall design. I haven't
looked for counterexamples yet, but I plan to do so later.

> By the way, the work of figuring out what target list should be
> produced by partial grouping is done by init_grouping_targets(), but
> the comments seem to take it for granted that I know what result we're
> trying to produce, and I don't. I think some more high-level
> explanation of the goals of this code would be useful. It seems to me
> that if I'm looking at a path for an ungrouped relation and it
> produces a certain target list, then every column of that target list
> is needed somewhere. If those columns are group keys, cool: we pass
> those through. If they're inputs to the aggregates, cool: we feed them
> to the aggregates. But if they are neither, then what? In the patch,
> you either group on those columns or add them to the
> possibly_dependent list depending on the result of
> is_var_needed_by_join(). I can believe that there are some cases where
> we can group on such columns and others where we can't, but find it
> difficult to believe that this test reliably distinguishes between
> those two cases. If it does, I don't understand why it does. Don't I
> need to know something about how those columns are used in the upper
> joins? Like, if those columns are connected by a chain of
> binary-equality operators back to the user's choice of grouping
> columns, that sounds good, but this test doesn't distinguish between
> that case and an upper join on the < operator.
> create_grouping_expr_infos() does reason based on whether there's an
> equal-image operator available, but AIUI that's only reasoning about
> the group columns the user mentioned, not the sort of implicit
> grouping columns that init_grouping_targets() is creating.

Yeah, this patch does not get it correct here. Basically the logic is
that for the partial aggregation pushed down to a non-aggregated
relation, we need to consider all columns of that relation involved in
upper join clauses and include them in the grouping keys. Currently,
the patch only checks whether a column is involved in upper join
clauses but does not verify how the column is used. We need to ensure
that the operator used in the join clause is at least compatible with
the grouping operator; otherwise, the grouping operator might
interpret the values as the same while the join operator sees them as
different.

Thanks
Richard
On Thu, Oct 31, 2024 at 12:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Well, the key thing here is the reasoning, which you don't really
> spell out either here or in the patch. The patch's justification for
> the use of BTEQUALIMAGE_PROC is that, if we use non-equal-image
> operators, "we may lose some information that could be needed to
> evaluate upper qual clauses." But there's no explanation of why that
> would happen, and I think that makes it difficult to have a good
> discussion.
>
> [...]
>
> Personally, I'm still stuck on the point I wrote about yesterday. When
> you partially aggregate a set of rows, the resulting row serves as a
> proxy for the original set of rows. So, all of those rows must have
> the same destiny. As I mentioned then, if you ever partially aggregate
> a row that should ultimately be filtered out with one that shouldn't,
> that breaks everything. But also, I need all the rows that feed into
> each partial group to be part of the same set of output rows.
>
> [...]
>
> So it seems to me that the general rule here is that when the table
> for which we need a surrogate key is directly joined to the table
> where the original grouping column is located, we need to group on all
> columns involved in the join clause, using either compatible operators
> or bitwise equality operators.
>
> [...]
>
> Based on the rule that I articulated earlier, you might think we can
> just look at the join clauses between parent and child and group the
> child on c.z. However, that's not correct -- we'd have to group on
> both c.y and c.z. I'm not sure how to adjust the rule so that it
> produces the right answer here.

Thank you for the thorough deduction process; this is something I
should have done before proposing the patch. As we discussed off-list,
what I need to do next is to establish a theory that proves the
transformation proposed in this patch is correct in all cases.

What I have in mind now is that to push a partial aggregation down to
a relation, we need to group by all the columns of that relation
involved in the upper join clauses, using compatible operators. This
is essential to ensure that an aggregated row from the partial
aggregation matches the other side of the join if and only if each row
in the partial group does, thereby ensuring that all rows in the same
partial group have the same 'destiny'.

Thanks
Richard
On Thu, Oct 31, 2024 at 9:16 PM jian he <jian.universality@gmail.com> wrote:
>
> hi.
> still trying to understand v13. found a bug.
>
> minimum test:
>
> drop table if exists t1, t2;
> CREATE TABLE t1 (a int, b int, c int);
> CREATE TABLE t2 (a int, b int, c int);
> SET enable_eager_aggregate TO on;
> explain(costs off, settings) SELECT avg(t2.a), t1.c FROM t1 JOIN t2 ON
> t1.b = t2.b GROUP BY t1.c having grouping(t1.c) > 0;
>
> havingQual can have GroupingFunc.
> if that happens, then segmentation fault.

Nice catch. Thanks. create_agg_clause_infos does check for GROUPING()
expressions, but not in the right place. Will fix it in the next
version.

Thanks
Richard
On Fri, Nov 1, 2024 at 2:21 AM Richard Guo <guofenglinux@gmail.com> wrote:
> ... an aggregated row from the partial
> aggregation matches the other side of the join if and only if each row
> in the partial group does, thereby ensuring that all rows in the same
> partial group have the same 'destiny'.

Ah, I really like this turn of phrase! I think it's clearer and simpler
than what I said. And I think it implies that we don't need to
explicitly deduce surrogate grouping keys. For example if we have A
JOIN B JOIN C JOIN D JOIN E JOIN F, grouped by columns from A, we don't
need to work out surrogate grouping keys for B and then C and then D
and then E and then F. We can just look at F's join clauses and that
tells us how to group, independent of anything else.

But is there any hole in that approach? I think the question is whether
the current row could be used in some way that doesn't show up in the
join clauses. I can't think of any way for that to happen, really. I
believe that any outerjoin-delayed quals will show up as join clauses,
and stuff like ORDER BY and HAVING will happen after the aggregation
(at least logically) so it should be fine. Window functions and ordered
aggregates may be blockers to the optimization, though: if the window
function needs access to the unaggregated rows, or even just needs to
know how many rows there are, then we'd better not aggregate them
before the window function runs; and if the aggregate is ordered, we
can only partially aggregate the data if it is already ordered in a way
that is compatible with the final, desired ordering. Another case we
might need to watch out for is RLS. RLS wants to apply all the security
quals before any non-leakproof functions, and pushing down the
aggregation might push an aggregate function past security quals.
Perhaps there are other cases to worry about as well; this is all I can
think of at the moment.

But regardless of those kinds of cases, the basic idea that we want the
partially aggregated rows to join if and only if the unaggregated rows
would have joined seems exactly correct to me, and that provides
theoretical justification for deriving the surrogate grouping key
directly from the join quals. Woot!

--
Robert Haas
EDB: http://www.enterprisedb.com
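A sketch of that derivation on a hypothetical three-table chain
(standing in for the A ... F example; the tables and columns here are
made up): the grouping key for the relation being pushed down is read
directly off its own join clauses, with no hop-by-hop deduction.

-- Hypothetical schema: a(g, id, k), b(id, a_id), c(b_id, k, v).
SELECT a.g, SUM(c.v)
FROM a
JOIN b ON a.id = b.a_id
JOIN c ON b.id = c.b_id AND a.k = c.k
GROUP BY a.g;

-- c's join clauses mention c.b_id and c.k, so a partial aggregate
-- pushed down to c groups on exactly those columns:
--   SELECT b_id, k, SUM(v) FROM c GROUP BY b_id, k
-- No surrogate grouping keys for b need to be derived along the way.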
On Thu, Aug 29, 2024 at 10:26 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > 2. I think there might be techniques we could use to limit planning
> > effort at an earlier stage when the approach doesn't appear
> > promising. For example, if the proposed grouping column is already
> > unique, the exercise is pointless (I think). Ideally we'd like to
> > detect that without even creating the grouped_rel. But the proposed
> > grouping column might also be *mostly* unique. For example, consider
> > a table with a million rows and a column with 500,000 distinct
> > values. I suspect it will be difficult for partial aggregation to
> > work out to a win in a case like this, because I think that the cost
> > of performing the partial aggregation will not reduce the cost
> > either of the final aggregation or of the intervening join steps by
> > enough to compensate. It would be best to find a way to avoid
> > generating a lot of rels and paths in cases where there's really not
> > much hope of a win.
> >
> > One could, perhaps, imagine going further with this by postponing
> > eager aggregation planning until after regular paths have been
> > built, so that we have good cardinality estimates. Suppose the query
> > joins a single fact table to a series of dimension tables. The final
> > plan thus uses the fact table as the driving table and joins to the
> > dimension tables one by one. Do we really need to consider partial
> > aggregation at every level? Perhaps just where there's been a
> > significant row count reduction since the last time we tried it, but
> > at the next level the row count will increase again?
> >
> > Maybe there are other heuristics we could use in addition or
> > instead.
>
> Yeah, one of my concerns with this work is that it can use
> significantly more CPU time and memory during planning once enabled.
> It would be great if we have some efficient heuristics to limit the
> effort. I'll work on that next and see what happens.

in v13, the latest version, we can do:

    /* ... and initialize these targets */
    if (!init_grouping_targets(root, rel, target, agg_input,
                               &group_clauses, &group_exprs))
        return NULL;

    if (rel->reloptkind == RELOPT_BASEREL && group_exprs != NIL)
    {
        foreach_node(Var, var, group_exprs)
        {
            if (var->varno == rel->relid &&
                has_unique_index(rel, var->varattno))
                return NULL;
        }
    }

since in init_grouping_targets we already Asserted that group_exprs is
a list of Vars.

--------------------------------------------------------------------------------

also in create_rel_agg_info, estimate_num_groups:

    result->group_exprs = group_exprs;
    result->grouped_rows = estimate_num_groups(root, result->group_exprs,
                                               rel->rows, NULL, NULL);

    /*
     * The grouped paths for the given relation are considered useful iff
     * the row reduction ratio is greater than EAGER_AGGREGATE_RATIO.
     */
    agg_info->agg_useful = (agg_info->grouped_rows <=
                            rel->rows * (1 - EAGER_AGGREGATE_RATIO));

If there are too many Vars in group_exprs, then result->grouped_rows
will be less accurate, and therefore agg_info->agg_useful will be less
accurate. Should we limit the number of Vars in group_exprs?
for example:

SET enable_eager_aggregate TO on;
drop table if exists eager_agg_t1, eager_agg_t2, eager_agg_t3;
CREATE TABLE eager_agg_t1 (a int, b int, c double precision);
CREATE TABLE eager_agg_t2 (a int, b int, c double precision);
INSERT INTO eager_agg_t1 SELECT i % 100, i, i FROM generate_series(1, 5) i;
INSERT INTO eager_agg_t2 SELECT i % 10, i, i FROM generate_series(1, 5) i;
INSERT INTO eager_agg_t2 SELECT i % 10, i, i FROM generate_series(-4, -2) i;

explain(costs off, verbose, settings)
SELECT t1.a, avg(t2.c)
FROM eager_agg_t1 t1
JOIN eager_agg_t2 t2 ON abs(t1.b) = abs(t2.b % 10 + t2.a)
GROUP BY 1;

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize HashAggregate
   Output: t1.a, avg(t2.c)
   Group Key: t1.a
   ->  Merge Join
         Output: t1.a, (PARTIAL avg(t2.c))
         Merge Cond: ((abs(((t2.b % 10) + t2.a))) = (abs(t1.b)))
         ->  Sort
               Output: t2.b, t2.a, (PARTIAL avg(t2.c)), (abs(((t2.b % 10) + t2.a)))
               Sort Key: (abs(((t2.b % 10) + t2.a)))
               ->  Partial HashAggregate
                     Output: t2.b, t2.a, PARTIAL avg(t2.c), abs(((t2.b % 10) + t2.a))
                     Group Key: t2.b, t2.a
                     ->  Seq Scan on public.eager_agg_t2 t2
                           Output: t2.a, t2.b, t2.c
         ->  Sort
               Output: t1.a, t1.b, (abs(t1.b))
               Sort Key: (abs(t1.b))
               ->  Seq Scan on public.eager_agg_t1 t1
                     Output: t1.a, t1.b, abs(t1.b)
 Settings: enable_eager_aggregate = 'on'
 Query Identifier: -734044107933323262
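On the already-unique point quoted above, a minimal illustration
(hypothetical tables, not output from the patch): if the surrogate
grouping column is unique, partial aggregation yields one group per row
and cannot shrink the join input, which is what the proposed
has_unique_index() check would detect up front.

CREATE TABLE dim (a int, b int);
CREATE TABLE fact (b int PRIMARY KEY, c double precision);

-- The only join-clause column on fact is fact.b, which is unique, so
-- grouping fact on b produces exactly one partial group per row:
-- grouped_rows equals rel->rows and the grouped paths are useless.
SELECT dim.a, avg(fact.c)
FROM dim JOIN fact ON dim.b = fact.b
GROUP BY dim.a;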
On Fri, Nov 1, 2024 at 9:42 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Nov 1, 2024 at 2:21 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > ... an aggregated row from the partial
> > aggregation matches the other side of the join if and only if each
> > row in the partial group does, thereby ensuring that all rows in the
> > same partial group have the same 'destiny'.
>
> Ah, I really like this turn of phrase! I think it's clearer and
> simpler than what I said. And I think it implies that we don't need to
> explicitly deduce surrogate grouping keys. For example if we have A
> JOIN B JOIN C JOIN D JOIN E JOIN F, grouped by columns from A, we
> don't need to work out surrogate grouping keys for B and then C and
> then D and then E and then F. We can just look at F's join clauses and
> that tells us how to group, independent of anything else.
>
> But is there any hole in that approach? I think the question is
> whether the current row could be used in some way that doesn't show up
> in the join clauses. I can't think of any way for that to happen,
> really. I believe that any outerjoin-delayed quals will show up as
> join clauses, and stuff like ORDER BY and HAVING will happen after the
> aggregation (at least logically) so it should be fine. Window
> functions and ordered aggregates may be blockers to the optimization,
> though: if the window function needs access to the unaggregated rows,
> or even just needs to know how many rows there are, then we'd better
> not aggregate them before the window function runs; and if the
> aggregate is ordered, we can only partially aggregate the data if it
> is already ordered in a way that is compatible with the final, desired
> ordering. Another case we might need to watch out for is RLS. RLS
> wants to apply all the security quals before any non-leakproof
> functions, and pushing down the aggregation might push an aggregate
> function past security quals. Perhaps there are other cases to worry
> about as well; this is all I can think of at the moment.

Yeah, ordered aggregates could be a blocker. I think it might be best
to prevent the use of eager aggregation if root->numOrderedAggs > 0 for
now.

I've been thinking about the window functions case, as Jian He also
mentioned it some time ago. It seems that the window function's
argument(s), as well as the PARTITION BY expression(s), are supposed to
appear in the GROUP BY clause or be used in an aggregate function. And
window functions are applied after the aggregation. So it seems that
there is no problem with window functions. But maybe I'm wrong.

I hadn't considered the RLS case before, but I think you're right. To
simplify things, maybe for now we can just prevent pushing down the
aggregation if the query applies some RLS policy, by checking
query->hasRowSecurity.

> But regardless of those kinds of cases, the basic idea that we want
> the partially aggregated rows to join if and only if the unaggregated
> rows would have joined seems exactly correct to me, and that provides
> theoretical justification for deriving the surrogate grouping key
> directly from the join quals. Woot!

Thank you for the confirmation!

Thanks
Richard
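For the RLS concern, a hedged sketch of the query shape at issue (the
schema and policy are hypothetical, and whether the existing
leakproofness infrastructure already prevents the bad ordering is
exactly the open question in this exchange):

CREATE TABLE accounts (mgr text, balance numeric, region_id int);
ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;
CREATE POLICY accounts_mgr ON accounts USING (mgr = current_user);
CREATE TABLE regions (region_id int, name text);

-- If a partial aggregate over accounts were ever placed below the
-- point where the security qual (mgr = current_user) is enforced, a
-- non-leakproof aggregate transition function could observe rows the
-- policy is supposed to hide.
SELECT r.name, sum(a.balance)
FROM regions r JOIN accounts a ON r.region_id = a.region_id
GROUP BY r.name;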
On Wed, Nov 6, 2024 at 3:22 AM Richard Guo <guofenglinux@gmail.com> wrote:
> Yeah, ordered aggregates could be a blocker. I think it might be best
> to prevent the use of eager aggregation if root->numOrderedAggs > 0
> for now.
>
> I've been thinking about the window functions case, as Jian He also
> mentioned it some time ago. It seems that the window function's
> argument(s), as well as the PARTITION BY expression(s), are supposed
> to appear in the GROUP BY clause or be used in an aggregate function.
> And window functions are applied after the aggregation. So it seems
> that there is no problem with window functions. But maybe I'm wrong.
>
> I hadn't considered the RLS case before, but I think you're right. To
> simplify things, maybe for now we can just prevent pushing down the
> aggregation if the query applies some RLS policy, by checking
> query->hasRowSecurity.

Particularly for the RLS case, I think we should be reluctant to
disable the optimization entirely just because there might be a
problem. We have existing infrastructure to keep security quals from
being applied too late, and I suspect it's mostly applicable to this
situation. Therefore, I suspect it might not be very much work to allow
this optimization even when RLS is in use, as long as it wouldn't
actually cause a violation of the RLS rules. If, upon investigation,
you find some reason why we can't assess accurately whether pushing
down a specific aggregate is a problem, then the approach that you
propose is reasonable, but I think the question should be investigated.
I don't like the idea of giving up on RLS-using queries completely
without even trying to figure out how difficult it would be to do the
right thing.

I have similar but weaker feelings about ordered aggregates. Consider:

explain select t1.id, array_agg(t2.v order by t3.o) from t1, t2, t3
where t1.id = t2.id and t2.id = t3.id group by 1;

We can't partially aggregate t2, but we could partially aggregate t2
join t3. So this case is a lot like:

explain select t1.id, array_agg(t2.v + t3.o) from t1, t2, t3 where
t1.id = t2.id and t2.id = t3.id group by 1;

I don't know whether the patch handles the second case correctly right
now, but that certainly seems like a case that has to work. We must be
able to determine in such a case that the partial aggregate has to be
above the t2-t3 join. And if we can determine that, then why can't
basically the same logic handle the first case? There are certainly
some differences. The first case not only needs the aggregate to be
above the t2-t3 join but also needs the input data to be sorted, so we
don't get the right behavior for ordered aggregates just by using the
contents of the ORDER BY clause to decide at what level the partial
aggregate can be applied. On the other hand, if we're looking at paths
for (T2 JOIN T3) to build paths for PartialAgg(T2 JOIN T3), the
stipulation that we need to use ordered paths or sorting doesn't make
the code very much more complicated. I'm open to the conclusion that
this is too much complexity but I'd rather not dismiss it instantly.

Regarding window functions, you've said a few times now that you don't
see the problem, but the more I think about it, the more obvious it
seems to me that there are BIG problems. Consider this example from the
documentation:

SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname)
FROM empsalary;

I get a query plan like this:

 WindowAgg  (cost=83.46..104.37 rows=1200 width=72)
   ->  Sort  (cost=83.37..86.37 rows=1200 width=40)
         Sort Key: depname
         ->  Seq Scan on empsalary  (cost=0.00..22.00 rows=1200 width=40)

Already we see warning signs here. The WindowAgg node needs the input
rows to be ordered, because it's going to average the salary for each
group of rows with the same depname. So we have the same kinds of
issues that we do for ordered aggregates, at the very least. But window
aggregates are not just ordering-sensitive. They are also empowered to
look at other rows in the frame. Consider the following example:

create table names (n text);
insert into names values ('Tom'), ('Dick'), ('Harry');
select n, lag(n, 1) over () from names;

The result is:

   n   | lag
-------+------
 Tom   |
 Dick  | Tom
 Harry | Dick

I think it is pretty obvious that if any form of partial aggregation
had been applied here, it would be impossible to correctly evaluate
lag(). Or am I missing something?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Nov 6, 2024 at 11:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Particularly for the RLS case, I think we should be reluctant to
> disable the optimization entirely just because there might be a
> problem. We have existing infrastructure to keep security quals from
> being applied too late, and I suspect it's mostly applicable to this
> situation. Therefore, I suspect it might not be very much work to
> allow this optimization even when RLS is in use, as long as it
> wouldn't actually cause a violation of the RLS rules. If, upon
> investigation, you find some reason why we can't assess accurately
> whether pushing down a specific aggregate is a problem, then the
> approach that you propose is reasonable, but I think the question
> should be investigated. I don't like the idea of giving up on
> RLS-using queries completely without even trying to figure out how
> difficult it would be to do the right thing.

That makes sense. I shouldn't be lazy and simply disable this
optimization for the RLS case. I'm not familiar with the RLS stuff but
I'll take some time to investigate it further.

> I have similar but weaker feelings about ordered aggregates. Consider:
>
> explain select t1.id, array_agg(t2.v order by t3.o) from t1, t2, t3
> where t1.id = t2.id and t2.id = t3.id group by 1;
>
> We can't partially aggregate t2, but we could partially aggregate t2
> join t3. So this case is a lot like:
>
> explain select t1.id, array_agg(t2.v + t3.o) from t1, t2, t3 where
> t1.id = t2.id and t2.id = t3.id group by 1;
>
> I don't know whether the patch handles the second case correctly right
> now, but that certainly seems like a case that has to work. We must be
> able to determine in such a case that the partial aggregate has to be
> above the t2-t3 join. And if we can determine that, then why can't
> basically the same logic handle the first case? There are certainly
> some differences. The first case not only needs the aggregate to be
> above the t2-t3 join but also needs the input data to be sorted, so we
> don't get the right behavior for ordered aggregates just by using the
> contents of the ORDER BY clause to decide at what level the partial
> aggregate can be applied. On the other hand, if we're looking at paths
> for (T2 JOIN T3) to build paths for PartialAgg(T2 JOIN T3), the
> stipulation that we need to use ordered paths or sorting doesn't make
> the code very much more complicated. I'm open to the conclusion that
> this is too much complexity but I'd rather not dismiss it instantly.

It seems to me that a partially aggregated row might need to be
combined with other partially aggregated rows after the join, if they
belong to the same t1.id group. IIUC, this implies that we cannot
perform partial aggregation on ordered input before the join, otherwise
we may get incorrect results during the final aggregation phase.

> Regarding window functions, you've said a few times now that you don't
> see the problem, but the more I think about it, the more obvious it
> seems to me that there are BIG problems. Consider this example from
> the documentation:
>
> SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname)
> FROM empsalary;
>
> I get a query plan like this:
>
>  WindowAgg  (cost=83.46..104.37 rows=1200 width=72)
>    ->  Sort  (cost=83.37..86.37 rows=1200 width=40)
>          Sort Key: depname
>          ->  Seq Scan on empsalary  (cost=0.00..22.00 rows=1200 width=40)
>
> Already we see warning signs here. The WindowAgg node needs the input
> rows to be ordered, because it's going to average the salary for each
> group of rows with the same depname. So we have the same kinds of
> issues that we do for ordered aggregates, at the very least. But
> window aggregates are not just ordering-sensitive. They are also
> empowered to look at other rows in the frame. Consider the following
> example:
>
> create table names (n text);
> insert into names values ('Tom'), ('Dick'), ('Harry');
> select n, lag(n, 1) over () from names;
>
> The result is:
>
>    n   | lag
> -------+------
>  Tom   |
>  Dick  | Tom
>  Harry | Dick
>
> I think it is pretty obvious that if any form of partial aggregation
> had been applied here, it would be impossible to correctly evaluate
> lag(). Or am I missing something?

Hmm, currently we only consider grouped aggregation for eager
aggregation. For grouped aggregation, the window function's arguments,
as well as the PARTITION BY expressions, must appear in the GROUP BY
clause. That is to say, the depname column in the first query, or the n
column in the second query, will not be aggregated into the partial
groups. Instead, they will remain as they are as input for the
WindowAgg nodes. It seems to me that this ensures that we're good with
window functions. But maybe I'm wrong.

Thanks
Richard
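A hedged illustration of the point about grouped aggregation (reusing
the empsalary example above; the rank() query is invented for
illustration): when the query has a GROUP BY, the WindowAgg consumes
already-finalized group rows, so a partial aggregate pushed below a
join never changes what the window function sees.

-- With grouped aggregation, the window function runs over one row per
-- depname group; avg(salary) is finalized before rank() is evaluated.
SELECT depname,
       avg(salary) AS avg_salary,
       rank() OVER (ORDER BY avg(salary) DESC) AS salary_rank
FROM empsalary
GROUP BY depname;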
On Sun, Nov 10, 2024 at 7:52 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > I have similar but weaker feelings about ordered aggregates.
> > Consider:
> >
> > explain select t1.id, array_agg(t2.v order by t3.o) from t1, t2, t3
> > where t1.id = t2.id and t2.id = t3.id group by 1;
>
> It seems to me that a partially aggregated row might need to be
> combined with other partially aggregated rows after the join, if they
> belong to the same t1.id group. IIUC, this implies that we cannot
> perform partial aggregation on ordered input before the join,
> otherwise we may get incorrect results during the final aggregation
> phase.

Hmm, I think you're right. I think that if the t1.id = t2.id join is
one-to-one, then it would work out fine, but that need not be the case.

> Hmm, currently we only consider grouped aggregation for eager
> aggregation. For grouped aggregation, the window function's arguments,
> as well as the PARTITION BY expressions, must appear in the GROUP BY
> clause. That is to say, the depname column in the first query, or the
> n column in the second query, will not be aggregated into the partial
> groups. Instead, they will remain as they are as input for the
> WindowAgg nodes. It seems to me that this ensures that we're good with
> window functions. But maybe I'm wrong.

Unfortunately, I don't know what you mean by grouped aggregation. I
think of grouping and aggregation as synonyms, pretty much.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Nov 12, 2024 at 1:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 10, 2024 at 7:52 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > Hmm, currently we only consider grouped aggregation for eager
> > aggregation. For grouped aggregation, the window function's
> > arguments, as well as the PARTITION BY expressions, must appear in
> > the GROUP BY clause. That is to say, the depname column in the first
> > query, or the n column in the second query, will not be aggregated
> > into the partial groups. Instead, they will remain as they are as
> > input for the WindowAgg nodes. It seems to me that this ensures that
> > we're good with window functions. But maybe I'm wrong.
>
> Unfortunately, I don't know what you mean by grouped aggregation. I
> think of grouping and aggregation as synonyms, pretty much.

Ah, sorry for the confusion. By "grouped aggregation", I mean
aggregation with a GROUP BY clause, where we produce a result row for
each group. This contrasts with plain aggregation, where there is a
single result row for the whole query.

Thanks
Richard
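For readers following along, a minimal illustration of the terminology
(the table t is hypothetical):

-- Grouped aggregation: one result row per group.
SELECT a, count(*) FROM t GROUP BY a;

-- Plain aggregation: a single result row for the whole query.
SELECT count(*) FROM t;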