Re: POC: GROUP BY optimization - Mailing list pgsql-hackers
From | Richard Guo |
---|---|
Subject | Re: POC: GROUP BY optimization |
Date | |
Msg-id | CAMbWs4-7oUGax4qWUwdiuyPKYKDO-DtKrcPedmjbPZyC-uYYrg@mail.gmail.com Whole thread Raw |
In response to | Re: POC: GROUP BY optimization (Alexander Korotkov <aekorotkov@gmail.com>) |
Responses |
Re: POC: GROUP BY optimization
|
List | pgsql-hackers |
On Mon, Jan 15, 2024 at 3:56 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Mon, Jan 15, 2024 at 8:42 AM Richard Guo <guofenglinux@gmail.com> wrote:
> On Mon, Jan 15, 2024 at 8:20 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>>
>> Thank you for providing the test case relevant for this code change.
>> The revised patch incorporating this change is attached. Now the
>> patchset looks good to me. I'm going to push it if there are no
>> objections.
>
> Seems I'm late for the party. Can we hold for several more days? I'd
> like to have a review on this patch.
Sure, NP. I'll hold it till your review.
Thanks. Appreciate that.
I have briefly reviewed this patch and here are some comments.
* When trying to match the ordering of GROUP BY to that of ORDER BY in
get_useful_group_keys_orderings, this patch checks against the length of
path->pathkeys. This does not make sense. I guess this is a typo and
the real intention is to check against root->sort_pathkeys.
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -504,7 +504,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path)
root->num_groupby_pathkeys);
if (n > 0 &&
- (enable_incremental_sort || n == list_length(path->pathkeys)))
+ (enable_incremental_sort || n == list_length(root->sort_pathkeys)))
However, I think this is also incorrect. When incremental sort is
disabled, it is reasonable to consider reordering the GROUP BY keys only
if the number of matching pathkeys is equal to the length of
root->group_pathkeys i.e. if 'n == list_length(root->group_pathkeys)'.
* When the original ordering of GROUP BY keys matches the path's
pathkeys or ORDER BY keys, get_useful_group_keys_orderings would
generate duplicate PathKeyInfos. Consequently, this duplication would
lead to the creation and addition of identical paths multiple times.
This is not great. Consider the query below:
create table t (a int, b int);
create index on t (a, b);
set enable_hashagg to off;
explain select count(*) from t group by a, b;
QUERY PLAN
----------------------------------------------------------------------------------
GroupAggregate (cost=0.15..97.27 rows=226 width=16)
Group Key: a, b
-> Index Only Scan using t_a_b_idx on t (cost=0.15..78.06 rows=2260 width=8)
(3 rows)
The same path with group keys {a, b} is created and added twice.
* Part of the work performed in this patch overlaps with that of
preprocess_groupclause. They are both trying to adjust the ordering of
the GROUP BY keys to match ORDER BY. I wonder if it would be better to
perform this work only once.
* When reordering the GROUP BY keys, I think the following checks need
some improvements.
+ /*
+ * Since 1349d27 pathkey coming from underlying node can be in the
+ * root->group_pathkeys but not in the processed_groupClause. So, we
+ * should be careful here.
+ */
+ sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref,
+ *group_clauses);
+ if (!sgc)
+ /* The grouping clause does not cover this pathkey */
+ break;
I think we need to check or assert 'pathkey->pk_eclass->ec_sortref' is
valid before calling get_sortgroupref_clause_noerr with it. Also, how
can we guarantee that the returned GROUP BY item is sortable? Should we
check that with OidIsValid(sgc->sortop)?
Thanks
Richard
* When trying to match the ordering of GROUP BY to that of ORDER BY in
get_useful_group_keys_orderings, this patch checks against the length of
path->pathkeys. This does not make sense. I guess this is a typo and
the real intention is to check against root->sort_pathkeys.
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -504,7 +504,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path)
root->num_groupby_pathkeys);
if (n > 0 &&
- (enable_incremental_sort || n == list_length(path->pathkeys)))
+ (enable_incremental_sort || n == list_length(root->sort_pathkeys)))
However, I think this is also incorrect. When incremental sort is
disabled, it is reasonable to consider reordering the GROUP BY keys only
if the number of matching pathkeys is equal to the length of
root->group_pathkeys i.e. if 'n == list_length(root->group_pathkeys)'.
* When the original ordering of GROUP BY keys matches the path's
pathkeys or ORDER BY keys, get_useful_group_keys_orderings would
generate duplicate PathKeyInfos. Consequently, this duplication would
lead to the creation and addition of identical paths multiple times.
This is not great. Consider the query below:
create table t (a int, b int);
create index on t (a, b);
set enable_hashagg to off;
explain select count(*) from t group by a, b;
QUERY PLAN
----------------------------------------------------------------------------------
GroupAggregate (cost=0.15..97.27 rows=226 width=16)
Group Key: a, b
-> Index Only Scan using t_a_b_idx on t (cost=0.15..78.06 rows=2260 width=8)
(3 rows)
The same path with group keys {a, b} is created and added twice.
* Part of the work performed in this patch overlaps with that of
preprocess_groupclause. They are both trying to adjust the ordering of
the GROUP BY keys to match ORDER BY. I wonder if it would be better to
perform this work only once.
* When reordering the GROUP BY keys, I think the following checks need
some improvements.
+ /*
+ * Since 1349d27 pathkey coming from underlying node can be in the
+ * root->group_pathkeys but not in the processed_groupClause. So, we
+ * should be careful here.
+ */
+ sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref,
+ *group_clauses);
+ if (!sgc)
+ /* The grouping clause does not cover this pathkey */
+ break;
I think we need to check or assert 'pathkey->pk_eclass->ec_sortref' is
valid before calling get_sortgroupref_clause_noerr with it. Also, how
can we guarantee that the returned GROUP BY item is sortable? Should we
check that with OidIsValid(sgc->sortop)?
Thanks
Richard
pgsql-hackers by date: