Thread: remove_useless_groupby_columns is too enthusiastic

remove_useless_groupby_columns is too enthusiastic

From

Tom Lane

Date:

12 July 2022, 20:31:13

I looked into the complaint at [1] about the planner being much
stupider when one side of a JOIN USING is referenced than the other
side.  It seemed to me that that shouldn't be happening, because
the relevant decisions are made on the basis of EquivalenceClasses
and both USING columns should be in the same EquivalenceClass.
I was right about that ... but the damage is being done far upstream
of any EquivalenceClass work.  It turns out that the core of the
issue is that the query looks like

SELECT ... t1 JOIN t2 USING (x)
  GROUP BY x, t2.othercol ORDER BY t2.othercol LIMIT n

In the "okay" case, we resolve "GROUP BY x" as GROUP BY t1.x.
Later on, we are able to realize that ordering by t1.x is
equivalent to ordering by t2.x (because same EquivalenceClass),
and that it's equally good to consider the GROUP BY items in
either order, and that ordering by t2.othercol, t2.x would
allow us to perform the grouping using a GroupAggregate on
data that's already sorted to meet the ORDER BY requirement.
Since there happens to be an index on t2.othercol, t2.x,
we can implement the query with no explicit sort, which wins
big thanks to the small LIMIT.

In the not-okay case, we resolve "GROUP BY x" as GROUP BY t2.x.
Then remove_useless_groupby_columns notices that x is t2's
primary key, so it figures that grouping by t2.othercol is
redundant and throws away that element of GROUP BY.  Now there
is no apparent connection between the GROUP BY and ORDER BY
lists, defeating the optimizations that would lead to a good
choice of plan --- in fact, we conclude early on that that
index's sort ordering is entirely useless to the query :-(

I tried the attached quick-hack patch that just prevents
remove_useless_groupby_columns from removing anything that
appears in ORDER BY.  That successfully fixes the complained-of
case, and it doesn't change any existing regression test results.
I'm not sure whether there are any cases that it makes significantly
worse.

(I also kind of wonder if the fundamental problem here is that
remove_useless_groupby_columns is being done at the wrong time,
and we ought to do it later when we have more semantic info.
But I'm not volunteering to rewrite it completely.)

Anyway, remove_useless_groupby_columns has been there since 9.6
and we've not previously seen reports of cases that it makes worse,
so this seems like a corner-case problem.  Hence I wouldn't risk
back-patching this change.  It seems worth considering for HEAD
though, so I'll stick it in the September CF.

            regards, tom lane

[1] https://www.postgresql.org/message-id/17544-ebd06b00b8836a04%40postgresql.org

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 06ad856eac..b2716abcc7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2733,12 +2733,15 @@ remove_useless_groupby_columns(PlannerInfo *root)

             /*
              * New list must include non-Vars, outer Vars, and anything not
-             * marked as surplus.
+             * marked as surplus.  In addition, keep anything that appears in
+             * the ORDER BY clause, because otherwise we may falsely make it
+             * look like the GROUP BY and ORDER BY clauses are incompatible.
              */
             if (!IsA(var, Var) ||
                 var->varlevelsup > 0 ||
                 !bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber,
-                               surplusvars[var->varno]))
+                               surplusvars[var->varno]) ||
+                list_member(parse->sortClause, sgc))
                 new_groupby = lappend(new_groupby, sgc);
         }

Re: remove_useless_groupby_columns is too enthusiastic

From

David Rowley

Date:

13 July 2022, 07:06:51

On Wed, 13 Jul 2022 at 05:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I tried the attached quick-hack patch that just prevents
> remove_useless_groupby_columns from removing anything that
> appears in ORDER BY.  That successfully fixes the complained-of
> case, and it doesn't change any existing regression test results.
> I'm not sure whether there are any cases that it makes significantly
> worse.

In addition to this, we also do a pre-verification step to see if the
ORDER BY has anything in it that the GROUP BY does not?

I don't think there's any harm in removing the GROUP BY item if the
pre-check finds something extra in ORDER BY since
preprocess_groupclause() is going to fail in that case anyway.

David

Re: remove_useless_groupby_columns is too enthusiastic

From

Richard Guo

Date:

14 July 2022, 12:57:15

On Wed, Jul 13, 2022 at 1:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I tried the attached quick-hack patch that just prevents
remove_useless_groupby_columns from removing anything that
appears in ORDER BY. That successfully fixes the complained-of
case, and it doesn't change any existing regression test results.
I'm not sure whether there are any cases that it makes significantly
worse.

If there happens to be an index on t2.othercol, t2.x, the patch would
definitely win since we can perform a GroupAggregate with no explicit
sort. If there is no such index, considering the redundant sorting work
due to the excess columns, do we still win?

I'm testing with the query below:

create table t (a int primary key, b int, c int);
insert into t select i, i%1000, i from generate_series(1,1000000)i;
analyze t;
create index t_b_a_idx on t (b, a);

select sum(c) from t group by a, b order by b limit 10;

If we have index 't_b_a_idx', there would be big performance
improvement. Without the index, I can see some performance drop (I'm not

using parallel query mechanism).

(I also kind of wonder if the fundamental problem here is that
remove_useless_groupby_columns is being done at the wrong time,
and we ought to do it later when we have more semantic info.

Concur with that.

Thanks
Richard

Re: remove_useless_groupby_columns is too enthusiastic

From

David Rowley

Date:

16 September 2022, 03:22:08

On Wed, 13 Jul 2022 at 05:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I tried the attached quick-hack patch that just prevents
> remove_useless_groupby_columns from removing anything that
> appears in ORDER BY.  That successfully fixes the complained-of
> case, and it doesn't change any existing regression test results.
> I'm not sure whether there are any cases that it makes significantly
> worse.

What I am concerned about with this patch is that for Hash Aggregate,
we'll always want to remove the useless group by clauses to minimise
the amount of hashing and equal comparisons that we must do. So the
patch makes that case worse in favour of possibly making Group
Aggregate cases better. I don't think that's going to be a great
trade-off as Hash Aggregate is probably more commonly used than Group
Aggregate, especially so when the number of rows in each group is
large and there is no LIMIT clause to favour a cheap startup plan.

Maybe the fix for this should be:

1. Add a new bool "isredundant_groupby" field to SortGroupClause,
2. Rename remove_useless_groupby_columns() to
mark_redundant_groupby_columns() and have it set the
isredundant_groupby instead of removing items from the list,
3. Adjust get_useful_group_keys_orderings() so that it returns
additional PathKeyInfo with the isredundant_groupby items removed,
4. Adjust the code in add_paths_to_grouping_rel() so that it always
uses the minimal set of SortGroupClauses for Hash Aggregate paths,

Perhaps a valid concern with the above is all the additional Paths
we'd consider if we did #3. But maybe that's not so bad as that's not
a multiplicate problem like it would be adding additional Paths to
base and joinrels.

We'd probably still want to keep preprocess_groupclause() as
get_useful_group_keys_orderings() is not exhaustive in its search.

David

Re: remove_useless_groupby_columns is too enthusiastic

From

Michael Paquier

Date:

12 October 2022, 08:42:02

On Fri, Sep 16, 2022 at 12:22:08PM +1200, David Rowley wrote:
> We'd probably still want to keep preprocess_groupclause() as
> get_useful_group_keys_orderings() is not exhaustive in its search.

This thread has been idle for a few weeks now with a review posted, so
I have marked the patch as RwF.
--
Michael

Attachment

signature.asc