Thread: Remove useless GROUP BY columns considering unique index

Remove useless GROUP BY columns considering unique index

From
Zhang Mingli
Date:
Hi, 

This idea first came from "remove_useless_groupby_columns does not need to record constraint dependencies" [0], which points out that
a unique index whose columns all have NOT NULL constraints could also do the work of a primary key when removing useless GROUP BY columns.
I studied it and implemented the idea.

Ex:

create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));

 explain (costs off) select * from t2 group by a,b,c;
      QUERY PLAN
----------------------
 HashAggregate
   Group Key: c
   ->  Seq Scan on t2

The plan drops columns a and b, since I went a little further than the original idea.

For this query, since t2 has primary key (a, b), we could already drop column c before this patch, because {a, b} is the PK.
Now that we also consider the unique index on (c) with its NOT NULL constraint, we can instead drop columns {a, b} and keep only {c}.

So we have multiple choices: group by a, b (c is removed by the PK) or group by c (a and b are removed by the unique NOT NULL index).
I implemented it to choose the one with fewer columns, so that we can drop as many columns as possible.
I think that helps the planner save some cost, e.g. sorting fewer columns.
There may be a better choice for some other reason, such as preferring to keep the PK for the planner?
I'm not sure about that, and it doesn't seem worth much extra complexity.

The NOT NULL property could also be derived from the primary key, e.g.:
create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(a));
Primary key (a, b) ensures that a is NOT NULL, and we have a unique index on (a), but it would introduce more complexity to check whether such a unique index could be used.
I also doubt it's worth doing that.
So my patch keeps it simple: check the unique index's columns; it is a valid candidate if all of them have a NOT NULL constraint.
Then we choose the best candidate, the one with the fewest columns, in get_min_unique_not_null_attnos(), the reasoning being that fewer columns mean more GROUP BY columns can be removed.

create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
-- Test primary key beats unique not null index.
explain (costs off) select * from t3 group by a,b,c,d;
      QUERY PLAN
----------------------
 HashAggregate
   Group Key: a, b
   ->  Seq Scan on t3
(3 rows)

create temp table t4 (a int, b int not null, c int not null, d int not null, primary key (a, b), unique(b, c), unique(d));
-- Test unique not null index with less columns wins.
explain (costs off) select * from t4 group by a,b,c,d;
      QUERY PLAN
----------------------
 HashAggregate
   Group Key: d
   ->  Seq Scan on t4
(3 rows)

The unique indexes may overlap with the primary key and with each other.

create temp table t5 (a int not null, b int not null, c int not null, d int not null, unique (a, b), unique(b, c), unique(a, c, d));
-- Test unique not null indices have overlap.
explain (costs off) select * from t5 group by a,b,c,d;
      QUERY PLAN
----------------------
 HashAggregate
   Group Key: a, b
   ->  Seq Scan on t5
(3 rows)


Re: Remove useless GROUP BY columns considering unique index

From
jian he
Date:
On Fri, Dec 29, 2023 at 11:05 PM Zhang Mingli <zmlpostgres@gmail.com> wrote:
>
> Hi,
>
> This idea first came from "remove_useless_groupby_columns does not need to record constraint dependencies" [0], which points out that
> a unique index whose columns all have NOT NULL constraints could also do the work of a primary key when removing useless GROUP BY columns.
> I studied it and implemented the idea.
>
>
[0]https://www.postgresql.org/message-id/flat/CAApHDvrdYa%3DVhOoMe4ZZjZ-G4ALnD-xuAeUNCRTL%2BPYMVN8OnQ%40mail.gmail.com
>

cosmetic issues:
+--
+-- Test removal of redundant GROUP BY columns using unique not null index.
+-- materialized view
+--
+create temp table t1 (a int, b int, c int, primary key (a, b), unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
+create temp table t4 (a int, b int not null, c int not null, d int not null, primary key (a, b), unique(b, c), unique(d));
+create temp table t5 (a int not null, b int not null, c int not null, d int not null, unique (a, b), unique(b, c), unique(a, c, d));
+
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+
+-- Test unique not null index beats primary key.
+explain (costs off) select * from t2 group by a,b,c;
+
+-- Test primary key beats unique not null index.
+explain (costs off) select * from t3 group by a,b,c,d;
+
+-- Test unique not null index with less columns wins.
+explain (costs off) select * from t4 group by a,b,c,d;
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t5 group by a,b,c,d;
+
+drop table t1;
+drop table t2;
+drop table t3;
+drop table t4;
+

the `-- materialized view` comment line is unnecessary?
you forgot to drop table t5.

create temp table p_t2 (
  a int not null,
  b int not null,
  c int not null,
  d int,
  unique(a), unique(a,b),unique(a,b,c)
) partition by list(a);
create temp table p_t2_1 partition of p_t2 for values in(1);
create temp table p_t2_2 partition of p_t2 for values in(2);
explain (costs off, verbose) select * from p_t2 group by a,b,c,d;
your patch output:
                          QUERY PLAN
--------------------------------------------------------------
 HashAggregate
   Output: p_t2.a, p_t2.b, p_t2.c, p_t2.d
   Group Key: p_t2.a
   ->  Append
         ->  Seq Scan on pg_temp.p_t2_1
               Output: p_t2_1.a, p_t2_1.b, p_t2_1.c, p_t2_1.d
         ->  Seq Scan on pg_temp.p_t2_2
               Output: p_t2_2.a, p_t2_2.b, p_t2_2.c, p_t2_2.d
(8 rows)

so it seems to work with partitioned tables; maybe you should add some
test cases with partitioned tables.


- * XXX This is only used to create derived tables, so NO INHERIT constraints
- * are always skipped.
+ * XXX When used to create derived tables, set skip_no_inherit to true,
+ * so that NO INHERIT constraints will be skipped.
  */
 List *
-RelationGetNotNullConstraints(Oid relid, bool cooked)
+RelationGetNotNullConstraints(Oid relid, bool cooked, bool skip_no_inherit)
 {
  List   *notnulls = NIL;
  Relation constrRel;
@@ -815,7 +815,7 @@ RelationGetNotNullConstraints(Oid relid, bool cooked)

  if (conForm->contype != CONSTRAINT_NOTNULL)
  continue;
- if (conForm->connoinherit)
+ if (skip_no_inherit && conForm->connoinherit)
  continue;

I don't think you need to refactor RelationGetNotNullConstraints.
However, I found it hard to explain (which one is the parent and which
one is the child is very confusing to me),
so I used the following script to test it:
DROP TABLE ATACC1, ATACC2;
CREATE TABLE ATACC1 (a int);
CREATE TABLE ATACC2 (b int,c int, unique(c)) INHERITS (ATACC1);
ALTER TABLE ATACC1 ADD NOT NULL a;
-- ALTER TABLE ATACC1 ADD NOT NULL a NO INHERIT;
ALTER TABLE ATACC2 ADD unique(a);
explain (costs off, verbose) select * from ATACC2 group by a,b,c;
-------------------------

create table t_0_3 (a int not null, b int not null, c int not null, d
int not null, unique (a, b), unique(b, c) DEFERRABLE, unique(d));
explain (costs off, verbose) select * from t_0_3 group by a,b,c,d;
           QUERY PLAN
--------------------------------
 HashAggregate
   Output: a, b, c, d
   Group Key: t_0_3.a, t_0_3.b
   ->  Seq Scan on public.t_0_3
         Output: a, b, c, d
(5 rows)

The above part doesn't seem right; it should be `Group Key: t_0_3.d`,
so I changed it to:

+ /* Skip constraints that are not UNIQUE */
+ if (con->contype != CONSTRAINT_UNIQUE)
+ continue;
+
+ /* Skip unique constraints that are condeferred */
+ if (con->condeferrable && con->condeferred)
+ continue;
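
The reason to skip those: while the unique check is deferred, duplicate
values can be visible to queries inside the same transaction, so the
constraint cannot prove uniqueness for GROUP BY removal. A quick
illustration of the initially-deferred case (not from the patch's tests):

create table d_t (d int unique deferrable initially deferred);
begin;
insert into d_t values (1), (1);         -- no error yet; the check runs at commit
select d, count(*) from d_t group by d;  -- sees two rows with d = 1
rollback;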

I guess you have probably noticed that in
remove_useless_groupby_columns,
get_primary_key_attnos is followed by get_min_unique_not_null_attnos.
Also, the main code of get_min_unique_not_null_attnos is very similar
to get_primary_key_attnos.

They both do `pg_constraint = table_open(ConstraintRelationId,
AccessShareLock);`,
so you might need to explain why we must open pg_constraint twice,
either that it's cheap to do or for another reason.

If scanning twice is expensive, we may need to refactor get_primary_key_attnos.
get_primary_key_attnos is only called from check_functional_grouping
and remove_useless_groupby_columns.

I added the patch to commitfest:
https://commitfest.postgresql.org/46/4751/



Re: Remove useless GROUP BY columns considering unique index

From
David Rowley
Date:
On Sat, 30 Dec 2023 at 04:05, Zhang Mingli <zmlpostgres@gmail.com> wrote:
> So my patch keeps it simple: check the unique index's columns; it is a valid candidate if all of them have a NOT NULL constraint.
> Then we choose the best candidate, the one with the fewest columns, in get_min_unique_not_null_attnos(), the reasoning being that fewer columns mean more GROUP BY columns can be removed.

This patch no longer applies.  We no longer catalogue NOT NULL
constraints, which this patch is coded to rely upon. (Likely it could
just look at pg_attribute.attnotnull instead)
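
For reference, the not-null property is visible in pg_attribute, e.g. for
the t2 example upthread (the planner would read it from the relation's
tuple descriptor rather than query the catalog, but it's the same
information):

select attname, attnotnull
from pg_attribute
where attrelid = 't2'::regclass
  and attnum > 0
  and not attisdropped;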

Aside from that, I'm not sure about the logic to prefer removing
functionally dependent columns using the constraint with the fewest
columns.  If you had the PRIMARY KEY on two int columns and a unique
index on a 1MB varlena column, I think using the primary key would be
a better choice for removing functionally dependent columns.  Maybe
it's ok to remove any functionally dependent columns using the primary
key first and then try removing functionally dependent columns using
unique constraints as a 2nd step (or maybe only if none can be removed
using the PK?)
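
For instance, with a made-up schema like this, the "fewest columns" rule
would pick the single wide column over the two narrow ones:

create table wide_t (
    id1 int,
    id2 int,
    payload text not null,              -- potentially very wide values
    primary key (id1, id2),
    unique (payload)
);
-- unique(payload) has 1 column vs the PK's 2, but hashing or sorting the
-- long text values is likely more expensive than two int4 columns
explain (costs off) select * from wide_t group by id1, id2, payload;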

Also, why constraints rather than unique indexes?  You'd need to check
for expression indexes and likely ignore those due to no ability to
detect NOT NULL for expressions.

Also, looking at how you've coded get_min_unique_not_null_attnos() in
the patch, it looks like you're very optimistic about that constraint
having columns mentioned in the GROUP BY clause. It looks like it won't
work if the UNIQUE constraint with the fewest columns gets no mention
in the GROUP BY clause. That'll cause performance regressions from what
we have today when the primary key is mentioned and columns can be
removed using it, but a UNIQUE constraint exists which has no
corresponding GROUP BY columns and fewer unique columns than the PK.
That's not good and it'll need to be fixed.
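
For example (hypothetical table, just to spell the regression case out):

create table reg_t (a int, b int, c int, u int not null,
                    primary key (a, b), unique (u));
-- today: the PK lets GROUP BY a, b, c be reduced to GROUP BY a, b.
-- Preferring unique(u) because it has fewer columns removes nothing,
-- since u isn't mentioned in the GROUP BY at all.
explain (costs off) select a, b, c from reg_t group by a, b, c;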

Set to waiting on author.

David



Re: Remove useless GROUP BY columns considering unique index

From
Peter Eisentraut
Date:
On 12.09.24 03:43, David Rowley wrote:
> On Sat, 30 Dec 2023 at 04:05, Zhang Mingli <zmlpostgres@gmail.com> wrote:
>> So my patch keeps it simple: check the unique index's columns; it is a valid candidate if all of them have a NOT NULL constraint.
>> Then we choose the best candidate, the one with the fewest columns, in get_min_unique_not_null_attnos(), the reasoning being that fewer columns mean more GROUP BY columns can be removed.
> 
> This patch no longer applies.  We no longer catalogue NOT NULL
> constraints, which this patch is coded to rely upon.

Work is ongoing to revive the patch that catalogs not-null constraints: 
<https://commitfest.postgresql.org/49/5224/>.  This patch should 
probably wait for that patch at the moment.

> (Likely it could just look at pg_attribute.attnotnull instead)

That won't work because you can't record dependencies on that.  (This is 
one of the reasons for cataloging not-null constraints as real constraints.)




Re: Remove useless GROUP BY columns considering unique index

From
David Rowley
Date:
On Wed, 18 Sept 2024 at 19:28, Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 12.09.24 03:43, David Rowley wrote:
> > (Likely it could just look at pg_attribute.attnotnull instead)
>
> That won't work because you can't record dependencies on that.  (This is
> one of the reasons for cataloging not-null constraints as real constraints.)

I'm not seeing any need to record constraint dependencies for this
optimisation. It would be different for detecting functional
dependencies in a view using a unique constraint+not null constraints
for ungrouped columns, but that's not what this is. This is just a
planner optimisation. The plan can be invalidated by a relcache
invalidation, which will happen if someone does ALTER TABLE DROP NOT
NULL.
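
A minimal sketch of that, assuming the optimisation is in place:

create table inval_t (a int, b int, c int not null,
                      primary key (a, b), unique (c));
prepare q as select * from inval_t group by a, b, c;
-- a cached plan for q may rely on unique(c) plus NOT NULL to group by c only
alter table inval_t alter column c drop not null;  -- sends a relcache invalidation
execute q;  -- the cached plan is discarded and the query is replanned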

For reference, see 5b736e9cf.

David



Re: Remove useless GROUP BY columns considering unique index

From
Zhang Mingli
Date:
Hi all,

I haven't paid attention to this topic in a long time. Thanks, everyone, for the advice; I will study it and then post an update.

Thanks again.


Zhang Mingli
www.hashdata.xyz

Re: Remove useless GROUP BY columns considering unique index

From
David Rowley
Date:
On Wed, 27 Nov 2024 at 19:51, jian he <jian.universality@gmail.com> wrote:
> v3 attached. in v3:

1. I find the following quite strange:

postgres=# create table abcd (a int primary key, b int not null, c int
not null, d int not null, unique(b));
CREATE TABLE
postgres=# explain select b,c from abcd group by b,c;
                          QUERY PLAN
--------------------------------------------------------------
 HashAggregate  (cost=37.75..56.25 rows=1850 width=8)
   Group Key: b, c
   ->  Seq Scan on abcd  (cost=0.00..28.50 rows=1850 width=8)
(3 rows)


postgres=# alter table abcd drop constraint abcd_pkey;
ALTER TABLE
postgres=# explain select b,c from abcd group by b,c;
                          QUERY PLAN
--------------------------------------------------------------
 HashAggregate  (cost=33.12..51.62 rows=1850 width=8)
   Group Key: b
   ->  Seq Scan on abcd  (cost=0.00..28.50 rows=1850 width=8)
(3 rows)

Why does the patch only try to remove GROUP BY columns using unique
indexes when there's no primary key?

I don't really see why the primary key should be treated specially. Why
doesn't the patch just unify the logic and collect up all unique,
non-expression, non-partial indexes where all columns are NOT NULL?
You could maybe just add a special case to skip the NOT NULL checking
for indexes with indisprimary set.
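
In catalog terms, the filter would be roughly the following (a SQL-level
sketch only; the actual check would use the index metadata the planner
already has, not a catalog query):

select i.indexrelid::regclass as index_name
from pg_index i
where i.indrelid = 'abcd'::regclass
  and i.indisunique
  and i.indimmediate                 -- not deferrable
  and i.indisvalid
  and i.indpred is null              -- not partial
  and i.indexprs is null             -- no expression columns
  and (i.indisprimary                -- PK columns are implicitly NOT NULL
       or not exists (select 1
                      from pg_attribute a
                      where a.attrelid = i.indrelid
                        and a.attnum = any (i.indkey)
                        and not a.attnotnull));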

2. In get_unique_not_null_attnos(), not_null_attnos could be a
Bitmapset with attnums offset by FirstLowInvalidHeapAttributeNumber.
That'll allow you to do a bms_is_member() call rather than a (possibly
expensive) list_member_int() call.

3. The logic in remove_useless_groupby_columns() looks much more
complex than it needs to be. It would be much more simple to find the
matching unique index with the least number of columns and use that
one. Just have a counter to record the bms_num_members() of the
columns of the best match so far and replace it when you find an index
with fewer columns. Please see adjust_group_pathkeys_for_groupagg()
for inspiration.

We also need to think about if the unique index with the least columns
is always the best one to use. Perhaps the index with the least
columns is a varlena column and there's some index with, say, two
INT4s. It would be much cheaper to hash a couple of INT4s than some
long varlena column. Maybe it's ok just to leave an XXX comment
explaining that we might want to think about doing that one day.
Alternatively, summing up the pg_statistic.stawidth values could work.
Likely it's better to make the patch work correctly before thinking
about that part.
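
For reference, the per-column width estimates that would be summed are the
ones ANALYZE collects; they're visible as pg_stats.avg_width:

select attname, avg_width
from pg_stats
where tablename = 'abcd';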

The patch could also use a spell check:

"a input" (an input)
"soure" source??
"GROUOP BY"

David



Re: Remove useless GROUP BY columns considering unique index

From
Andrei Lepikhov
Date:
On 11/29/24 09:39, David Rowley wrote:
> On Fri, 29 Nov 2024 at 15:02, David Rowley <dgrowleyml@gmail.com> wrote:
>> I've attached an updated patch that gets rid of the
>> get_unique_not_null_attnos() function completely and uses the
>> RelOptInfo.indexlist and RelOptInfo.notnullattnums.
> 
> I forgot to do a local commit before sending v8. Fixed in the attached v9.
1. Thread reference in the patch comment doesn't work.
2. Would it be possible to move remove_useless_groupby_columns to the
second commit? It would be easier to review the differences.

-- 
regards, Andrei Lepikhov



Re: Remove useless GROUP BY columns considering unique index

From
David Rowley
Date:
On Fri, 29 Nov 2024 at 20:04, jian he <jian.universality@gmail.com> wrote:
> I found that a unique expression index can also be used for GROUP BY
> column removal.
> so I implemented it, and also added two tests for it.

> what do you think?

I think it's likely just not common enough to be worthwhile, plus,
unfortunately, I don't think this optimisation is possible with what
we have today. Also, if it were possible you'd need to check the GROUP
BY expressions match the indexed expressions. You've not done that.

The reason I don't think it's possible is that we have no infrastructure
that allows us to tag functions or operators so that non-null input(s)
mean non-null outputs.  We only have strict, which means null input
means null output. That's the opposite of what we'd need. It might
only be possible with a NULLS NOT DISTINCT index.
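
A made-up case to illustrate: even with a NOT NULL input column, the
indexed expression can still produce NULLs, and a unique index doesn't
de-duplicate NULLs by default, so it proves nothing about the grouping:

create table expr_t (a int not null);
create unique index expr_t_idx on expr_t (nullif(a, 0));
insert into expr_t values (0), (0);  -- both index entries are NULL, so no conflict
-- the indexed expression is clearly not unique across rows here;
-- with NULLS NOT DISTINCT the second row would have been rejected,
-- which is why that case might be different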

David



Re: Remove useless GROUP BY columns considering unique index

From
Andrei Lepikhov
Date:
On 29/11/2024 14:28, David Rowley wrote:
> On Fri, 29 Nov 2024 at 19:36, Andrei Lepikhov <lepihov@gmail.com> wrote:
>> 1. Thread reference in the patch comment doesn't work.
>> 2. Would it be possible to move remove_useless_groupby_columns to the
>> second commit? It would be easier to review the differences.
> 
> For #1, I rewrote the commit message.
> 
> I've split into two patches for ease of review. See attached.
Thanks!
Patch 0001 seems OK. The differentiation between 'planmain' and 'initsplan'
is not clear to me, but the code movement seems correct. I should say that
the comment in initsplan.c:
'... initialization routines'
looks strange. Is this module about initialisation now?

Patch 0002 looks helpful and performant. I propose to check 'relid > 0' 
to avoid diving into 'foreach(lc, parse->rtable)' at all if nothing has 
been found.
It may restrict future optimisations like [1], where the upper query 
block can employ the 'uniqueness' of grouped clauses in selectivity 
estimations. But I think it is OK for now.

NOTES:
1. Uniqueness is proved by a UNIQUE index. It might be a bit more
sophisticated to probe not only the grouping qual but also to employ
equivalence classes. In that case, we could process something like this:

CREATE TABLE test (
   x int NOT NULL, y int NOT NULL, z int NOT NULL, w int);
CREATE UNIQUE INDEX ON test(y,z);
SELECT t2.z FROM test t2, test t1 WHERE t1.y=t2.y
GROUP BY t1.y,t2.z,t2.w;

2. The same idea could be used with the DISTINCT statement. Right now we have:
SELECT DISTINCT y,z,w FROM test;

  HashAggregate
    Output: y, z, w
    Group Key: test.y, test.z, test.w
    ->  Seq Scan on public.test
          Output: x, y, z, w

[1] 
https://www.postgresql.org/message-id/flat/50fe6779-ee2d-4256-bc64-cd661bc4029a%40gmail.com

-- 
regards, Andrei Lepikhov




Re: Remove useless GROUP BY columns considering unique index

From
jian he
Date:
On Fri, Nov 29, 2024 at 5:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> The reason I don't think it's possible is that we have no infrastructure
> that allows us to tag functions or operators so that non-null input(s)
> mean non-null outputs.  We only have strict, which means null input
> means null output. That's the opposite of what we'd need. It might
> only be possible with a NULLS NOT DISTINCT index.
>
Thank you for pointing out "non-null input(s) mean non-null outputs".
I hadn't thought about that at all.


Regarding v10:
placing remove_useless_groupby_columns right after add_base_rels_to_query
makes so much sense, since we can then safely use the cached
RelOptInfo->indexlist and RelOptInfo->notnullattnums.
Overall I didn't find any issues.



Re: Remove useless GROUP BY columns considering unique index

From
David Rowley
Date:
On Mon, 2 Dec 2024 at 17:22, jian he <jian.universality@gmail.com> wrote:
> Regarding v10:
> placing remove_useless_groupby_columns right after add_base_rels_to_query
> makes so much sense, since we can then safely use the cached
> RelOptInfo->indexlist and RelOptInfo->notnullattnums.
> Overall I didn't find any issues.

Thanks for looking.

After a bit more adjustment, I've pushed both patches.

I felt it was best to have the commit that moved
remove_useless_groupby_columns also change the call site of that
function.

David



Re: Remove useless GROUP BY columns considering unique index

From
David Rowley
Date:
On Mon, 2 Dec 2024 at 17:18, Andrei Lepikhov <lepihov@gmail.com> wrote:
> Patch 0002 looks helpful and performant. I propose to check 'relid > 0'
> to avoid diving into 'foreach(lc, parse->rtable)' at all if nothing has
> been found.

I did end up adding another fast path there, but I felt like checking
relid > 0 wasn't as good as it could be as that would have only
short-circuited when we don't see any Vars of level 0 in the GROUP BY.
It seemed cheap enough to short-circuit when none of the relations
mentioned in the GROUP BY have multiple columns mentioned.

> NOTES:
> 1. Uniqueness is proved by a UNIQUE index. It might be a bit more
> sophisticated to probe not only grouping qual but also employ
> equivalence classes. In that case, we could process something like that:
>
> CREATE TABLE test (
>    x int NOT NULL, y int NOT NULL, z int NOT NULL, w int);
> CREATE UNIQUE INDEX ON test(y,z);
> SELECT t2.z FROM test t2, test t1 WHERE t1.y=t2.y
> GROUP BY t1.y,t2.z,t2.w;

It might be worth doing something like that. It looks like we could
delay remove_useless_groupby_columns() until standard_qp_callback
anyway. Further modifications to the GROUP BY can occur there. It
might make sense to replace the call to
make_pathkeys_for_sortclauses_extended() in standard_qp_callback()
with a version of remove_useless_groupby_columns() which does both
tasks plus the one you mention.

However, I don't know what the logic would be exactly for the case you
mention, as it seems possible that, if we start swapping columns out
for another EquivalenceMember, we might cause some regression. For
example, if you had:

create table t1(a int, b int, primary key(a,b));
create table t2(x int, y int);

select ... from t1 inner join t2 on t1.a=t2.x and t1.b = t2.y group by
t2.x,t1.b;

then how do you decide if the GROUP BY should become t1.a,t1.b or
t2.x,t2.y? It's not clear to me that using t1's columns is always
better than using t2's. I imagine using a mix is never better, but I'm
unsure how you'd decide which ones to use.

David



Re: Remove useless GROUP BY columns considering unique index

From
Andrei Lepikhov
Date:
On 12/12/24 10:09, David Rowley wrote:
> On Mon, 2 Dec 2024 at 17:18, Andrei Lepikhov <lepihov@gmail.com> wrote:
>> Patch 0002 looks helpful and performant. I propose to check 'relid > 0'
>> to avoid diving into 'foreach(lc, parse->rtable)' at all if nothing has
>> been found.
> 
> I did end up adding another fast path there, but I felt like checking
> relid > 0 wasn't as good as it could be as that would have only
> short-circuited when we don't see any Vars of level 0 in the GROUP BY.
> It seemed cheap enough to short-circuit when none of the relations
> mentioned in the GROUP BY have multiple columns mentioned.
Your solution seems much better than my proposal. Thanks for applying it!

> then how do you decide if the GROUP BY should become t1.a,t1.b or
> t2.x,t2.y? It's not clear to me that using t1's columns is always
> better than using t2's. I imagine using a mix is never better, but I'm
> unsure how you'd decide which ones to use.
It depends on how you calculate that 'better'. Right now, GROUP BY employs
two strategies to reduce path cost: 1) matching the ORDER BY clause (to
avoid a final sort); 2) fitting the incoming subtree's pathkeys (to avoid
presorting before grouping).
My idea is close to [1], where the cost depends on the estimated number of
groups in the first grouping column, because cost_sort predicts the number
of comparison operator calls based on statistics. In this case, the choice
between (x,y) and (a,b) would depend on the ndistinct of 'x' and 'a'.
In general, this is an idea for debate, more for further development than
for the patch in this thread.
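
For illustration (using the t1/t2 example above), the ndistinct estimates
that choice would be based on are already available from ANALYZE:

select tablename, attname, n_distinct
from pg_stats
where tablename in ('t1', 't2') and attname in ('a', 'x');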

[1] Consider the number of columns in the sort cost model
https://www.postgresql.org/message-id/flat/8742aaa8-9519-4a1f-91bd-364aec65f5cf%40gmail.com

-- 
regards, Andrei Lepikhov