Re: Remove useless GROUP BY columns considering unique index - Mailing list pgsql-hackers

From David Rowley
Subject Re: Remove useless GROUP BY columns considering unique index
Date
Msg-id CAApHDvpbZ2Njs=m-8+rnEr_UjyAyECVH__kjab8SqRcGA0S=Ew@mail.gmail.com
Whole thread Raw
In response to Remove useless GROUP BY columns considering unique index  (Zhang Mingli <zmlpostgres@gmail.com>)
List pgsql-hackers
On Wed, 27 Nov 2024 at 19:51, jian he <jian.universality@gmail.com> wrote:
> v3 attached. in v3:

1. I find the following quite strange:

postgres=# create table abcd (a int primary key, b int not null, c int
not null, d int not null, unique(b));
CREATE TABLE
postgres=# explain select b,c from abcd group by b,c;
                          QUERY PLAN
--------------------------------------------------------------
 HashAggregate  (cost=37.75..56.25 rows=1850 width=8)
   Group Key: b, c
   ->  Seq Scan on abcd  (cost=0.00..28.50 rows=1850 width=8)
(3 rows)


postgres=# alter table abcd drop constraint abcd_pkey;
ALTER TABLE
postgres=# explain select b,c from abcd group by b,c;
                          QUERY PLAN
--------------------------------------------------------------
 HashAggregate  (cost=33.12..51.62 rows=1850 width=8)
   Group Key: b
   ->  Seq Scan on abcd  (cost=0.00..28.50 rows=1850 width=8)
(3 rows)

Why does the patch only try to remove GROUP BY columns using unique
indexes when there's no primary key?

I don't really see why primary key should be treated specially. Why
does the patch not just unify the logic and collect up all unique
non-expression index, non-partial indexes where all columns are NOT
NULL. You could maybe just add a special case to skip the NOT NULL
checking for indexes with indisprimary set.

2. In get_unique_not_null_attnos(), not_null_attnos could be a
Bitmapset with attnums offset by FirstLowInvalidHeapAttributeNumber.
That'll allow you to do a bms_is_member() call rather than a (possibly
expensive) list_member_int() call.

3. The logic in remove_useless_groupby_columns() looks much more
complex than it needs to be. It would be much more simple to find the
matching unique index with the least number of columns and use that
one. Just have a counter to record the bms_num_members() of the
columns of the best match so far and replace it when you find an index
with fewer columns. Please see adjust_group_pathkeys_for_groupagg()
for inspiration.

We also need to think about if the unique index with the least columns
is always the best one to use. Perhaps the index with the least
columns is a varlena column and there's some index with, say, two
INT4s. It would be much cheaper to hash a couple of INT4s than some
long varlena column. Maybe it's ok just to leave an XXX comment
explaining that we might want to think about doing that one day.
Alternatively, summing up the pg_statistic.stawidth values could work.
Likely it's better to make the patch work correctly before thinking
about that part.

The patch could also use a spell check:

"a input" (an input)
"soure" source??
"GROUOP BY"

David



pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE
Next
From: Tatsuo Ishii
Date:
Subject: Re: Fix for Extra Parenthesis in pgbench progress message