Home > mailing lists

Re: POC: GROUP BY optimization - Mailing list pgsql-hackers

From	Alexander Korotkov
Subject	Re: POC: GROUP BY optimization
Date	May 27, 2024 12:41:25
Msg-id	CAPpHfdvbxK3nszdk6YXWpV7p+P+pGOQhpKTxJ3xME3uD+qZmXg@mail.gmail.com Whole thread
In response to	Re: POC: GROUP BY optimization (Alexander Korotkov <aekorotkov@gmail.com>)
Responses	About 0001:,Having overviewed it, I don't see any issues (but I'm the author), except grammatical ones - but I'm not a native to judge it.,Also, the sentence 'turning GROUP BY clauses into pathkeys' is unclear to me. It may be better to write something like: 'building pathkeys by the list of grouping clauses'.,,0002:,The part under USE_ASSERT_CHECKING looks good to me. But the code in group_keys_reorder_by_pathkeys looks suspicious: of course, we do some doubtful work without any possible way to reproduce, but if we envision some duplicated elements in the group_clauses, we should avoid usage of the list_concat_unique_ptr. What's more, why do you not exit from foreach_ptr immediately after SortGroupClause has been found? I think the new_group_clauses should be consistent with the new_group_pathkeys.,,0003:,Looks good,,0004:,I was also thinking about reintroducing the preprocess_groupclause because with the re-arrangement of GROUP-BY clauses according to incoming pathkeys, it doesn't make sense to have a user-defined order—at least while cost_sort doesn't differ costs for alternative column orderings.,So, I'm okay with the code. But why don't you use the same approach with foreach_ptr as before?
List	pgsql-hackers

Tree view

Hi!

On Tue, Apr 23, 2024 at 1:19 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Thu, Apr 18, 2024 at 1:57 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > Thank you for the fixes you've proposed.  I didn't look much into
> > details yet, but I think the main concern Tom expressed in [1] is
> > whether the feature is reasonable at all.  I think at this stage the
> > most important thing is to come up with convincing examples showing
> > how huge performance benefits it could cause.  I will return to this
> > later today and will try to provide some convincing examples.
>
> I took a fresh look at 0452b461b, and have the following thoughts:
> 1) Previously, preprocess_groupclause() tries to reorder GROUP BY
> clauses to match the required ORDER BY order.  It only reorders if
> GROUP BY pathkeys are the prefix of ORDER BY pathkeys or vice versa.
> So, both of them need to be satisfied by one ordering.  0452b461b also
> tries to match GROUP BY clauses to ORDER BY clauses, but takes into
> account an incremental sort.  Actually, instead of removing
> preprocess_groupclause(), we could just teach it to take incremental
> sort into account.
> 2) The get_useful_group_keys_orderings() function takes into account 3
> orderings of pathkeys and clauses: original order as written in GROUP
> BY, matching ORDER BY clauses as much as possible, and matching the
> input path as much as possible.  Given that even before 0452b461b,
> preprocess_groupclause() could change the original order of GROUP BY
> clauses, so we don't need to distinguish the first two.  We could just
> consider output of new preprocess_groupclause() taking into account an
> incremental sort and the ordering matching the input path as much as
> possible.  This seems like significant simplification.
>
> Let me also describe my thoughts about the justification of the
> feature itself.  As Tom pointed upthread, Sort + Grouping is generally
> unlikely faster than Hash Aggregate.  The main point of this feature
> is being able to match the order of GROUP BY clauses to the order of
> the input path.  That input path could be Index Scan or Subquery.
> Let's concentrate on Index Scan.  Index Scan can give us the required
> order, so we can do grouping without Sort or with significantly
> cheaper Incremental Sort.  That could be much faster than Hash
> Aggregate.  But if we scan the full table (or its significant
> fraction), Index Scan is typically much more expensive than Sequential
> Scan because of random page access.  However, there are cases when
> Index Scan could be cheaper.
> 1) If the heap row is wide and the index contains all the required
> columns, Index Only Scan can be cheaper than Sequential Scan because
> of lesser volume.
> 2) If the query predicate is selective enough and matches the index,
> Index Scan might be significantly cheaper.  One may say that if the
> query predicate is selective enough then there are not that many
> matching rows, so aggregating them either way isn't a big deal anyway.
> However, given that data volumes are growing tremendously, it's not
> hard to imagine that the index selected a small fraction of table
> rows, but they are still enough to be costly for aggregating.
> Therefore, I think there are significant cases where matching GROUP BY
> clauses to the order of the input path could give a substantial
> improvement over Hash Aggregate.
>
> While there are some particular use-cases by Jian He, I hope that
> above could give some rationale.

I've assembled patches in this thread into one patchset.
0001 The patch fixing asymmetry in setting EquivalenceClass.ec_sortref
by Andrei [1].  I've revised comments and wrote the commit message.
0002 The patch for handling duplicates of SortGroupClause.  I didn't
get the sense of Andrei implementation.  It seems to care about
duplicate pointers in group clauses list.  But the question is the
equal SortGroupClause's comprising different pointers.  I think we
should group duplicate SortGroupClause's together as
preprocess_groupclause() used to do.  Reimplemented patch to do so.
0003 Rename PathKeyInfo to GroupByOrdering by Andres [3].  I only
revised comments and wrote the commit message.
0004 Turn back preprocess_groupclause() for the reason I described upthread [4].

Any thoughts?

Links.
1. https://www.postgresql.org/message-id/17037754-f187-4138-8285-0e2bfebd0dea%40postgrespro.ru
2. https://www.postgresql.org/message-id/a663f0f6-cbf6-49aa-af2e-234dc6768a07%40postgrespro.ru
3. https://www.postgresql.org/message-id/db0fc3a4-966c-4cec-a136-94024d39212d%40postgrespro.ru
4. https://www.postgresql.org/message-id/CAPpHfdvyWLMGwvxaf%3D7KAp-z-4mxbSH8ti2f6mNOQv5metZFzg%40mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase

Attachment

pgsql-hackers by date:

From: Ranier Vilela
Date: 27 May 2024, 12:25:45
Subject: Fix calloc check if oom (PQcancelCreate)

From: Robert Haas
Date: 27 May 2024, 13:10:13
Subject: Re: relfilenode statistics

Re: POC: GROUP BY optimization - Mailing list pgsql-hackers

Attachment

Previous

Next