Re: POC: GROUP BY optimization - Mailing list pgsql-hackers
Hi! On Tue, Apr 23, 2024 at 1:19 AM Alexander Korotkov <aekorotkov@gmail.com> wrote: > On Thu, Apr 18, 2024 at 1:57 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > Thank you for the fixes you've proposed. I didn't look much into > > details yet, but I think the main concern Tom expressed in [1] is > > whether the feature is reasonable at all. I think at this stage the > > most important thing is to come up with convincing examples showing > > how huge performance benefits it could cause. I will return to this > > later today and will try to provide some convincing examples. > > I took a fresh look at 0452b461b, and have the following thoughts: > 1) Previously, preprocess_groupclause() tries to reorder GROUP BY > clauses to match the required ORDER BY order. It only reorders if > GROUP BY pathkeys are the prefix of ORDER BY pathkeys or vice versa. > So, both of them need to be satisfied by one ordering. 0452b461b also > tries to match GROUP BY clauses to ORDER BY clauses, but takes into > account an incremental sort. Actually, instead of removing > preprocess_groupclause(), we could just teach it to take incremental > sort into account. > 2) The get_useful_group_keys_orderings() function takes into account 3 > orderings of pathkeys and clauses: original order as written in GROUP > BY, matching ORDER BY clauses as much as possible, and matching the > input path as much as possible. Given that even before 0452b461b, > preprocess_groupclause() could change the original order of GROUP BY > clauses, so we don't need to distinguish the first two. We could just > consider output of new preprocess_groupclause() taking into account an > incremental sort and the ordering matching the input path as much as > possible. This seems like significant simplification. > > Let me also describe my thoughts about the justification of the > feature itself. As Tom pointed upthread, Sort + Grouping is generally > unlikely faster than Hash Aggregate. The main point of this feature > is being able to match the order of GROUP BY clauses to the order of > the input path. That input path could be Index Scan or Subquery. > Let's concentrate on Index Scan. Index Scan can give us the required > order, so we can do grouping without Sort or with significantly > cheaper Incremental Sort. That could be much faster than Hash > Aggregate. But if we scan the full table (or its significant > fraction), Index Scan is typically much more expensive than Sequential > Scan because of random page access. However, there are cases when > Index Scan could be cheaper. > 1) If the heap row is wide and the index contains all the required > columns, Index Only Scan can be cheaper than Sequential Scan because > of lesser volume. > 2) If the query predicate is selective enough and matches the index, > Index Scan might be significantly cheaper. One may say that if the > query predicate is selective enough then there are not that many > matching rows, so aggregating them either way isn't a big deal anyway. > However, given that data volumes are growing tremendously, it's not > hard to imagine that the index selected a small fraction of table > rows, but they are still enough to be costly for aggregating. > Therefore, I think there are significant cases where matching GROUP BY > clauses to the order of the input path could give a substantial > improvement over Hash Aggregate. > > While there are some particular use-cases by Jian He, I hope that > above could give some rationale. I've assembled patches in this thread into one patchset. 0001 The patch fixing asymmetry in setting EquivalenceClass.ec_sortref by Andrei [1]. I've revised comments and wrote the commit message. 0002 The patch for handling duplicates of SortGroupClause. I didn't get the sense of Andrei implementation. It seems to care about duplicate pointers in group clauses list. But the question is the equal SortGroupClause's comprising different pointers. I think we should group duplicate SortGroupClause's together as preprocess_groupclause() used to do. Reimplemented patch to do so. 0003 Rename PathKeyInfo to GroupByOrdering by Andres [3]. I only revised comments and wrote the commit message. 0004 Turn back preprocess_groupclause() for the reason I described upthread [4]. Any thoughts? Links. 1. https://www.postgresql.org/message-id/17037754-f187-4138-8285-0e2bfebd0dea%40postgrespro.ru 2. https://www.postgresql.org/message-id/a663f0f6-cbf6-49aa-af2e-234dc6768a07%40postgrespro.ru 3. https://www.postgresql.org/message-id/db0fc3a4-966c-4cec-a136-94024d39212d%40postgrespro.ru 4. https://www.postgresql.org/message-id/CAPpHfdvyWLMGwvxaf%3D7KAp-z-4mxbSH8ti2f6mNOQv5metZFzg%40mail.gmail.com ------ Regards, Alexander Korotkov Supabase
Attachment
pgsql-hackers by date: