Re: POC: GROUP BY optimization - Mailing list pgsql-hackers
From | Dmitry Dolgov |
---|---|
Subject | Re: POC: GROUP BY optimization |
Date | |
Msg-id | 20200620143010.z3af7wa4v22h4lu4@localhost Whole thread Raw |
In response to | Re: POC: GROUP BY optimization (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses |
Re: POC: GROUP BY optimization
|
List | pgsql-hackers |
> On Sat, May 16, 2020 at 04:56:09PM +0200, Tomas Vondra wrote: > > So I don't think there will be a single "interesting" grouping pathkeys > (i.e. root->group_pathkeys), but a collection of pathkeys. And we'll > need to build grouping paths for all of those, and leave the planner to > eventually pick the one giving us the cheapest plan ... > > A "brute-force" approach would be to generate all possible orderings of > group_pathkeys, but that's probably not going to fly - it might easily > cause an explosion in number of paths we track, etc. So we'll need to > pick/prioritize orderings that are more likely to give us efficient > plans, and ORDER BY seems like a good example because it means we won't > need an explicit sort. Yes, I see. I've already rebased the original version of patch and now working on your suggestions. In the meantime one question: > But there are also decisions that can be made only after we build the > grouping paths. For example, we may have both GROUP BY and ORDER BY, and > there is no "always correct" way to combine those. In some cases it may > be correct to use the same pathkeys, in other cases it's better to use > different ones (which will require an extra Sort, with additional cost). Do I understand correctly, your idea is that in some cases it's cheaper to use different order for GROUP BY than in ORDER BY even with an extra sorting? In the current patch implementation there is an assumption that says it's always better to match the order of pathkeys for both GROUP BY/ORDER BY (which means that the only degree of freedom there is to reorder the tail, which in turn makes both "unsorted input" and "partially sorted input" cases from your original email essentially the same). Can you show such an example when this assumption is invalid? > I've only quickly skimmed the old thread, but IIRC there were two main > challenges in getting the optimization right: > > 1) deciding which orderings are interesting / worth additional work > > I think we need to consider these orderings, in addition to the one > specified in GROUP BY: Yes, looks like the current patch implementation together with preprocess_groupclause already implements this, although maybe not that flexible. > 2) costing the alternative orderings > > I think we've already discussed various ways to leverage as much > available info as possible (extended stats, MCVs, ...) but I think the > patch only does some of it. Right, there were couple of ideas what to do in case of a few groups which are too big they can invalidate current assumptions about costs. E.g. do not apply reordering if we detected such situation, or use "conservative" approach and take the biggest group for estimations.
pgsql-hackers by date: