Re: POC: GROUP BY optimization - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: POC: GROUP BY optimization |
Date | |
Msg-id | CA+Tgmob=CzWyiPLPtnTJhJynpvN6ocH=PZH8_bBQWEFB2ijvyw@mail.gmail.com Whole thread Raw |
In response to | Re: POC: GROUP BY optimization (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses |
Re: POC: GROUP BY optimization
|
List | pgsql-hackers |
On Sat, May 16, 2020 at 10:56 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > The local effects are trivial - it's for example reordering the pathkeys > to make the explicit sort as cheap as possible. This thread already > discussed a number of things to base this on - ndistinct for columns, > cost of comparison function, ... In any case, this is something we can > decide locally, when building the grouping paths. > > The global effects are much harder to tackle, because the decision can't > be made locally when building the grouping paths. It requires changes > both below and above the point where we build grouping paths. > > An example of a decision we need to make before we even get to building > a grouping path is which index paths to build. Currently we only build > index paths with "useful" pathkeys, and without tweaking that we'll > never even see the index in add_paths_to_grouping_rel(). > > But there are also decisions that can be made only after we build the > grouping paths. For example, we may have both GROUP BY and ORDER BY, and > there is no "always correct" way to combine those. In some cases it may > be correct to use the same pathkeys, in other cases it's better to use > different ones (which will require an extra Sort, with additional cost). > > So I don't think there will be a single "interesting" grouping pathkeys > (i.e. root->group_pathkeys), but a collection of pathkeys. And we'll > need to build grouping paths for all of those, and leave the planner to > eventually pick the one giving us the cheapest plan ... I agree with all of this and I think it's really good analysis. Part of the reason why the planner isn't that sophisticated in this area is that, for a long time, we didn't use paths at this level, and so it was much harder to write any kind of viable patch to consider alternatives. With Tom's planner path-ification word there should be a lot more potential for optimization here, but, as you say, we need to do that by leveraging the existing costing machinery, not just via simple heuristics. It also strikes me that one of the problems in this area is that the statistics we currently gather don't seem to be entirely useful or reliable for aggregate planning. I wonder if there are extensions to the extended statistics mechanism, or even just more things we should gather during a routine ANALYZE, that would enable us to estimate things better here. The most obvious thing is that n_distinct is often wildly inaccurate, but it's deeper than that. For instance, we need some way to estimate how many groups you're going to get when you filter on a and then group by b that doesn't assume uniform distribution. And we need also need something that doesn't just output a number of groups, but gives you some inkling of the sizes of those groups: 1 giant group and a bunch of little ones isn't the same as a bunch of equally sized groups. I don't know that these are things we really have much chance of figuring out with the currently-available information, though. Sorry if this is hijacking the thread a bit; I don't mean to discourage work on this specific patch. I'm just wondering if we need to think a little bigger to see our way to a good solution. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: