Re: POC: GROUP BY optimization - Mailing list pgsql-hackers

From Robert Haas
Subject Re: POC: GROUP BY optimization
Date
Msg-id CA+Tgmob=CzWyiPLPtnTJhJynpvN6ocH=PZH8_bBQWEFB2ijvyw@mail.gmail.com
Whole thread Raw
In response to Re: POC: GROUP BY optimization  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: POC: GROUP BY optimization
List pgsql-hackers
On Sat, May 16, 2020 at 10:56 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> The local effects are trivial - it's for example reordering the pathkeys
> to make the explicit sort as cheap as possible. This thread already
> discussed a number of things to base this on - ndistinct for columns,
> cost of comparison function, ... In any case, this is something we can
> decide locally, when building the grouping paths.
>
> The global effects are much harder to tackle, because the decision can't
> be made locally when building the grouping paths. It requires changes
> both below and above the point where we build grouping paths.
>
> An example of a decision we need to make before we even get to building
> a grouping path is which index paths to build. Currently we only build
> index paths with "useful" pathkeys, and without tweaking that we'll
> never even see the index in add_paths_to_grouping_rel().
>
> But there are also decisions that can be made only after we build the
> grouping paths. For example, we may have both GROUP BY and ORDER BY, and
> there is no "always correct" way to combine those. In some cases it may
> be correct to use the same pathkeys, in other cases it's better to use
> different ones (which will require an extra Sort, with additional cost).
>
> So I don't think there will be a single "interesting" grouping pathkeys
> (i.e. root->group_pathkeys), but a collection of pathkeys. And we'll
> need to build grouping paths for all of those, and leave the planner to
> eventually pick the one giving us the cheapest plan ...

I agree with all of this and I think it's really good analysis. Part
of the reason why the planner isn't that sophisticated in this area is
that, for a long time, we didn't use paths at this level, and so it
was much harder to write any kind of viable patch to consider
alternatives. With Tom's planner path-ification word there should be a
lot more potential for optimization here, but, as you say, we need to
do that by leveraging the existing costing machinery, not just via
simple heuristics. It also strikes me that one of the problems in this
area is that the statistics we currently gather don't seem to be
entirely useful or reliable for aggregate planning. I wonder if there
are extensions to the extended statistics mechanism, or even just more
things we should gather during a routine ANALYZE, that would enable us
to estimate things better here. The most obvious thing is that
n_distinct is often wildly inaccurate, but it's deeper than that. For
instance, we need some way to estimate how many groups you're going to
get when you filter on a and then group by b that doesn't assume
uniform distribution. And we need also need something that doesn't
just output a number of groups, but gives you some inkling of the
sizes of those groups: 1 giant group and a bunch of little ones isn't
the same as a bunch of equally sized groups. I don't know that these
are things we really have much chance of figuring out with the
currently-available information, though.

Sorry if this is hijacking the thread a bit; I don't mean to
discourage work on this specific patch. I'm just wondering if we need
to think a little bigger to see our way to a good solution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Missing "Up" navigation link between parts and doc root?
Next
From: Jim Woodworth
Date:
Subject: may I help with Perl?