Re: POC: GROUP BY optimization - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: POC: GROUP BY optimization
Date
Msg-id 20200516145609.vm7nrqy7frj4ha6r@development
Whole thread Raw
In response to Re: POC: GROUP BY optimization  (Dmitry Dolgov <9erthalion6@gmail.com>)
Responses Re: POC: GROUP BY optimization
Re: POC: GROUP BY optimization
List pgsql-hackers
On Sat, May 16, 2020 at 02:24:31PM +0200, Dmitry Dolgov wrote:
>> On Fri, May 15, 2020 at 01:52:20AM +0200, Tomas Vondra wrote:
>>
>> I wonder if anyone has plans to try again with this optimization in v14
>> cycle? The patches no longer apply thanks to the incremental sort patch,
>> but I suppose fixing that should not be extremely hard.
>>
>> The 2020-07 CF is still a couple weeks away, but it'd be good to know if
>> there are any plans to revive this. I'm willing to spend some time on
>> reviewing / testing this, etc.
>
>Yes, if you believe that this patch has potential, I would love to pick
>it up again.
>

I think it's worth another try. I've been reminded of this limitation
when working on the incremental sort, which significantly increases the
number of orderings that we can sort efficiently. But we can't possibly
leverage that unless it matches the GROUP BY ordering.

The attached WIP patch somewhat addresses this by trying to reorder the
group_pathkeys so allow leveraging sort of existing ordering (even just
partial, with incremental sort).

>> I've only quickly skimmed the old thread, but IIRC there were two main
>> challenges in getting the optimization right:
>>
>>
>> 1) deciding which orderings are interesting / worth additional work
>>
>> I think we need to consider these orderings, in addition to the one
>> specified in GROUP BY:
>>
>> 1) as specified in ORDER BY (if different from 1)
>
>What is the idea behind considering this ordering?

I'll try to answer this in general, i.e. what I think this patch needs
to consider. Hopefully that'll answer why we should look at ORDER BY
pathkeys too ...

Reordering the pathkeys has both local and global effects, and we need
to consider all of that to make the optimization reliable, otherwise
we'll inevitably end up with trivial regressions.


The local effects are trivial - it's for example reordering the pathkeys
to make the explicit sort as cheap as possible. This thread already
discussed a number of things to base this on - ndistinct for columns,
cost of comparison function, ... In any case, this is something we can
decide locally, when building the grouping paths.


The global effects are much harder to tackle, because the decision can't
be made locally when building the grouping paths. It requires changes
both below and above the point where we build grouping paths.

An example of a decision we need to make before we even get to building
a grouping path is which index paths to build. Currently we only build
index paths with "useful" pathkeys, and without tweaking that we'll
never even see the index in add_paths_to_grouping_rel().

But there are also decisions that can be made only after we build the
grouping paths. For example, we may have both GROUP BY and ORDER BY, and
there is no "always correct" way to combine those. In some cases it may
be correct to use the same pathkeys, in other cases it's better to use
different ones (which will require an extra Sort, with additional cost).


So I don't think there will be a single "interesting" grouping pathkeys
(i.e. root->group_pathkeys), but a collection of pathkeys. And we'll
need to build grouping paths for all of those, and leave the planner to
eventually pick the one giving us the cheapest plan ...

A "brute-force" approach would be to generate all possible orderings of
group_pathkeys, but that's probably not going to fly - it might easily
cause an explosion in number of paths we track, etc. So we'll need to
pick/prioritize orderings that are more likely to give us efficient
plans, and ORDER BY seems like a good example because it means we won't
need an explicit sort.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Our naming of wait events is a disaster.
Next
From: Tom Lane
Date:
Subject: Re: pg_bsd_indent and -Wimplicit-fallthrough