Thread: Group by reordering optimization

Group by reordering optimization

From

Dmitry Dolgov

Date:

01 September 2020, 11:15:31

Hi,

Better late than never, to follow up on the original thread [1] I would like to
continue the discussion with the another version of the patch for group by
reordering optimization. To remind, it's about reordering of group by clauses
to do sorting more efficiently. The patch is rebased and modified to address
(at least partially) the suggestions about making it consider new additional
paths instead of changing original ones. It is still pretty much
proof-of-concept version though with many blind spots, but I wanted to start
kicking it and post at least something, otherwise it will never happen. An
incremental approach so to say.

In many ways it still contains the original code from Teodor. Changes and notes:

* Instead of changing the order directly, now patch creates another patch with
modifier order of clauses. It does so for the normal sort as well as for
incremental sort. The whole thing is done in two steps: first it finds a
potentially better ordering taking into account number of groups, widths and
comparison costs; afterwards this information is used to produce a cost
estimation. This is implemented via a separate create_reordered_sort_path to
not introduce too many changes, I couldn't find any better place.

* Function get_func_cost was removed at some point, but unfortunately this
patch was implemented before that, so it's still present there.

* For simplicity I've removed support in create_partial_grouping_paths, since
they were not covered by the existing tests anyway.

* The costing part is pretty rudimentary and looks only at the first group.
It's mostly hand crafted to pass the existing tests.

The question about handling skewed data sets is not addressed yet.

[1]: https://www.postgresql.org/message-id/flat/7c79e6a5-8597-74e8-0671-1c39d124c9d6%40sigaev.ru

Attachment

0001-Group-by-optimization.patch

Re: Group by reordering optimization

From

Tomas Vondra

Date:

01 September 2020, 21:08:54

On Tue, Sep 01, 2020 at 01:15:31PM +0200, Dmitry Dolgov wrote:
>Hi,
>
>Better late than never, to follow up on the original thread [1] I would like to
>continue the discussion with the another version of the patch for group by
>reordering optimization. To remind, it's about reordering of group by clauses
>to do sorting more efficiently. The patch is rebased and modified to address
>(at least partially) the suggestions about making it consider new additional
>paths instead of changing original ones. It is still pretty much
>proof-of-concept version though with many blind spots, but I wanted to start
>kicking it and post at least something, otherwise it will never happen. An
>incremental approach so to say.
>
>In many ways it still contains the original code from Teodor. Changes and notes:
>
>* Instead of changing the order directly, now patch creates another patch with
>  modifier order of clauses. It does so for the normal sort as well as for
>  incremental sort. The whole thing is done in two steps: first it finds a
>  potentially better ordering taking into account number of groups, widths and
>  comparison costs; afterwards this information is used to produce a cost
>  estimation. This is implemented via a separate create_reordered_sort_path to
>  not introduce too many changes, I couldn't find any better place.
>

I haven't tested the patch with any queries, but I agree this seems like
the right approach in general.

I'm a bit worried about how complex the code in planner.c is getting -
the incremental sort patch already made it a bit too complex, and this
is just another step in that direction.  I suppose we should refactor
add_paths_to_grouping_rel() by breaking it into smaller / more readable
pieces ...

>* Function get_func_cost was removed at some point, but unfortunately this
>  patch was implemented before that, so it's still present there.
>
>* For simplicity I've removed support in create_partial_grouping_paths, since
>  they were not covered by the existing tests anyway.
>

Hmmm, OK. I think that's something we'll need to address for the final
patch, but I agree we can add it after improving the costing etc.

>* The costing part is pretty rudimentary and looks only at the first group.
>  It's mostly hand crafted to pass the existing tests.
>

OK, understood.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Group by reordering optimization

From

Peter Geoghegan

Date:

01 September 2020, 22:09:14

On Tue, Sep 1, 2020 at 2:09 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> >* Instead of changing the order directly, now patch creates another patch with
> >  modifier order of clauses. It does so for the normal sort as well as for
> >  incremental sort. The whole thing is done in two steps: first it finds a
> >  potentially better ordering taking into account number of groups, widths and
> >  comparison costs; afterwards this information is used to produce a cost
> >  estimation. This is implemented via a separate create_reordered_sort_path to
> >  not introduce too many changes, I couldn't find any better place.
> >
>
> I haven't tested the patch with any queries, but I agree this seems like
> the right approach in general.

If we're creating a new sort path anyway, then perhaps we can also
change the collation -- it might be possible to "reduce" it to the "C"
collation without breaking queries.

This is admittedly pretty hard to do well. It could definitely work
out when we have to do a sort anyway -- a sort with high cardinality
abbreviated keys will be very fast (though we can't use abbreviated
keys with libc collations right now). OTOH, it would be quite
counterproductive if we were hoping to get an incremental sort that
used some available index that happens to use the default collation
(which is not the C collation in cases where this optimization is
expected to help).

-- 
Peter Geoghegan

Re: Group by reordering optimization

From

Tomas Vondra

Date:

02 September 2020, 16:12:01

On Tue, Sep 01, 2020 at 03:09:14PM -0700, Peter Geoghegan wrote:
>On Tue, Sep 1, 2020 at 2:09 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> >* Instead of changing the order directly, now patch creates another patch with
>> >  modifier order of clauses. It does so for the normal sort as well as for
>> >  incremental sort. The whole thing is done in two steps: first it finds a
>> >  potentially better ordering taking into account number of groups, widths and
>> >  comparison costs; afterwards this information is used to produce a cost
>> >  estimation. This is implemented via a separate create_reordered_sort_path to
>> >  not introduce too many changes, I couldn't find any better place.
>> >
>>
>> I haven't tested the patch with any queries, but I agree this seems like
>> the right approach in general.
>
>If we're creating a new sort path anyway, then perhaps we can also
>change the collation -- it might be possible to "reduce" it to the "C"
>collation without breaking queries.
>
>This is admittedly pretty hard to do well. It could definitely work
>out when we have to do a sort anyway -- a sort with high cardinality
>abbreviated keys will be very fast (though we can't use abbreviated
>keys with libc collations right now). OTOH, it would be quite
>counterproductive if we were hoping to get an incremental sort that
>used some available index that happens to use the default collation
>(which is not the C collation in cases where this optimization is
>expected to help).
>

Even if reducing collations like this was possible (I have no idea how
tricky it is, my knowledge of collations is pretty minimal and from what
I know I'm not dying to learn more), I suggest we consider that out of
scope for this particular patch.

There are multiple open issues already - deciding which pathkeys are
interesting, reasonable costing, etc. Once those issues are solved, we
can consider tweaking collations as an additional optimizations.

Or maybe we can consider it entirely separately, i.e. why would it
matter if we re-order the GROUP BY keys? The collation reduction can
just as well help even if we use the same pathkeys.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Group by reordering optimization

From

Dmitry Dolgov

Date:

16 September 2020, 14:20:39

> On Tue, Sep 01, 2020 at 11:08:54PM +0200, Tomas Vondra wrote:
> On Tue, Sep 01, 2020 at 01:15:31PM +0200, Dmitry Dolgov wrote:
> > Hi,
> > 
> > Better late than never, to follow up on the original thread [1] I would like to
> > continue the discussion with the another version of the patch for group by
> > reordering optimization. To remind, it's about reordering of group by clauses
> > to do sorting more efficiently. The patch is rebased and modified to address
> > (at least partially) the suggestions about making it consider new additional
> > paths instead of changing original ones. It is still pretty much
> > proof-of-concept version though with many blind spots, but I wanted to start
> > kicking it and post at least something, otherwise it will never happen. An
> > incremental approach so to say.
> > 
> > In many ways it still contains the original code from Teodor. Changes and notes:
> > 
> > * Instead of changing the order directly, now patch creates another patch with
> >  modifier order of clauses. It does so for the normal sort as well as for
> >  incremental sort. The whole thing is done in two steps: first it finds a
> >  potentially better ordering taking into account number of groups, widths and
> >  comparison costs; afterwards this information is used to produce a cost
> >  estimation. This is implemented via a separate create_reordered_sort_path to
> >  not introduce too many changes, I couldn't find any better place.
> > 
> 
> I haven't tested the patch with any queries, but I agree this seems like
> the right approach in general.
> 
> I'm a bit worried about how complex the code in planner.c is getting -
> the incremental sort patch already made it a bit too complex, and this
> is just another step in that direction.  I suppose we should refactor
> add_paths_to_grouping_rel() by breaking it into smaller / more readable
> pieces ...

Yes, that was my impression as well. I'll try to make such refactoring
either as a separate patch or a part of the main one.

> > * Function get_func_cost was removed at some point, but unfortunately this
> >  patch was implemented before that, so it's still present there.
> > 
> > * For simplicity I've removed support in create_partial_grouping_paths, since
> >  they were not covered by the existing tests anyway.
> > 
> 
> Hmmm, OK. I think that's something we'll need to address for the final
> patch, but I agree we can add it after improving the costing etc.

Sure, I plan to return it in time. IIUC in the original patch series
this code wasn't covered with tests, so I've decided to minimize the
changes.

Re: POC: GROUP BY optimization

From

"Andrey V. Lepikhov"

Date:

28 December 2021, 08:47:35

On 9/2/20 9:12 PM, Tomas Vondra wrote:
 > We could simply use the input "tuples" value here, and then divide the
 > current and previous estimate to calculate the number of new groups.

Performing a review of this patch I made a number of changes (see 
cleanup.txt). Maybe it will be useful.
As I see, the code, which implements the main idea, is quite stable. 
Doubts localized in the cost estimation routine. Maybe try to finish 
this work by implementing an conservative strategy to a cost estimation 
of sorting?

-- 
regards,
Andrey Lepikhov
Postgres Professional