Thread: Reordering DISTINCT keys to match input path's pathkeys

Reordering DISTINCT keys to match input path's pathkeys

From

Richard Guo

Date:

23 January 2024, 05:55:54

Similar to what we did to GROUP BY keys in 0452b461bc, I think we can do
the same to DISTINCT keys, i.e. reordering DISTINCT keys to match input
path's pathkeys, which can help reduce cost by avoiding unnecessary
re-sort, or allowing us to use incremental-sort to save efforts. For
instance,

create table t (a int, b int);
create index on t (a, b);

explain (costs off) select distinct b, a from t limit 10;
QUERY PLAN
--------------------------------------------------
Limit
-> Unique
-> Index Only Scan using t_a_b_idx on t
(3 rows)

Please note that the parser has ensured that the DISTINCT pathkeys
matches the order of ORDER BY clauses. So there is no need to do this
part again.

In principle, we can perform such reordering for DISTINCT ON too, but we
need to make sure that the resulting pathkeys matches initial ORDER BY
keys, which seems not trivial. So it doesn't seem worth the effort.

Attached is a patch for this optimization. Any thoughts?

Thanks
Richard

Attachment

v1-0001-Reordering-DISTINCT-keys-to-match-input-path-s-pathkeys.patch

Re: Reordering DISTINCT keys to match input path's pathkeys

From

Laurenz Albe

Date:

23 January 2024, 08:04:22

On Tue, 2024-01-23 at 13:55 +0800, Richard Guo wrote:
> Similar to what we did to GROUP BY keys in 0452b461bc, I think we can do
> the same to DISTINCT keys, i.e. reordering DISTINCT keys to match input
> path's pathkeys, which can help reduce cost by avoiding unnecessary
> re-sort, or allowing us to use incremental-sort to save efforts.
>
> Attached is a patch for this optimization.  Any thoughts?

I didn't scrutinize the code, but that sounds like a fine optimization.

Yours,
Laurenz Albe

Re: Reordering DISTINCT keys to match input path's pathkeys

From

David Rowley

Date:

23 January 2024, 09:03:23

On Tue, 23 Jan 2024 at 18:56, Richard Guo <guofenglinux@gmail.com> wrote:
>
> Similar to what we did to GROUP BY keys in 0452b461bc, I think we can do
> the same to DISTINCT keys, i.e. reordering DISTINCT keys to match input
> path's pathkeys, which can help reduce cost by avoiding unnecessary
> re-sort, or allowing us to use incremental-sort to save efforts.  For
> instance,

I've not caught up on the specifics of 0452b461b, but I just wanted to
highlight that there was some work done in [1] in this area.  It seems
Ankit didn't ever add that to a CF, so that might explain why it's
been lost.

Anyway, just pointing it out as there may be useful code or discussion
in the corresponding threads.

David

[1] https://postgr.es/m/da9425ae-8ff7-33d9-23b3-2a3eb605e106%40gmail.com

Re: Reordering DISTINCT keys to match input path's pathkeys

From

Richard Guo

Date:

26 January 2024, 10:48:39

On Tue, Jan 23, 2024 at 5:03 PM David Rowley <dgrowleyml@gmail.com> wrote:

I've not caught up on the specifics of 0452b461b, but I just wanted to
highlight that there was some work done in [1] in this area. It seems
Ankit didn't ever add that to a CF, so that might explain why it's
been lost.

Anyway, just pointing it out as there may be useful code or discussion
in the corresponding threads.

Thanks for pointing it out. I looked at the patch there and noticed
several problems with it.

* That patch is incomplete and does not work as expected. It at least
needs to modify truncate_useless_pathkeys() to account for DISTINCT
clause (I think this has been mentioned in that thread).

* That patch would not consider the origin DISTINCT pathkeys if it could
do some reordering, which is not great and can generate inefficient
plans. For instance (after fixing the first problem)

create table t (a int, b int);
create index on t(a);

set enable_hashagg to off;
set enable_incremental_sort to off;
set enable_seqscan to off;

explain (costs off) select distinct b, a from t order by b, a;
QUERY PLAN
-------------------------------------------------
Sort
Sort Key: b, a
-> Unique
-> Sort
Sort Key: a, b
-> Index Scan using t_a_idx on t
(6 rows)

Using DISTINCT pathkeys {b, a} is more efficient for this plan, because
only one Sort would be required. But that patch is not able to do that,
because it does not consider the origin DISTINCT pathkeys after
reordering.

Thanks
Richard

Re: Reordering DISTINCT keys to match input path's pathkeys

From

Richard Guo

Date:

05 February 2024, 03:18:18

cfbot reminds that this patch does not apply any more. So I've rebased
it on master, and also adjusted the test cases a bit.

Thanks
Richard

Attachment

v2-0001-Reordering-DISTINCT-keys-to-match-input-path-s-pathkeys.patch

Re: Reordering DISTINCT keys to match input path's pathkeys

From

Richard Guo

Date:

07 June 2024, 09:46:54

On Mon, Feb 5, 2024 at 11:18 AM Richard Guo <guofenglinux@gmail.com> wrote:
> cfbot reminds that this patch does not apply any more.  So I've rebased
> it on master, and also adjusted the test cases a bit.

This patch does not apply any more, so here is a new rebase, with some
tweaks to the comments.

Thanks
Richard

Attachment

v3-0001-Reordering-DISTINCT-keys-to-match-input-path-s-pathkeys.patch

Re: Reordering DISTINCT keys to match input path's pathkeys

From

Andrei Lepikhov

Date:

28 October 2024, 12:15:09

On 6/7/24 16:46, Richard Guo wrote:
> On Mon, Feb 5, 2024 at 11:18 AM Richard Guo <guofenglinux@gmail.com> wrote:
>> cfbot reminds that this patch does not apply any more.  So I've rebased
>> it on master, and also adjusted the test cases a bit.
> 
> This patch does not apply any more, so here is a new rebase, with some
> tweaks to the comments.
This patch needs a minor rebase again.
After skimming the code, I want to say that it looks good. But maybe to 
avoid one more *_reordering GUC - it would be better to cover all path 
key reorderings under a single GUC.

-- 
regards, Andrei Lepikhov

Re: Reordering DISTINCT keys to match input path's pathkeys

From

Andrei Lepikhov

Date:

13 November 2024, 13:36:10

On 11/13/24 13:49, Richard Guo wrote:
> On Mon, Oct 28, 2024 at 6:15 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
>> On 6/7/24 16:46, Richard Guo wrote:
>>> This patch does not apply any more, so here is a new rebase, with some
>>> tweaks to the comments.
> 
>> This patch needs a minor rebase again.
>> After skimming the code, I want to say that it looks good. But maybe to
>> avoid one more *_reordering GUC - it would be better to cover all path
>> key reorderings under a single GUC.
> 
> Thanks for reviewing this patch.  After some consideration, I think
> it's not too complex to also apply this optimization to DISTINCT ON.
> The parser already ensures that the DISTINCT ON expressions match the
> initial ORDER BY expressions; we just need to ensure that the
> resulting pathkey list from the reordering matches the original
> distinctClause pathkeys, while leaving the remaining pathkeys
> unchanged in order.  Please see attached.
Thanks, I'll discover it later.
BTW have you ever thought about one more, cost-based reordering strategy?
For now, we can reorder GROUP-BY and distinct clauses according to two 
orderings: 1) ORDER-BY order and 2) order derived from the underlying 
query tree.
In thread [1], I try to add one more strategy that minimises the number 
of comparison operator calls. It seems that it would work the same way 
with the DISTINCT statement. Do you think it make sense in general and 
can be a possible direction of improvement for the current patch?

> 
> I'm not sure about merging these two 'reordering' GUCs into one.
> While they may look similar, they apply to very different scenarios.
> However, I'm open to other suggestions.
Sure, they enable different optimisations. But, they enable highly 
specialised optimisations. Having two GUCs looks too expensive.
Moreover, this stuff is cost-based and should work automatically. So, I 
treat these GUCs as mostly debugging or last-chance stuff used to 
disable features during severe slowdowns or bugs. It might make sense to 
group them into a single 'Clause Reordering' parameter.

[1] 
https://www.postgresql.org/message-id/flat/8742aaa8-9519-4a1f-91bd-364aec65f5cf%40gmail.com

-- 
regards, Andrei Lepikhov

Re: Reordering DISTINCT keys to match input path's pathkeys

From

Richard Guo

Date:

14 November 2024, 04:09:32

On Wed, Nov 13, 2024 at 7:36 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
> On 11/13/24 13:49, Richard Guo wrote:
> > Thanks for reviewing this patch.  After some consideration, I think
> > it's not too complex to also apply this optimization to DISTINCT ON.
> > The parser already ensures that the DISTINCT ON expressions match the
> > initial ORDER BY expressions; we just need to ensure that the
> > resulting pathkey list from the reordering matches the original
> > distinctClause pathkeys, while leaving the remaining pathkeys
> > unchanged in order.  Please see attached.

> BTW have you ever thought about one more, cost-based reordering strategy?
> For now, we can reorder GROUP-BY and distinct clauses according to two
> orderings: 1) ORDER-BY order and 2) order derived from the underlying
> query tree.
> In thread [1], I try to add one more strategy that minimises the number
> of comparison operator calls. It seems that it would work the same way
> with the DISTINCT statement. Do you think it make sense in general and
> can be a possible direction of improvement for the current patch?

I haven’t had a chance to follow that thread.  From a quick look at
that patch, it seems to improve the general costing logic for sorting.
If that’s the case, I think it would be beneficial in the areas where
we use cost_sort(), including in this patch.

> > I'm not sure about merging these two 'reordering' GUCs into one.
> > While they may look similar, they apply to very different scenarios.
> > However, I'm open to other suggestions.
> Sure, they enable different optimisations. But, they enable highly
> specialised optimisations. Having two GUCs looks too expensive.
> Moreover, this stuff is cost-based and should work automatically. So, I
> treat these GUCs as mostly debugging or last-chance stuff used to
> disable features during severe slowdowns or bugs. It might make sense to
> group them into a single 'Clause Reordering' parameter.

I don't want to argue too much about this, but the two types of
optimizations are quite different in terms of implementation details.
For example, with DISTINCT ON, one key requirement is ensuring that
the resulting pathkey list matches the original distinctClause
pathkeys, and this logic doesn’t exist in the GROUP BY reordering
code.  If these GUCs are mainly for debugging, I think it's better to
keep them separate so that we can debug each optimization individually.

Thanks
Richard

Re: Reordering DISTINCT keys to match input path's pathkeys

From

Andrei Lepikhov

Date:

22 November 2024, 11:39:02

On 11/13/24 13:49, Richard Guo wrote:
> On Mon, Oct 28, 2024 at 6:15 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
>> On 6/7/24 16:46, Richard Guo wrote:
> I'm not sure about merging these two 'reordering' GUCs into one.
> While they may look similar, they apply to very different scenarios.
> However, I'm open to other suggestions.
I finally have passed through the code. It generally looks OK. Let me 
only write a few words on tests.
I wonder if it would be possible to print only three rows instead of 10 
to prove the DISTINCT's correctness.
Also, to stabilise tests with parallel workers, I would recommend 
setting max_parallel_workers_per_gather into 1 and enabling 
debug_parallel_query.

-- 
regards, Andrei Lepikhov

Re: Reordering DISTINCT keys to match input path's pathkeys

From

Richard Guo

Date:

26 November 2024, 03:43:13

On Fri, Nov 22, 2024 at 5:40 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
> I wonder if it would be possible to print only three rows instead of 10
> to prove the DISTINCT's correctness.

There are ten distinct values in the 'distinct_tbl' test table, so
I think it’d better to print all of them to verify correctness.

> Also, to stabilise tests with parallel workers, I would recommend
> setting max_parallel_workers_per_gather into 1 and enabling
> debug_parallel_query.

Since we're testing sort-based DISTINCT, the results will always be
well-ordered, so they should be pretty stable as is.

I've applied some of the changes you suggested in your previous email
and pushed the updated patch.  Thank you for your review.

Thanks
Richard

Re: Reordering DISTINCT keys to match input path's pathkeys

From

Richard Guo

Date:

26 November 2024, 04:10:07

On Tue, Nov 26, 2024 at 9:43 AM Richard Guo <guofenglinux@gmail.com> wrote:
> I've applied some of the changes you suggested in your previous email
> and pushed the updated patch.  Thank you for your review.

Ah, the buildfarm is complaining.  I shouldn't create tables with the
same name in two test files.  Fixing ...

Thanks
Richard

Re: Reordering DISTINCT keys to match input path's pathkeys

From

Tom Lane

Date:

26 November 2024, 04:12:29

Richard Guo <guofenglinux@gmail.com> writes:
> I've applied some of the changes you suggested in your previous email
> and pushed the updated patch.  Thank you for your review.

The buildfarm's not too happy.  It looks like this patch decided
to use the name "distinct_tbl" in a test script that runs in
parallel with another one already using "distinct_tbl".

            regards, tom lane

Re: Reordering DISTINCT keys to match input path's pathkeys

From

Richard Guo

Date:

26 November 2024, 04:22:36

On Tue, Nov 26, 2024 at 10:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
> > I've applied some of the changes you suggested in your previous email
> > and pushed the updated patch.  Thank you for your review.
>
> The buildfarm's not too happy.  It looks like this patch decided
> to use the name "distinct_tbl" in a test script that runs in
> parallel with another one already using "distinct_tbl".

Exactly.  I shouldn't create tables with the same name in two test
files.  Will push a fix shortly.

Thanks
Richard

Re: Reordering DISTINCT keys to match input path's pathkeys

From

Richard Guo

Date:

26 November 2024, 08:56:53

On Tue, Nov 26, 2024 at 10:22 AM Richard Guo <guofenglinux@gmail.com> wrote:
> On Tue, Nov 26, 2024 at 10:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Richard Guo <guofenglinux@gmail.com> writes:
> > > I've applied some of the changes you suggested in your previous email
> > > and pushed the updated patch.  Thank you for your review.

> > The buildfarm's not too happy.  It looks like this patch decided
> > to use the name "distinct_tbl" in a test script that runs in
> > parallel with another one already using "distinct_tbl".

> Exactly.  I shouldn't create tables with the same name in two test
> files.  Will push a fix shortly.

Fixed.

Thanks
Richard