Thread: Reordering DISTINCT keys to match input path's pathkeys
Similar to what we did to GROUP BY keys in 0452b461bc, I think we can do
the same to DISTINCT keys, i.e. reordering DISTINCT keys to match input
path's pathkeys, which can help reduce cost by avoiding unnecessary
re-sort, or allowing us to use incremental-sort to save efforts. For
instance,
create table t (a int, b int);
create index on t (a, b);
explain (costs off) select distinct b, a from t limit 10;
QUERY PLAN
--------------------------------------------------
Limit
-> Unique
-> Index Only Scan using t_a_b_idx on t
(3 rows)
Please note that the parser has ensured that the DISTINCT pathkeys
matches the order of ORDER BY clauses. So there is no need to do this
part again.
In principle, we can perform such reordering for DISTINCT ON too, but we
need to make sure that the resulting pathkeys matches initial ORDER BY
keys, which seems not trivial. So it doesn't seem worth the effort.
Attached is a patch for this optimization. Any thoughts?
Thanks
Richard
the same to DISTINCT keys, i.e. reordering DISTINCT keys to match input
path's pathkeys, which can help reduce cost by avoiding unnecessary
re-sort, or allowing us to use incremental-sort to save efforts. For
instance,
create table t (a int, b int);
create index on t (a, b);
explain (costs off) select distinct b, a from t limit 10;
QUERY PLAN
--------------------------------------------------
Limit
-> Unique
-> Index Only Scan using t_a_b_idx on t
(3 rows)
Please note that the parser has ensured that the DISTINCT pathkeys
matches the order of ORDER BY clauses. So there is no need to do this
part again.
In principle, we can perform such reordering for DISTINCT ON too, but we
need to make sure that the resulting pathkeys matches initial ORDER BY
keys, which seems not trivial. So it doesn't seem worth the effort.
Attached is a patch for this optimization. Any thoughts?
Thanks
Richard
Attachment
On Tue, 2024-01-23 at 13:55 +0800, Richard Guo wrote: > Similar to what we did to GROUP BY keys in 0452b461bc, I think we can do > the same to DISTINCT keys, i.e. reordering DISTINCT keys to match input > path's pathkeys, which can help reduce cost by avoiding unnecessary > re-sort, or allowing us to use incremental-sort to save efforts. > > Attached is a patch for this optimization. Any thoughts? I didn't scrutinize the code, but that sounds like a fine optimization. Yours, Laurenz Albe
On Tue, 23 Jan 2024 at 18:56, Richard Guo <guofenglinux@gmail.com> wrote: > > Similar to what we did to GROUP BY keys in 0452b461bc, I think we can do > the same to DISTINCT keys, i.e. reordering DISTINCT keys to match input > path's pathkeys, which can help reduce cost by avoiding unnecessary > re-sort, or allowing us to use incremental-sort to save efforts. For > instance, I've not caught up on the specifics of 0452b461b, but I just wanted to highlight that there was some work done in [1] in this area. It seems Ankit didn't ever add that to a CF, so that might explain why it's been lost. Anyway, just pointing it out as there may be useful code or discussion in the corresponding threads. David [1] https://postgr.es/m/da9425ae-8ff7-33d9-23b3-2a3eb605e106%40gmail.com
On Tue, Jan 23, 2024 at 5:03 PM David Rowley <dgrowleyml@gmail.com> wrote:
I've not caught up on the specifics of 0452b461b, but I just wanted to
highlight that there was some work done in [1] in this area. It seems
Ankit didn't ever add that to a CF, so that might explain why it's
been lost.
Anyway, just pointing it out as there may be useful code or discussion
in the corresponding threads.
Thanks for pointing it out. I looked at the patch there and noticed
several problems with it.
* That patch is incomplete and does not work as expected. It at least
needs to modify truncate_useless_pathkeys() to account for DISTINCT
clause (I think this has been mentioned in that thread).
* That patch would not consider the origin DISTINCT pathkeys if it could
do some reordering, which is not great and can generate inefficient
plans. For instance (after fixing the first problem)
create table t (a int, b int);
create index on t(a);
set enable_hashagg to off;
set enable_incremental_sort to off;
set enable_seqscan to off;
explain (costs off) select distinct b, a from t order by b, a;
QUERY PLAN
-------------------------------------------------
Sort
Sort Key: b, a
-> Unique
-> Sort
Sort Key: a, b
-> Index Scan using t_a_idx on t
(6 rows)
Using DISTINCT pathkeys {b, a} is more efficient for this plan, because
only one Sort would be required. But that patch is not able to do that,
because it does not consider the origin DISTINCT pathkeys after
reordering.
Thanks
Richard
several problems with it.
* That patch is incomplete and does not work as expected. It at least
needs to modify truncate_useless_pathkeys() to account for DISTINCT
clause (I think this has been mentioned in that thread).
* That patch would not consider the origin DISTINCT pathkeys if it could
do some reordering, which is not great and can generate inefficient
plans. For instance (after fixing the first problem)
create table t (a int, b int);
create index on t(a);
set enable_hashagg to off;
set enable_incremental_sort to off;
set enable_seqscan to off;
explain (costs off) select distinct b, a from t order by b, a;
QUERY PLAN
-------------------------------------------------
Sort
Sort Key: b, a
-> Unique
-> Sort
Sort Key: a, b
-> Index Scan using t_a_idx on t
(6 rows)
Using DISTINCT pathkeys {b, a} is more efficient for this plan, because
only one Sort would be required. But that patch is not able to do that,
because it does not consider the origin DISTINCT pathkeys after
reordering.
Thanks
Richard
cfbot reminds that this patch does not apply any more. So I've rebased
it on master, and also adjusted the test cases a bit.
Thanks
Richard
it on master, and also adjusted the test cases a bit.
Thanks
Richard
Attachment
On Mon, Feb 5, 2024 at 11:18 AM Richard Guo <guofenglinux@gmail.com> wrote: > cfbot reminds that this patch does not apply any more. So I've rebased > it on master, and also adjusted the test cases a bit. This patch does not apply any more, so here is a new rebase, with some tweaks to the comments. Thanks Richard
Attachment
On 6/7/24 16:46, Richard Guo wrote: > On Mon, Feb 5, 2024 at 11:18 AM Richard Guo <guofenglinux@gmail.com> wrote: >> cfbot reminds that this patch does not apply any more. So I've rebased >> it on master, and also adjusted the test cases a bit. > > This patch does not apply any more, so here is a new rebase, with some > tweaks to the comments. This patch needs a minor rebase again. After skimming the code, I want to say that it looks good. But maybe to avoid one more *_reordering GUC - it would be better to cover all path key reorderings under a single GUC. -- regards, Andrei Lepikhov
On 11/13/24 13:49, Richard Guo wrote: > On Mon, Oct 28, 2024 at 6:15 PM Andrei Lepikhov <lepihov@gmail.com> wrote: >> On 6/7/24 16:46, Richard Guo wrote: >>> This patch does not apply any more, so here is a new rebase, with some >>> tweaks to the comments. > >> This patch needs a minor rebase again. >> After skimming the code, I want to say that it looks good. But maybe to >> avoid one more *_reordering GUC - it would be better to cover all path >> key reorderings under a single GUC. > > Thanks for reviewing this patch. After some consideration, I think > it's not too complex to also apply this optimization to DISTINCT ON. > The parser already ensures that the DISTINCT ON expressions match the > initial ORDER BY expressions; we just need to ensure that the > resulting pathkey list from the reordering matches the original > distinctClause pathkeys, while leaving the remaining pathkeys > unchanged in order. Please see attached. Thanks, I'll discover it later. BTW have you ever thought about one more, cost-based reordering strategy? For now, we can reorder GROUP-BY and distinct clauses according to two orderings: 1) ORDER-BY order and 2) order derived from the underlying query tree. In thread [1], I try to add one more strategy that minimises the number of comparison operator calls. It seems that it would work the same way with the DISTINCT statement. Do you think it make sense in general and can be a possible direction of improvement for the current patch? > > I'm not sure about merging these two 'reordering' GUCs into one. > While they may look similar, they apply to very different scenarios. > However, I'm open to other suggestions. Sure, they enable different optimisations. But, they enable highly specialised optimisations. Having two GUCs looks too expensive. Moreover, this stuff is cost-based and should work automatically. So, I treat these GUCs as mostly debugging or last-chance stuff used to disable features during severe slowdowns or bugs. It might make sense to group them into a single 'Clause Reordering' parameter. [1] https://www.postgresql.org/message-id/flat/8742aaa8-9519-4a1f-91bd-364aec65f5cf%40gmail.com -- regards, Andrei Lepikhov
On Wed, Nov 13, 2024 at 7:36 PM Andrei Lepikhov <lepihov@gmail.com> wrote: > On 11/13/24 13:49, Richard Guo wrote: > > Thanks for reviewing this patch. After some consideration, I think > > it's not too complex to also apply this optimization to DISTINCT ON. > > The parser already ensures that the DISTINCT ON expressions match the > > initial ORDER BY expressions; we just need to ensure that the > > resulting pathkey list from the reordering matches the original > > distinctClause pathkeys, while leaving the remaining pathkeys > > unchanged in order. Please see attached. > BTW have you ever thought about one more, cost-based reordering strategy? > For now, we can reorder GROUP-BY and distinct clauses according to two > orderings: 1) ORDER-BY order and 2) order derived from the underlying > query tree. > In thread [1], I try to add one more strategy that minimises the number > of comparison operator calls. It seems that it would work the same way > with the DISTINCT statement. Do you think it make sense in general and > can be a possible direction of improvement for the current patch? I haven’t had a chance to follow that thread. From a quick look at that patch, it seems to improve the general costing logic for sorting. If that’s the case, I think it would be beneficial in the areas where we use cost_sort(), including in this patch. > > I'm not sure about merging these two 'reordering' GUCs into one. > > While they may look similar, they apply to very different scenarios. > > However, I'm open to other suggestions. > Sure, they enable different optimisations. But, they enable highly > specialised optimisations. Having two GUCs looks too expensive. > Moreover, this stuff is cost-based and should work automatically. So, I > treat these GUCs as mostly debugging or last-chance stuff used to > disable features during severe slowdowns or bugs. It might make sense to > group them into a single 'Clause Reordering' parameter. I don't want to argue too much about this, but the two types of optimizations are quite different in terms of implementation details. For example, with DISTINCT ON, one key requirement is ensuring that the resulting pathkey list matches the original distinctClause pathkeys, and this logic doesn’t exist in the GROUP BY reordering code. If these GUCs are mainly for debugging, I think it's better to keep them separate so that we can debug each optimization individually. Thanks Richard
On 11/13/24 13:49, Richard Guo wrote: > On Mon, Oct 28, 2024 at 6:15 PM Andrei Lepikhov <lepihov@gmail.com> wrote: >> On 6/7/24 16:46, Richard Guo wrote: > I'm not sure about merging these two 'reordering' GUCs into one. > While they may look similar, they apply to very different scenarios. > However, I'm open to other suggestions. I finally have passed through the code. It generally looks OK. Let me only write a few words on tests. I wonder if it would be possible to print only three rows instead of 10 to prove the DISTINCT's correctness. Also, to stabilise tests with parallel workers, I would recommend setting max_parallel_workers_per_gather into 1 and enabling debug_parallel_query. -- regards, Andrei Lepikhov
On Fri, Nov 22, 2024 at 5:40 PM Andrei Lepikhov <lepihov@gmail.com> wrote: > I wonder if it would be possible to print only three rows instead of 10 > to prove the DISTINCT's correctness. There are ten distinct values in the 'distinct_tbl' test table, so I think it’d better to print all of them to verify correctness. > Also, to stabilise tests with parallel workers, I would recommend > setting max_parallel_workers_per_gather into 1 and enabling > debug_parallel_query. Since we're testing sort-based DISTINCT, the results will always be well-ordered, so they should be pretty stable as is. I've applied some of the changes you suggested in your previous email and pushed the updated patch. Thank you for your review. Thanks Richard
On Tue, Nov 26, 2024 at 9:43 AM Richard Guo <guofenglinux@gmail.com> wrote: > I've applied some of the changes you suggested in your previous email > and pushed the updated patch. Thank you for your review. Ah, the buildfarm is complaining. I shouldn't create tables with the same name in two test files. Fixing ... Thanks Richard
Richard Guo <guofenglinux@gmail.com> writes: > I've applied some of the changes you suggested in your previous email > and pushed the updated patch. Thank you for your review. The buildfarm's not too happy. It looks like this patch decided to use the name "distinct_tbl" in a test script that runs in parallel with another one already using "distinct_tbl". regards, tom lane
On Tue, Nov 26, 2024 at 10:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Richard Guo <guofenglinux@gmail.com> writes: > > I've applied some of the changes you suggested in your previous email > > and pushed the updated patch. Thank you for your review. > > The buildfarm's not too happy. It looks like this patch decided > to use the name "distinct_tbl" in a test script that runs in > parallel with another one already using "distinct_tbl". Exactly. I shouldn't create tables with the same name in two test files. Will push a fix shortly. Thanks Richard
On Tue, Nov 26, 2024 at 10:22 AM Richard Guo <guofenglinux@gmail.com> wrote: > On Tue, Nov 26, 2024 at 10:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Richard Guo <guofenglinux@gmail.com> writes: > > > I've applied some of the changes you suggested in your previous email > > > and pushed the updated patch. Thank you for your review. > > The buildfarm's not too happy. It looks like this patch decided > > to use the name "distinct_tbl" in a test script that runs in > > parallel with another one already using "distinct_tbl". > Exactly. I shouldn't create tables with the same name in two test > files. Will push a fix shortly. Fixed. Thanks Richard