Re: Pathify RHS unique-ification for semijoin planning - Mailing list pgsql-hackers
From | Richard Guo |
---|---|
Subject | Re: Pathify RHS unique-ification for semijoin planning |
Date | |
Msg-id | CAMbWs4-yHUZypwzLaX1Y4k1x+nvQcvfZGYFrm5eFAO7NaSuSbA@mail.gmail.com Whole thread Raw |
In response to | Re: Pathify RHS unique-ification for semijoin planning (Alexandra Wang <alexandra.wang.oss@gmail.com>) |
Responses |
Re: Pathify RHS unique-ification for semijoin planning
Re: Pathify RHS unique-ification for semijoin planning |
List | pgsql-hackers |
On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang <alexandra.wang.oss@gmail.com> wrote: > Thanks for the patch! I applied your patch and played around with it. Thanks for trying out this patch. > I have a question about the following test case you added in > subselect.sql: > I was under the impression that we wanted Unique on top of a sorted > node for the inner of the SIMI join. However, the expected output uses > a HashAgg instead. Is this expected? Hmm, I don't think we have such expectation that "Sort+Unique" should be used for the unique-ification step in this query. This patch considers both hash-based and sort-based implementations, and lets the cost comparison determine which one is preferable. So I wouldn't be surprised if the planner ends up choosing the hash-based implementation in the final plan. > While looking at the code, I also had a question about the following > changes in costsize.c: > > --- a/src/backend/optimizer/path/costsize.c > +++ b/src/backend/optimizer/path/costsize.c > @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path, > * The whole issue is moot if we are working from a unique-ified outer > * input, or if we know we don't need to mark/restore at all. > */ > - if (IsA(outer_path, UniquePath) || path->skip_mark_restore) > + if (IsA(outer_path, UniquePath) || > + IsA(outer_path, AggPath) || > + path->skip_mark_restore) > > and > > @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path, > * because we avoid contaminating the cache with a value that's wrong for > * non-unique-ified paths. > */ > - if (IsA(inner_path, UniquePath)) > + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath)) > > I'm curious why AggPath was added in these two cases. Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have some special cases when the inner or outer relation has been unique-ified. Previously, it was sufficient to check whether the path was a UniquePath, since both hash-based and sort-based implementations were represented that way. However, with this patch, UniquePath now only represents the sort-based implementation, so we also need to check for AggPath to account for the hash-based case. > The second diff looks fine to me. However, the first diff in the > subselect test happened to be the test that I asked about. > > Here's the comparison of the two EXPLAIN ANALYZE results: > While both runs are fast (28ms vs. 17ms), the plan generated by the v5 > patch is slower in this case. This is a very interesting observation. In fact, with the original v5 patch, you can produce both plans by setting enable_hashagg on and off. set enable_hashagg to on; Incremental Sort (cost=91.95..277.59 rows=2500 width=16) (actual time=31.960..147.040 rows=90000.00 loops=1) set enable_hashagg to off; Merge Join (cost=70.14..294.34 rows=2500 width=16) (actual time=4.303..71.891 rows=90000.00 loops=1) This seems to be another case where the planner chooses a suboptimal plan due to inaccurate cost estimates. Thanks Richard
pgsql-hackers by date: