Re: Pathify RHS unique-ification for semijoin planning - Mailing list pgsql-hackers
From | Alexandra Wang |
---|---|
Subject | Re: Pathify RHS unique-ification for semijoin planning |
Date | |
Msg-id | CAK98qZ3qGjCEZAyeOCwQFY-Y4CMkPjN83xkgij1yx76R1=vBvw@mail.gmail.com Whole thread Raw |
In response to | Re: Pathify RHS unique-ification for semijoin planning (Richard Guo <guofenglinux@gmail.com>) |
List | pgsql-hackers |
Hi Richard,
On Wed, Jul 30, 2025 at 9:08 PM Richard Guo <guofenglinux@gmail.com> wrote:
On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang
<alexandra.wang.oss@gmail.com> wrote:
> Thanks for the patch! I applied your patch and played around with it.
Thanks for trying out this patch.
> I have a question about the following test case you added in
> subselect.sql:
> I was under the impression that we wanted Unique on top of a sorted
> node for the inner of the SIMI join. However, the expected output uses
> a HashAgg instead. Is this expected?
Hmm, I don't think we have such expectation that "Sort+Unique" should
be used for the unique-ification step in this query. This patch
considers both hash-based and sort-based implementations, and lets the
cost comparison determine which one is preferable. So I wouldn't be
surprised if the planner ends up choosing the hash-based
implementation in the final plan.
> While looking at the code, I also had a question about the following
> changes in costsize.c:
>
> --- a/src/backend/optimizer/path/costsize.c
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
> * The whole issue is moot if we are working from a unique-ified outer
> * input, or if we know we don't need to mark/restore at all.
> */
> - if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
> + if (IsA(outer_path, UniquePath) ||
> + IsA(outer_path, AggPath) ||
> + path->skip_mark_restore)
>
> and
>
> @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
> * because we avoid contaminating the cache with a value that's wrong for
> * non-unique-ified paths.
> */
> - if (IsA(inner_path, UniquePath))
> + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath))
>
> I'm curious why AggPath was added in these two cases.
Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have
some special cases when the inner or outer relation has been
unique-ified. Previously, it was sufficient to check whether the path
was a UniquePath, since both hash-based and sort-based implementations
were represented that way. However, with this patch, UniquePath now
only represents the sort-based implementation, so we also need to
check for AggPath to account for the hash-based case.
> The second diff looks fine to me. However, the first diff in the
> subselect test happened to be the test that I asked about.
>
> Here's the comparison of the two EXPLAIN ANALYZE results:
> While both runs are fast (28ms vs. 17ms), the plan generated by the v5
> patch is slower in this case.
This is a very interesting observation. In fact, with the original v5
patch, you can produce both plans by setting enable_hashagg on and
off.
set enable_hashagg to on;
Incremental Sort (cost=91.95..277.59 rows=2500 width=16)
(actual time=31.960..147.040 rows=90000.00 loops=1)
set enable_hashagg to off;
Merge Join (cost=70.14..294.34 rows=2500 width=16)
(actual time=4.303..71.891 rows=90000.00 loops=1)
This seems to be another case where the planner chooses a suboptimal
plan due to inaccurate cost estimates.
Thanks for explaining! It makes sense now!
While reviewing the code, the following diff concerns me:
/*
* If the joinrel is parallel-safe, we may be able to consider a partial
- * merge join. However, we can't handle JOIN_UNIQUE_OUTER, because the
- * outer path will be partial, and therefore we won't be able to properly
- * guarantee uniqueness. Similarly, we can't handle JOIN_FULL, JOIN_RIGHT
- * and JOIN_RIGHT_ANTI, because they can produce false null extended rows.
+ * merge join. However, we can't handle JOIN_FULL, JOIN_RIGHT and
+ * JOIN_RIGHT_ANTI, because they can produce false null extended rows.
* Also, the resulting path must not be parameterized.
*/
if (joinrel->consider_parallel &&
- save_jointype != JOIN_UNIQUE_OUTER &&
- save_jointype != JOIN_FULL &&
- save_jointype != JOIN_RIGHT &&
- save_jointype != JOIN_RIGHT_ANTI &&
+ jointype != JOIN_FULL &&
+ jointype != JOIN_RIGHT &&
+ jointype != JOIN_RIGHT_ANTI &&
outerrel->partial_pathlist != NIL &&
bms_is_empty(joinrel->lateral_relids))
* If the joinrel is parallel-safe, we may be able to consider a partial
- * merge join. However, we can't handle JOIN_UNIQUE_OUTER, because the
- * outer path will be partial, and therefore we won't be able to properly
- * guarantee uniqueness. Similarly, we can't handle JOIN_FULL, JOIN_RIGHT
- * and JOIN_RIGHT_ANTI, because they can produce false null extended rows.
+ * merge join. However, we can't handle JOIN_FULL, JOIN_RIGHT and
+ * JOIN_RIGHT_ANTI, because they can produce false null extended rows.
* Also, the resulting path must not be parameterized.
*/
if (joinrel->consider_parallel &&
- save_jointype != JOIN_UNIQUE_OUTER &&
- save_jointype != JOIN_FULL &&
- save_jointype != JOIN_RIGHT &&
- save_jointype != JOIN_RIGHT_ANTI &&
+ jointype != JOIN_FULL &&
+ jointype != JOIN_RIGHT &&
+ jointype != JOIN_RIGHT_ANTI &&
outerrel->partial_pathlist != NIL &&
bms_is_empty(joinrel->lateral_relids))
What has changed that enabled JOIN_UNIQUE_OUTER to handle partial
paths? Or is it no longer possible for the outer path to be partial?
paths? Or is it no longer possible for the outer path to be partial?
Best,
Alex
pgsql-hackers by date: