Home > mailing lists

Re: Pathify RHS unique-ification for semijoin planning - Mailing list pgsql-hackers

From	Alexandra Wang
Subject	Re: Pathify RHS unique-ification for semijoin planning
Date	August 1 17:58:01
Msg-id	CAK98qZ3qGjCEZAyeOCwQFY-Y4CMkPjN83xkgij1yx76R1=vBvw@mail.gmail.com Whole thread Raw
In response to	Re: Pathify RHS unique-ification for semijoin planning (Richard Guo <guofenglinux@gmail.com>)
List	pgsql-hackers

Tree view

Hi Richard,

On Wed, Jul 30, 2025 at 9:08 PM Richard Guo <guofenglinux@gmail.com> wrote:

On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang
<alexandra.wang.oss@gmail.com> wrote:
> Thanks for the patch! I applied your patch and played around with it.

Thanks for trying out this patch.

> I have a question about the following test case you added in
> subselect.sql:

> I was under the impression that we wanted Unique on top of a sorted
> node for the inner of the SIMI join. However, the expected output uses
> a HashAgg instead. Is this expected?

Hmm, I don't think we have such expectation that "Sort+Unique" should
be used for the unique-ification step in this query. This patch
considers both hash-based and sort-based implementations, and lets the
cost comparison determine which one is preferable. So I wouldn't be
surprised if the planner ends up choosing the hash-based
implementation in the final plan.

> While looking at the code, I also had a question about the following
> changes in costsize.c:
>
> --- a/src/backend/optimizer/path/costsize.c
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
> * The whole issue is moot if we are working from a unique-ified outer
> * input, or if we know we don't need to mark/restore at all.
> */
> - if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
> + if (IsA(outer_path, UniquePath) ||
> + IsA(outer_path, AggPath) ||
> + path->skip_mark_restore)
>
> and
>
> @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
> * because we avoid contaminating the cache with a value that's wrong for
> * non-unique-ified paths.
> */
> - if (IsA(inner_path, UniquePath))
> + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath))
>
> I'm curious why AggPath was added in these two cases.

Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have
some special cases when the inner or outer relation has been
unique-ified. Previously, it was sufficient to check whether the path
was a UniquePath, since both hash-based and sort-based implementations
were represented that way. However, with this patch, UniquePath now
only represents the sort-based implementation, so we also need to
check for AggPath to account for the hash-based case.

> The second diff looks fine to me. However, the first diff in the
> subselect test happened to be the test that I asked about.
>
> Here's the comparison of the two EXPLAIN ANALYZE results:

> While both runs are fast (28ms vs. 17ms), the plan generated by the v5
> patch is slower in this case.

This is a very interesting observation. In fact, with the original v5
patch, you can produce both plans by setting enable_hashagg on and
off.

set enable_hashagg to on;
Incremental Sort (cost=91.95..277.59 rows=2500 width=16)
(actual time=31.960..147.040 rows=90000.00 loops=1)

set enable_hashagg to off;
Merge Join (cost=70.14..294.34 rows=2500 width=16)
(actual time=4.303..71.891 rows=90000.00 loops=1)

This seems to be another case where the planner chooses a suboptimal
plan due to inaccurate cost estimates.

Thanks for explaining! It makes sense now!

While reviewing the code, the following diff concerns me:

/*
* If the joinrel is parallel-safe, we may be able to consider a partial
- * merge join. However, we can't handle JOIN_UNIQUE_OUTER, because the
- * outer path will be partial, and therefore we won't be able to properly
- * guarantee uniqueness. Similarly, we can't handle JOIN_FULL, JOIN_RIGHT
- * and JOIN_RIGHT_ANTI, because they can produce false null extended rows.
+ * merge join. However, we can't handle JOIN_FULL, JOIN_RIGHT and
+ * JOIN_RIGHT_ANTI, because they can produce false null extended rows.
* Also, the resulting path must not be parameterized.
*/
if (joinrel->consider_parallel &&
- save_jointype != JOIN_UNIQUE_OUTER &&
- save_jointype != JOIN_FULL &&
- save_jointype != JOIN_RIGHT &&
- save_jointype != JOIN_RIGHT_ANTI &&
+ jointype != JOIN_FULL &&
+ jointype != JOIN_RIGHT &&
+ jointype != JOIN_RIGHT_ANTI &&
outerrel->partial_pathlist != NIL &&
bms_is_empty(joinrel->lateral_relids))

What has changed that enabled JOIN_UNIQUE_OUTER to handle partial
paths? Or is it no longer possible for the outer path to be partial?

Best,

Alex

pgsql-hackers by date:

From: Tender Wang
Date: 01 August, 17:45:20
Subject: Re: A little cosmetic to convert_VALUES_to_ANY()

From: "Zhijie Hou (Fujitsu)"
Date: 01 August, 18:03:26
Subject: RE: Add support for specifying tables in pg_createsubscriber.

Re: Pathify RHS unique-ification for semijoin planning - Mailing list pgsql-hackers

Previous

Next