Re: Pathify RHS unique-ification for semijoin planning - Mailing list pgsql-hackers

From Alexandra Wang
Subject Re: Pathify RHS unique-ification for semijoin planning
Date
Msg-id CAK98qZ3qGjCEZAyeOCwQFY-Y4CMkPjN83xkgij1yx76R1=vBvw@mail.gmail.com
Whole thread Raw
In response to Re: Pathify RHS unique-ification for semijoin planning  (Richard Guo <guofenglinux@gmail.com>)
List pgsql-hackers
Hi Richard,

On Wed, Jul 30, 2025 at 9:08 PM Richard Guo <guofenglinux@gmail.com> wrote:
On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang
<alexandra.wang.oss@gmail.com> wrote:
> Thanks for the patch! I applied your patch and played around with it.

Thanks for trying out this patch.

> I have a question about the following test case you added in
> subselect.sql:

> I was under the impression that we wanted Unique on top of a sorted
> node for the inner of the SIMI join. However, the expected output uses
> a HashAgg instead. Is this expected?

Hmm, I don't think we have such expectation that "Sort+Unique" should
be used for the unique-ification step in this query.  This patch
considers both hash-based and sort-based implementations, and lets the
cost comparison determine which one is preferable.  So I wouldn't be
surprised if the planner ends up choosing the hash-based
implementation in the final plan.

> While looking at the code, I also had a question about the following
> changes in costsize.c:
>
> --- a/src/backend/optimizer/path/costsize.c
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
>   * The whole issue is moot if we are working from a unique-ified outer
>   * input, or if we know we don't need to mark/restore at all.
>   */
> - if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
> + if (IsA(outer_path, UniquePath) ||
> + IsA(outer_path, AggPath) ||
> + path->skip_mark_restore)
>
> and
>
> @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
>   * because we avoid contaminating the cache with a value that's wrong for
>   * non-unique-ified paths.
>   */
> - if (IsA(inner_path, UniquePath))
> + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath))
>
> I'm curious why AggPath was added in these two cases.

Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have
some special cases when the inner or outer relation has been
unique-ified.  Previously, it was sufficient to check whether the path
was a UniquePath, since both hash-based and sort-based implementations
were represented that way.  However, with this patch, UniquePath now
only represents the sort-based implementation, so we also need to
check for AggPath to account for the hash-based case.

> The second diff looks fine to me. However, the first diff in the
> subselect test happened to be the test that I asked about.
>
> Here's the comparison of the two EXPLAIN ANALYZE results:

> While both runs are fast (28ms vs. 17ms), the plan generated by the v5
> patch is slower in this case.

This is a very interesting observation.  In fact, with the original v5
patch, you can produce both plans by setting enable_hashagg on and
off.

set enable_hashagg to on;
 Incremental Sort  (cost=91.95..277.59 rows=2500 width=16)
                   (actual time=31.960..147.040 rows=90000.00 loops=1)

set enable_hashagg to off;
 Merge Join  (cost=70.14..294.34 rows=2500 width=16)
             (actual time=4.303..71.891 rows=90000.00 loops=1)

This seems to be another case where the planner chooses a suboptimal
plan due to inaccurate cost estimates.
 
Thanks for explaining! It makes sense now! 

While reviewing the code, the following diff concerns me:

  /*
  * If the joinrel is parallel-safe, we may be able to consider a partial
- * merge join.  However, we can't handle JOIN_UNIQUE_OUTER, because the
- * outer path will be partial, and therefore we won't be able to properly
- * guarantee uniqueness.  Similarly, we can't handle JOIN_FULL, JOIN_RIGHT
- * and JOIN_RIGHT_ANTI, because they can produce false null extended rows.
+ * merge join.  However, we can't handle JOIN_FULL, JOIN_RIGHT and
+ * JOIN_RIGHT_ANTI, because they can produce false null extended rows.
  * Also, the resulting path must not be parameterized.
  */
  if (joinrel->consider_parallel &&
- save_jointype != JOIN_UNIQUE_OUTER &&
- save_jointype != JOIN_FULL &&
- save_jointype != JOIN_RIGHT &&
- save_jointype != JOIN_RIGHT_ANTI &&
+ jointype != JOIN_FULL &&
+ jointype != JOIN_RIGHT &&
+ jointype != JOIN_RIGHT_ANTI &&
  outerrel->partial_pathlist != NIL &&
  bms_is_empty(joinrel->lateral_relids))

What has changed that enabled JOIN_UNIQUE_OUTER to handle partial
paths? Or is it no longer possible for the outer path to be partial?

Best,
Alex

pgsql-hackers by date:

Previous
From: Tender Wang
Date:
Subject: Re: A little cosmetic to convert_VALUES_to_ANY()
Next
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: Add support for specifying tables in pg_createsubscriber.