Re: Pathify RHS unique-ification for semijoin planning - Mailing list pgsql-hackers

From Richard Guo
Subject Re: Pathify RHS unique-ification for semijoin planning
Date
Msg-id CAMbWs4-yHUZypwzLaX1Y4k1x+nvQcvfZGYFrm5eFAO7NaSuSbA@mail.gmail.com
Whole thread Raw
In response to Re: Pathify RHS unique-ification for semijoin planning  (Alexandra Wang <alexandra.wang.oss@gmail.com>)
Responses Re: Pathify RHS unique-ification for semijoin planning
Re: Pathify RHS unique-ification for semijoin planning
List pgsql-hackers
On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang
<alexandra.wang.oss@gmail.com> wrote:
> Thanks for the patch! I applied your patch and played around with it.

Thanks for trying out this patch.

> I have a question about the following test case you added in
> subselect.sql:

> I was under the impression that we wanted Unique on top of a sorted
> node for the inner of the SIMI join. However, the expected output uses
> a HashAgg instead. Is this expected?

Hmm, I don't think we have such expectation that "Sort+Unique" should
be used for the unique-ification step in this query.  This patch
considers both hash-based and sort-based implementations, and lets the
cost comparison determine which one is preferable.  So I wouldn't be
surprised if the planner ends up choosing the hash-based
implementation in the final plan.

> While looking at the code, I also had a question about the following
> changes in costsize.c:
>
> --- a/src/backend/optimizer/path/costsize.c
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
>   * The whole issue is moot if we are working from a unique-ified outer
>   * input, or if we know we don't need to mark/restore at all.
>   */
> - if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
> + if (IsA(outer_path, UniquePath) ||
> + IsA(outer_path, AggPath) ||
> + path->skip_mark_restore)
>
> and
>
> @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
>   * because we avoid contaminating the cache with a value that's wrong for
>   * non-unique-ified paths.
>   */
> - if (IsA(inner_path, UniquePath))
> + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath))
>
> I'm curious why AggPath was added in these two cases.

Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have
some special cases when the inner or outer relation has been
unique-ified.  Previously, it was sufficient to check whether the path
was a UniquePath, since both hash-based and sort-based implementations
were represented that way.  However, with this patch, UniquePath now
only represents the sort-based implementation, so we also need to
check for AggPath to account for the hash-based case.

> The second diff looks fine to me. However, the first diff in the
> subselect test happened to be the test that I asked about.
>
> Here's the comparison of the two EXPLAIN ANALYZE results:

> While both runs are fast (28ms vs. 17ms), the plan generated by the v5
> patch is slower in this case.

This is a very interesting observation.  In fact, with the original v5
patch, you can produce both plans by setting enable_hashagg on and
off.

set enable_hashagg to on;
 Incremental Sort  (cost=91.95..277.59 rows=2500 width=16)
                   (actual time=31.960..147.040 rows=90000.00 loops=1)

set enable_hashagg to off;
 Merge Join  (cost=70.14..294.34 rows=2500 width=16)
             (actual time=4.303..71.891 rows=90000.00 loops=1)

This seems to be another case where the planner chooses a suboptimal
plan due to inaccurate cost estimates.

Thanks
Richard



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: track generic and custom plans in pg_stat_statements
Next
From: Peter Smith
Date:
Subject: Re: [WIP]Vertical Clustered Index (columnar store extension) - take2