Re: Pathify RHS unique-ification for semijoin planning - Mailing list pgsql-hackers

From wenhui qiu
Subject Re: Pathify RHS unique-ification for semijoin planning
Date
Msg-id CAGjGUALMmJySHX=PdbNooCZ0u-VO6aiDHgodS0YPxMQ1xdn-aw@mail.gmail.com
Whole thread Raw
In response to Re: Pathify RHS unique-ification for semijoin planning  (Richard Guo <guofenglinux@gmail.com>)
Responses Re: Pathify RHS unique-ification for semijoin planning
List pgsql-hackers
HI 

> This is a very interesting observation.  In fact, with the original v5
>  patch, you can produce both plans by setting enable_hashagg on and
> off.

>  set enable_hashagg to on;
>  Incremental Sort  (cost=91.95..277.59 rows=2500 width=16)
>                    (actual time=31.960..147.040 rows=90000.00 loops=1)
> 
> set enable_hashagg to off;
 > Merge Join  (cost=70.14..294.34 rows=2500 width=16)
 >             (actual time=4.303..71.891 rows=90000.00 loops=1)
> 
> This seems to be another case where the planner chooses a suboptimal
>  plan due to inaccurate cost estimates.
Agree ,I increased some rows , set enable_hashagg to on and off ,There's no difference in execution time. The execution plan is the same.

> #define IS_UNIQUEIFIED_REL(rel, sjinfo, nominal_jointype) \
 >   ((nominal_jointype) == JOIN_INNER && (sjinfo)->jointype == JOIN_SEMI && \
>     bms_equal((sjinfo)->syn_righthand, (rel)->relids))
>
>... and then the check in final_cost_hashjoin() becomes:
>
 >   if (IS_UNIQUEIFIED_REL(inner_path->parent, extra->sjinfo,
>                           path->jpath.jointype))
>    {
>        innerbucketsize = 1.0 / virtualbuckets;
 >       innermcvfreq = 0.0;
>    }
>
>Would this be a better approach?  Any thoughts?

This approach does indeed more accurately capture the fact that the relation has been unique-ified, especially in cases where a semi join has been transformed into an inner join. Compared to the current heuristic checks in costsize.c that rely on inner_path->rows, this method is more semantically meaningful and robust.

On Thu, Jul 31, 2025 at 4:58 PM Richard Guo <guofenglinux@gmail.com> wrote:
On Thu, Jul 31, 2025 at 1:08 PM Richard Guo <guofenglinux@gmail.com> wrote:
> On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang
> <alexandra.wang.oss@gmail.com> wrote:
> > While looking at the code, I also had a question about the following
> > changes in costsize.c:
> >
> > --- a/src/backend/optimizer/path/costsize.c
> > +++ b/src/backend/optimizer/path/costsize.c
> > @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
> >   * The whole issue is moot if we are working from a unique-ified outer
> >   * input, or if we know we don't need to mark/restore at all.
> >   */
> > - if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
> > + if (IsA(outer_path, UniquePath) ||
> > + IsA(outer_path, AggPath) ||
> > + path->skip_mark_restore)
> >
> > and
> >
> > @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
> >   * because we avoid contaminating the cache with a value that's wrong for
> >   * non-unique-ified paths.
> >   */
> > - if (IsA(inner_path, UniquePath))
> > + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath))
> >
> > I'm curious why AggPath was added in these two cases.

> Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have
> some special cases when the inner or outer relation has been
> unique-ified.  Previously, it was sufficient to check whether the path
> was a UniquePath, since both hash-based and sort-based implementations
> were represented that way.  However, with this patch, UniquePath now
> only represents the sort-based implementation, so we also need to
> check for AggPath to account for the hash-based case.

BTW, maybe a better way to determine whether a relation has been
unique-ified is to check that the nominal jointype is JOIN_INNER and
sjinfo->jointype is JOIN_SEMI, and the relation is exactly the RHS of
the semijoin.  This approach is mentioned in a comment in joinpath.c:

 * Path cost estimation code may need to recognize that it's
 * dealing with such a case --- the combination of nominal jointype INNER
 * with sjinfo->jointype == JOIN_SEMI indicates that.

... but it seems we don't currently apply it in costsize.c.

To be concrete, I'm imagining a check like the following:

#define IS_UNIQUEIFIED_REL(rel, sjinfo, nominal_jointype) \
    ((nominal_jointype) == JOIN_INNER && (sjinfo)->jointype == JOIN_SEMI && \
     bms_equal((sjinfo)->syn_righthand, (rel)->relids))

... and then the check in final_cost_hashjoin() becomes:

    if (IS_UNIQUEIFIED_REL(inner_path->parent, extra->sjinfo,
                           path->jpath.jointype))
    {
        innerbucketsize = 1.0 / virtualbuckets;
        innermcvfreq = 0.0;
    }

Would this be a better approach?  Any thoughts?

Thanks
Richard

pgsql-hackers by date:

Previous
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: POC: enable logical decoding when wal_level = 'replica' without a server restart
Next
From: Peter Eisentraut
Date:
Subject: Convert varatt.h macros to static inline functions