Thread: Re: Pathify RHS unique-ification for semijoin planning

Re: Pathify RHS unique-ification for semijoin planning

From

Andy Fan

Date:

22 May, 12:27:48

Richard Guo <guofenglinux@gmail.com> writes:

Hi,

> However, in the case of sort-based implementation,
> this function pays no attention to the subpath's pathkeys or the
> pathkeys of the resulting output.

Good finding!

>
> In addition to this specific issue, it seems to me that there are
> other potential issues in create_unique_path().
>
> * Currently, we only consider the cheapest_total_path of the RHS when
> unique-ifying it.

I think it is better have a check the tuple_fraction for the startup_cost
factor, for some paths where the total cost is high but the required
fraction is lower.

> I think this may cause us to miss some optimization
> opportunities.  For example, there might be a path with a better sort
> order that isn't the cheapest-total one.  Such a path could help avoid
> a sort at a higher level, potentially resulting in a cheaper overall
> plan.

> * In create_unique_path(), we currently rely on heuristics to decide
> whether to use a hash-based or sort-based method.  I think a better
> approach would be to create paths for both methods and let add_path()
> determine which one is better, similar to how we handle path selection
> in other parts of the planner.
>
> Therefore, I'm thinking that maybe we could create a new RelOptInfo
> for the RHS rel to represent its unique-ified version, and then
> generate all worthwhile paths for it,

This sounds great for me. and I think we can keep the fraction
cheapest path on the new RelOptInfo as well, then all the things should
be on the way. 

> To be concrete, I'm envisioning something like the following:
>
>             if (bms_equal(sjinfo->syn_righthand, rel2->relids) &&
> -               create_unique_path(root, rel2, rel2->cheapest_total_path,
> -                                  sjinfo) != NULL)
> +               (rel2_unique = create_unique_rel(root, rel2, sjinfo)) != NULL)
>
> ...
>
> -               add_paths_to_joinrel(root, joinrel, rel1, rel2,
> -                                    JOIN_UNIQUE_INNER, sjinfo,
> +               add_paths_to_joinrel(root, joinrel, rel1, rel2_unique,
> +                                    JOIN_INNER, sjinfo,
>                                      restrictlist);
> -               add_paths_to_joinrel(root, joinrel, rel2, rel1,
> -                                    JOIN_UNIQUE_OUTER, sjinfo,
> +               add_paths_to_joinrel(root, joinrel, rel2_unique, rel1,
> +                                    JOIN_INNER, sjinfo,
>                                      restrictlist);
>
> In addition, by changing the join from "rel1" and "rel2" using
> JOIN_UNIQUE_OUTER or JOIN_UNIQUE_INNER to a join between "rel1" and
> "rel2_unique" using a standard JOIN_INNER, we might be able to get
> rid of the JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER jointypes.

if we can, +10.

-- 
Best Regards
Andy Fan

Re: Pathify RHS unique-ification for semijoin planning

From

wenhui qiu

Date:

22 May, 16:51:32

On Thu, 22 May 2025 at 17:28, Andy Fan <zhihuifan1213@163.com> wrote:

Richard Guo <guofenglinux@gmail.com> writes:

Hi,

> However, in the case of sort-based implementation,
> this function pays no attention to the subpath's pathkeys or the
> pathkeys of the resulting output.

Good finding!

>
> In addition to this specific issue, it seems to me that there are
> other potential issues in create_unique_path().
>
> * Currently, we only consider the cheapest_total_path of the RHS when
> unique-ifying it.

I think it is better have a check the tuple_fraction for the startup_cost
factor, for some paths where the total cost is high but the required
fraction is lower.

> I think this may cause us to miss some optimization
> opportunities. For example, there might be a path with a better sort
> order that isn't the cheapest-total one. Such a path could help avoid
> a sort at a higher level, potentially resulting in a cheaper overall
> plan.

> * In create_unique_path(), we currently rely on heuristics to decide
> whether to use a hash-based or sort-based method. I think a better
> approach would be to create paths for both methods and let add_path()
> determine which one is better, similar to how we handle path selection
> in other parts of the planner.
>
> Therefore, I'm thinking that maybe we could create a new RelOptInfo
> for the RHS rel to represent its unique-ified version, and then
> generate all worthwhile paths for it,

This sounds great for me. and I think we can keep the fraction
cheapest path on the new RelOptInfo as well, then all the things should
be on the way.

> To be concrete, I'm envisioning something like the following:
>
> if (bms_equal(sjinfo->syn_righthand, rel2->relids) &&
> - create_unique_path(root, rel2, rel2->cheapest_total_path,
> - sjinfo) != NULL)
> + (rel2_unique = create_unique_rel(root, rel2, sjinfo)) != NULL)
>
> ...
>
> - add_paths_to_joinrel(root, joinrel, rel1, rel2,
> - JOIN_UNIQUE_INNER, sjinfo,
> + add_paths_to_joinrel(root, joinrel, rel1, rel2_unique,
> + JOIN_INNER, sjinfo,
> restrictlist);
> - add_paths_to_joinrel(root, joinrel, rel2, rel1,
> - JOIN_UNIQUE_OUTER, sjinfo,
> + add_paths_to_joinrel(root, joinrel, rel2_unique, rel1,
> + JOIN_INNER, sjinfo,
> restrictlist);
>
> In addition, by changing the join from "rel1" and "rel2" using
> JOIN_UNIQUE_OUTER or JOIN_UNIQUE_INNER to a join between "rel1" and
> "rel2_unique" using a standard JOIN_INNER, we might be able to get
> rid of the JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER jointypes.

>>if we can, +10.

Agree

Pls kindly release a path for this？