Re: Performance improvement for joins where outer side is unique - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Performance improvement for joins where outer side is unique
Date
Msg-id 5671FF0D.2010403@2ndquadrant.com
Whole thread Raw
In response to Re: Performance improvement for joins where outer side is unique  (David Rowley <david.rowley@2ndquadrant.com>)
Responses Re: Performance improvement for joins where outer side is unique  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
Hi,

On 12/16/2015 11:40 PM, David Rowley wrote:
> On 17 December 2015 at 05:02, Tomas Vondra <tomas.vondra@2ndquadrant.com
> <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     0) I know the patch does not tweak costing - any plans in this
>
>         direction? Would it be possible to simply use the costing used by
>         semijoin?
>
>
> Many thanks for looking at this.
>
> The patch does tweak the costings so that unique joins are costed in the
> same way as semi joins. It's a very small change, so easy to miss.

Thanks. I missed that bit somehow.

>
> For example, see the change in initial_cost_nestloop()
>
> -       if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
> +       if (jointype == JOIN_SEMI ||
> +               jointype == JOIN_ANTI ||
> +               unique_inner)
>          {
>
> Also see the changes in final_cost_nestloop() and final_cost_hashjoin()
>
>
>     1) nodeHashjoin.c (and other join nodes)
>
>     I've noticed we have this in the ExecHashJoin() method:
>
>       /*
>        * When the inner side is unique or we're performing a
>        * semijoin, we'll consider returning the first match, but
>        * after that we're done with this outer tuple.
>        */
>       if (node->js.unique_inner)
>           node->hj_JoinState = HJ_NEED_NEW_OUTER;
>
>     That seems a bit awkward because the comment speaks about unique
>     joins *OR* semijoins, but the check only references unique_inner.
>     That of course works because we do this in ExecInitHashJoin():
>
>              switch (node->join.jointype)
>              {
>                      case JOIN_SEMI:
>                              hjstate->js.unique_inner = true;
>                              /* fall through */
>
>     Either some semijoins may not be unique - in that case the naming is
>     misleading at best, and we either need to find a better name or
>     simply check two conditions like this:
>
>       if (node->js.unique_inner || node->join.type == JOIN_SEMI)
>          node->hj_JoinState = HJ_NEED_NEW_OUTER;
>
>     Or all semijoins are unique joins, and in that case the comment may
>     need rephrasing. But more importantly, it begs the question why
>     we're detecting this in the executor and not in the planner? Because
>     if we detect it in executor, we can't use this bit in costing, for
>     example.
>
>
> The reason not to detect that in the planner is simply that unique_join
> is meant to mean that the relation on the inner side of the join will at
> most contain a single Tuple which matches each outer relation tuple,
> based on the join condition. I've added no detection for this in semi
> joins, and I'd rather not go blindly setting the flag to true in the
> planner as it simply may not be true for the semi join. At the moment
> that might not matter as we're only using the unique_join flag as an
> optimization in the join nodes, but I'd prefer not to do this as its
> likely we'll want to do more with this flag later, and I'd rather we
> keep the meaning well defined. You might argue that this is also true
> for semi joins, but if down the road somewhere we want to perform some
> checks on the inner relation before the join takes place, and in that
> case the Tuples of the relation might not have the same properties we
> claim they do.
>
> But you're right that reusing the flag in the join nodes is not ideal,
> and the comment is not that great either. I'd really rather go down the
> path of either renaming the variable, or explaining this better in the
> comment. It seems unnecessary to check both for each tuple being joined.
> I'd imagine that might add a little overhead to joins which are not unique.

I'd be very surprised it that had any measurable impact.

> How about changing the comment to:
>
> /*
> * In the event that the inner side has been detected to be
> * unique, as an optimization we can skip searching for any
> * subsequent matching inner tuples, as none will exist.
> * For semijoins unique_inner will always be true, although
> * in this case we don't match another inner tuple as this
> * is the required semi join behavior.
> */
>
> Alternatively or additionally we can rename the variable in the executor
> state, although I've not thought of a name yet that I don't find overly
> verbose: unique_inner_or_semi_join,  match_first_tuple_only.

I'd go with match_first_tuple_only.

>
>
>     2) analyzejoins.c
>
>     I see that this code in join_is_removable()
>
>          innerrel = find_base_rel(root, innerrelid);
>
>          if (innerrel->reloptkind != RELOPT_BASEREL)
>              return false;
>
>     was rewritten like this:
>
>          innerrel = find_base_rel(root, innerrelid);
>
>          Assert(innerrel->reloptkind == RELOPT_BASEREL);
>
>     That suggests that previously the function expected cases where
>     reloptkind may not be RELOPT_BASEREL, but now it'll error out int
>     such cases. I haven't noticed any changes in the surrounding code
>     that'd guarantee this won't happen, but I also haven't been able to
>     come up with an example triggering the assert (haven't been trying
>     too hard). How do we know the assert() makes sense?
>
>
> I'd have changed this as this should be covered by the if
> (!sjinfo->is_unique_join || a few lines up. mark_unique_joins() only
> sets is_unique_join to true if specialjoin_is_unique_join() returns true
> and that function tests reloptkind to ensure its set to RELOPT_BASEREL
> and return false if the relation is not. Perhaps what is missing is a
> comment to explain the function should only be used on RELOPT_BASEREL
> type relations.

Yeah, I think it'd be good to document this contract somewhere. Either 
in the comment before the function, or perhaps right above the Assert().

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Remove array_nulls?
Next
From: Haribabu Kommi
Date:
Subject: Re: Multi-tenancy with RLS