Re: [HACKERS] PoC: full merge join on comparison clause - Mailing list pgsql-hackers

From Alexander Kuzmenkov
Subject Re: [HACKERS] PoC: full merge join on comparison clause
Date
Msg-id 1a4bac9a-640d-31b8-fb6f-cc4bf36abbf9@postgrespro.ru
Whole thread Raw
In response to Re: [HACKERS] PoC: full merge join on comparison clause  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Responses Re: [HACKERS] PoC: full merge join on comparison clause  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
List pgsql-hackers
I tried to fix the things you mentioned and improve the comments. Among 
other changes, there is now a description of how merge join works with 
inequalities at the top of nodeMergejoin.c. It also explains why we only 
support one inequality clause.

Some particular points:

On 07/06/2018 04:01 PM, Ashutosh Bapat wrote:
> -        StrategyNumber opstrategy = mergestrategies[iClause];
> +        StrategyNumber sort_strategy = mergestrategies[iClause];
> -        int            op_strategy;
> +        int            join_strategy;
> I don't see a reason why should we change the name of variable here. These are
> operator strategies and there's no need to change their names. The name change
> is introducing unnecessary diffs.

These variables have different meaning but their names differ only with 
an underscore. When I had to change this function, I made mistakes 
because of this. I'd keep the descriptive names to avoid further 
confusion. Should this be a separate patch?

> I think we should write a wrapper around MJCompare which interprets the result rather
> than changing MJCompare itself. OR at least change the name of MJCompare.

Renamed the function to MJTestTuples to reflect that it decides whether 
we join tuples or advance either side.


> -         * for get_mergejoin_opfamilies().
> +         * for get_equality_opfamilies().
>
> I think we should leave mergejoin word in there or at least indicate that these
> are btree opfamilies so that we don't confuse it with hash equality operator
> families.

Renamed these to get_btree_equality_opfamilies() and 
get_btree_comparison_opfamilies().



> +        if (parent->mj_Ineq_Present)
> +            elog(ERROR, "inequality mergejoin clause must be the last one");
> +
>
> IIUC, this never happens. If it really happens, we have created a path which
> can not be used practically. That should never happen. It will help to add a
> comment here clarifying that situation.

This is just a cross-check for the planner. Added a comment. We should 
probably use a separate error code for internal errors as opposed to 
user errors, but I'm not sure if we have one, I see just elog(ERROR) 
being used everywhere.


>
> +    bool        have_inequality = have_inequality_mergeclause(mergeclauses);
>
> There will be many paths created with different ordering of pathkeys. So,
> instead of calling have_inequality_mergeclause() for each of those paths, it's
> better to save this status in the path itself when creating the path.

I removed this function altogether, because we can just check the last 
merge clause. When we cost the path, we already have a proper 
mergejoinable list of clauses, so if there is an inequality clause, it's 
the last one.


>           /* Make a pathkey list with this guy first */
>           if (l != list_head(all_pathkeys))
> +        {
> +            if (have_inequality && l == list_tail(all_pathkeys))
> +                /* Inequality merge clause must be the last, we can't
> move it */
> +                break;
> +
>
> I am kind of baffled by this change. IIUC the way we create different orderings
> of pathkeys here, we are just rotating the pathkeys in circular order. This
> means there is exactly one ordering of pathkeys where the pathkey corresponding
> to the inequality clause is the last one.

This code does not rotate the pathkeys circularly, but puts each of them 
in the first position, and keeps the rest in the original order.
Say, if we have three equality pathkeys, and one inequality pathkey at 
the end (let's denote them as E1, E2, E3, IE), the permutations it tries 
will be like this:
E1 E2 E3 IE
E2 E1 E3 IE
E3 E1 E2 IE
Does this sound right?


>       /* Might have no mergeclauses */
>       if (nClauses == 0)
>           return NIL;
>
> +    {
> +        List *ineq_clauses = find_inequality_clauses(mergeclauses);
> +
> +        if (list_length(ineq_clauses) > 1)
> +            return NIL;
>
> Without this patch, when there is an inequality clause with FULL JOIN, we will
> not create a merge join path because select_mergejoin_clauses() will set
> mergejoin_allowed to false. This means that we won't call
> sort_inner_and_outer(). I think this patch also has to do the same i.e. when
> there are more than one inequality clauses, select_mergejoin_clauses() should
> set mergejoin_allowed to false in case of a FULL JOIN since merge join
> machinary won't be able to handle that case.
>
> If we do that, we could arrange extra.mergeclause_list such that the inequality
> clause is always at the end thus finding inequality clause would be easy.

I changed select_mergejoin_clauses() to filter multiple inequality 
clauses and disable join if needed. Now we can use extra inequalities as 
join filter, if it's not full join. I didn't reorder 
extra.mergeclause_list there, because this order is ignored later. 
select_outer_pathkeys_for_merge() chooses the order of pathkeys using 
some heuristics, and then find_mergeclauses_for_outer_pathkeys() 
reorders the clauses accordingly.

-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: [PG-11] Potential bug related to INCLUDE clause of CREATE INDEX
Next
From: Andrey Borodin
Date:
Subject: Re: [PG-11] Potential bug related to INCLUDE clause of CREATE INDEX