Re: [HACKERS] PoC: full merge join on comparison clause - Mailing list pgsql-hackers
From | Alexander Kuzmenkov |
---|---|
Subject | Re: [HACKERS] PoC: full merge join on comparison clause |
Date | |
Msg-id | 1a4bac9a-640d-31b8-fb6f-cc4bf36abbf9@postgrespro.ru Whole thread Raw |
In response to | Re: [HACKERS] PoC: full merge join on comparison clause (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>) |
Responses |
Re: [HACKERS] PoC: full merge join on comparison clause
|
List | pgsql-hackers |
I tried to fix the things you mentioned and improve the comments. Among other changes, there is now a description of how merge join works with inequalities at the top of nodeMergejoin.c. It also explains why we only support one inequality clause. Some particular points: On 07/06/2018 04:01 PM, Ashutosh Bapat wrote: > - StrategyNumber opstrategy = mergestrategies[iClause]; > + StrategyNumber sort_strategy = mergestrategies[iClause]; > - int op_strategy; > + int join_strategy; > I don't see a reason why should we change the name of variable here. These are > operator strategies and there's no need to change their names. The name change > is introducing unnecessary diffs. These variables have different meaning but their names differ only with an underscore. When I had to change this function, I made mistakes because of this. I'd keep the descriptive names to avoid further confusion. Should this be a separate patch? > I think we should write a wrapper around MJCompare which interprets the result rather > than changing MJCompare itself. OR at least change the name of MJCompare. Renamed the function to MJTestTuples to reflect that it decides whether we join tuples or advance either side. > - * for get_mergejoin_opfamilies(). > + * for get_equality_opfamilies(). > > I think we should leave mergejoin word in there or at least indicate that these > are btree opfamilies so that we don't confuse it with hash equality operator > families. Renamed these to get_btree_equality_opfamilies() and get_btree_comparison_opfamilies(). > + if (parent->mj_Ineq_Present) > + elog(ERROR, "inequality mergejoin clause must be the last one"); > + > > IIUC, this never happens. If it really happens, we have created a path which > can not be used practically. That should never happen. It will help to add a > comment here clarifying that situation. This is just a cross-check for the planner. Added a comment. We should probably use a separate error code for internal errors as opposed to user errors, but I'm not sure if we have one, I see just elog(ERROR) being used everywhere. > > + bool have_inequality = have_inequality_mergeclause(mergeclauses); > > There will be many paths created with different ordering of pathkeys. So, > instead of calling have_inequality_mergeclause() for each of those paths, it's > better to save this status in the path itself when creating the path. I removed this function altogether, because we can just check the last merge clause. When we cost the path, we already have a proper mergejoinable list of clauses, so if there is an inequality clause, it's the last one. > /* Make a pathkey list with this guy first */ > if (l != list_head(all_pathkeys)) > + { > + if (have_inequality && l == list_tail(all_pathkeys)) > + /* Inequality merge clause must be the last, we can't > move it */ > + break; > + > > I am kind of baffled by this change. IIUC the way we create different orderings > of pathkeys here, we are just rotating the pathkeys in circular order. This > means there is exactly one ordering of pathkeys where the pathkey corresponding > to the inequality clause is the last one. This code does not rotate the pathkeys circularly, but puts each of them in the first position, and keeps the rest in the original order. Say, if we have three equality pathkeys, and one inequality pathkey at the end (let's denote them as E1, E2, E3, IE), the permutations it tries will be like this: E1 E2 E3 IE E2 E1 E3 IE E3 E1 E2 IE Does this sound right? > /* Might have no mergeclauses */ > if (nClauses == 0) > return NIL; > > + { > + List *ineq_clauses = find_inequality_clauses(mergeclauses); > + > + if (list_length(ineq_clauses) > 1) > + return NIL; > > Without this patch, when there is an inequality clause with FULL JOIN, we will > not create a merge join path because select_mergejoin_clauses() will set > mergejoin_allowed to false. This means that we won't call > sort_inner_and_outer(). I think this patch also has to do the same i.e. when > there are more than one inequality clauses, select_mergejoin_clauses() should > set mergejoin_allowed to false in case of a FULL JOIN since merge join > machinary won't be able to handle that case. > > If we do that, we could arrange extra.mergeclause_list such that the inequality > clause is always at the end thus finding inequality clause would be easy. I changed select_mergejoin_clauses() to filter multiple inequality clauses and disable join if needed. Now we can use extra inequalities as join filter, if it's not full join. I didn't reorder extra.mergeclause_list there, because this order is ignored later. select_outer_pathkeys_for_merge() chooses the order of pathkeys using some heuristics, and then find_mergeclauses_for_outer_pathkeys() reorders the clauses accordingly. -- Alexander Kuzmenkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
pgsql-hackers by date: