HashJoin w/option to unique-ify inner rel - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | HashJoin w/option to unique-ify inner rel |
Date | |
Msg-id | 603c8f070904112100m5367a83dn2db2c7985bb464ff@mail.gmail.com Whole thread Raw |
Responses |
Re: HashJoin w/option to unique-ify inner rel
|
List | pgsql-hackers |
On Thu, Feb 19, 2009 at 5:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <stark@enterprisedb.com> writes: >> It's tempting to have Hash cheat and just peek at the node beneath it >> to see if it's a HashAggregate, in which case it could call a special >> method to request the whole hash. But it would have to know that it's >> just a plain uniquify and not implementing a GROUP BY. > > More to the point, it would have to check if it's unique-ifying on the > same columns and with the same operators as the upper hash is using. > If we were going to do something like this, making it a real option to > the Hash node and teaching the planner about that would be *much* > easier, and would also allow saner cost estimation. I've been thinking about this some more and would like someone else to check my work. The goal here is to handle a semi-join implemented as a hash join by unique-ifying the inner rel as we are building the hash table, rather than as a separate step. This is correct when (1) the inner rel is exactly the RHS of the join [if it's more or less, then the strategy of unique-ifying the inner path and then joining is not even legal; it will potentially produce different results] and (2) the hash join can be performed by hashing all of the join clauses of the inner rel [if not, then the criteria for the unique-ify step don't match the criteria for the join step, so the join is legal but the two operations can't be done together]. We don't need to implement any new logic to test for (1). The current code calls add_paths_to_joinrel() with jointype JOIN_UNIQUE_INNER only when we are contemplating a semi-join against exactly the RHS. So hash_inner_and_outer() can disregard the possibility of using this optimization when it sees any other join type. To verify (2), hash_inner_and_outer must check that (2a) min_lefthand is a subset of the outer rel and (2b) hashing the inner path would be a valid method of unique-ifying it. If (2a) does not hold, then we're performing only part of the special join at this level, so the set of columns on which we're hashing for the join is smaller than the set of columns on which we need the inner rel to be unique, and the two steps can't be merged. If (2b) does not hold, then one or more of the join columns are mergeable but not hashable - so it's legal to sort-and-unique-ify the inner rel, hash the results, and then join to the outer rel, handling the non-hashable join clauses as qpquals, but there's no way for the hash join itself to do the unique-ification. But if (2a) and (2b) do both hold, then all of the join clauses from the SpecialJoinInfo will be present in the list of joinrel's join clause list (since we have all of the LHS relations) and all of them are hashable (since we know all the ones create_unique_path extracted are hashable, and it should be extracting the same ones that hash_inner_and_outer is pulling). Does anyone see a hole in this logic? ...Robert
pgsql-hackers by date: