HashJoin w/option to unique-ify inner rel - Mailing list pgsql-hackers

From Robert Haas
Subject HashJoin w/option to unique-ify inner rel
Date
Msg-id 603c8f070904112100m5367a83dn2db2c7985bb464ff@mail.gmail.com
Whole thread Raw
Responses Re: HashJoin w/option to unique-ify inner rel
List pgsql-hackers
On Thu, Feb 19, 2009 at 5:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@enterprisedb.com> writes:
>> It's tempting to have Hash cheat and just peek at the node beneath it
>> to see if it's a HashAggregate, in which case it could call a special
>> method to request the whole hash. But it would have to know that it's
>> just a plain uniquify and not implementing a GROUP BY.
>
> More to the point, it would have to check if it's unique-ifying on the
> same columns and with the same operators as the upper hash is using.
> If we were going to do something like this, making it a real option to
> the Hash node and teaching the planner about that would be *much*
> easier, and would also allow saner cost estimation.

I've been thinking about this some more and would like someone else to
check my work.

The goal here is to handle a semi-join implemented as a hash join by
unique-ifying the inner rel as we are building the hash table, rather
than as a separate step.  This is correct when (1) the inner rel is
exactly the RHS of the join [if it's more or less, then the strategy
of unique-ifying the inner path and then joining is not even legal; it
will potentially produce different results] and (2) the hash join can
be performed by hashing all of the join clauses of the inner rel [if
not, then the criteria for the unique-ify step don't match the
criteria for the join step, so the join is legal but the two
operations can't be done together].

We don't need to implement any new logic to test for (1).  The current
code calls add_paths_to_joinrel() with jointype JOIN_UNIQUE_INNER only
when we are contemplating a semi-join against exactly the RHS.  So
hash_inner_and_outer() can disregard the possibility of using this
optimization when it sees any other join type.

To verify (2), hash_inner_and_outer must check that (2a) min_lefthand
is a subset of the outer rel and (2b) hashing the inner path would be
a valid method of unique-ifying it.  If (2a) does not hold, then we're
performing only part of the special join at this level, so the set of
columns on which we're hashing for the join is smaller than the set of
columns on which we need the inner rel to be unique, and the two steps
can't be merged.  If (2b) does not hold, then one or more of the join
columns are mergeable but not hashable - so it's legal to
sort-and-unique-ify the inner rel, hash the results, and then join to
the outer rel, handling the non-hashable join clauses as qpquals, but
there's no way for the hash join itself to do the unique-ification.
But if (2a) and (2b) do both hold, then all of the join clauses from
the SpecialJoinInfo will be present in the list of joinrel's join
clause list (since we have all of the LHS relations) and all of them
are hashable (since we know all the ones create_unique_path extracted
are hashable, and it should be extracting the same ones that
hash_inner_and_outer is pulling).

Does anyone see a hole in this logic?

...Robert


pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: [pgtranslation-translators] on gettext plural support
Next
From: Magnus Hagander
Date:
Subject: Re: Path separator