From: Robert Haas
Subject: Re: a few crazy ideas about hash joins
Msg-id: 603c8f070904041939j2b28f322mfbcfdfa6294f6902@mail.gmail.com
In response to: Re: a few crazy ideas about hash joins (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: a few crazy ideas about hash joins (Bruce Momjian <bruce@momjian.us>)
List: pgsql-hackers

On Fri, Apr 3, 2009 at 5:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Apr 3, 2009 at 4:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Correct, but you've got the details all wrong.  The real problem is that
>>> the planner might discard a join path hash(A,B) at level 2 because it
>>> loses compared to, say, merge(A,B).  But when we get to level three,
>>> perhaps hash(hash(A,B),C) would've been the best plan due to synergy
>>> of the two hashes.  We'll never find that out unless we keep the
>>> "inferior" hash path around.  We can certainly do that; the question
>>> is what's it going to cost us to allow more paths to survive to the
>>> next join level.  (And I'm afraid the answer may be "plenty"; it's a
>>> combinatorial explosion we're looking at here.)
>
>> That would be crazy.  I think doing it the way I suggested is correct,
>> just not guaranteed to catch every case.  The reality is that even if
>> we took Greg Stark's suggestion of detecting this situation only in
>> the executor, we'd still get some benefit out of this.  If we take my
>> intermediate approach, we'll catch more cases where this is a win.
>> What you're suggesting here would catch every conceivable case, but at
>> the expense of what I'm sure would be an unacceptable increase in
>> planning time for very limited benefit.
>
> Maybe, maybe not.  I've seen plenty of plans that have several
> mergejoins stacked up on top of each other with no intervening sorts.
> There is 0 chance that the planner would have produced that if it
> thought that it had to re-sort at each level; something else would have
> looked cheaper.  I think that your proposals will end up getting very
> little of the possible benefit, because the planner will fail to choose
> plan trees in which the optimization can be exploited.

Well, I'm all ears if you have suggestions for improvement.  For
sorts, we use PathKeys to represent the ordering of each path, and we
keep the best paths for each distinct set of pathkeys.  By analogy, we
could maintain a list of PathHash structures for each path,
representing the tables that have already been hashed.  add_path()
would then have to consider both the PathHash structures and the
PathKey structures before concluding that a path was definitely worse
than some previously found path.  At each level of the join tree, we'd
need to discard PathHash structures that provably have no further use
(e.g. those for a base table that does not appear again above the
level of the join already planned), so that we don't keep paths around
that looked better only because we didn't yet know the hashes they
carry could never be reused.  Maybe that wouldn't even be that
expensive, because in lots of cases we'd know up front that the
relevant table doesn't appear elsewhere in the query and so would
never save any extra paths.  But I think we'd have to write the code
and benchmark it to really know.
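
To make that a bit more concrete, here's roughly the shape I'm
imagining.  This is purely a sketch: PathHash, the path_hashes field,
costs_better_than(), compare_path_hashes(), and remove_path() are all
hypothetical names, nothing like them exists in the tree today
(pathkeys_contained_in() and Relids are real).

    /*
     * Hypothetical node: one entry per hash table this path has
     * already built and could reuse at a higher join level.
     */
    typedef struct PathHash
    {
        Relids      hashed_relids;  /* base rel(s) in the hash table */
        List       *hash_clauses;   /* join clauses it was hashed on */
    } PathHash;

    /*
     * Sketch of the extra dominance test in add_path(): an old path
     * may only be discarded if the new path is at least as good on
     * cost, on pathkeys, AND on the hashes it carries.
     */
    if (costs_better_than(new_path, old_path) &&        /* hypothetical */
        pathkeys_contained_in(old_path->pathkeys, new_path->pathkeys) &&
        compare_path_hashes(old_path->path_hashes,      /* hypothetical */
                            new_path->path_hashes))
        remove_path(parent_rel, old_path);              /* hypothetical */

The point is that a path with strictly worse cost could still survive
add_path() if it carries a hash table that no cheaper path has built.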

I guess the reason I'm not too worked up about this is that, in my
experience, the planner nearly always prefers hash joins on small
tables, even when an index is present - the queries I'm worried about
optimizing don't need any additional encouragement to use hash joins;
they're doing it already.  But certainly it doesn't hurt to see how
many cases we can pick up.

...Robert

