Re: a few crazy ideas about hash joins - Mailing list pgsql-hackers

From Robert Haas
Subject Re: a few crazy ideas about hash joins
Date
Msg-id 603c8f070904031315v61fcff1cp3c2442692d492908@mail.gmail.com
Whole thread Raw
In response to Re: a few crazy ideas about hash joins  (Greg Stark <stark@enterprisedb.com>)
Responses Re: a few crazy ideas about hash joins  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Fri, Apr 3, 2009 at 12:55 PM, Greg Stark <stark@enterprisedb.com> wrote:
>>> 1. When the hash is not expected to spill to disk, it preserves the
>>> pathkeys of the outer side of the join.  If the optimizer were allowed
>>> to assume that, it could produce significantly more efficient query
>>> plans in some cases.
>>
>> This is definitely possible, but you will have to dynamically modify the
>> execution path if the hash join ends up to be more than one batch.
>
> Yeah, this item seems to be another case where having a "conditional"
> branch in the plan would be valuable. What's interesting is that it's
> branching based on whether the hash has spilled into batches rather
> than whether the number of rows is greater or less than some breakeven
> point. It looks like these branche nodes are going to need to know
> more than just the generic plan parameters. They're going to need to
> know about specifics like "sort has spilled to disk" or "hash has
> spilled into batches" etc.

In this particular case, the operation that has to be performed is
more specific than "sort", it's "merge this set of sorted tapes".  So
it's more subtle than just inserting a sort node.

> I like the idea of coalescing hash tables. I'm not sure the order in
> which the planner decides on things is conducive to being able to make
> good decisions about it though. Even if we did it afterwards without
> adjusting the planner it might still be worthwhile though.

I don't see why hash_inner_and_outer can't walk the outer path looking
for suitable hashes to reuse.  I think the question is how aggressive
we want to be in performing that search.  If we limit the search to
hashes on base tables without restriction conditions, we'll probably
catch most of the interesting cases, but obviously not all of them.
If we look for hashes on baserels with restriction conditions or
hashes on joinrels, etc., we might pick up a few more cases, but at
the expense of increased planning time.

The advantage of searching in the executor, I suppose, is that the
checks don't have to be as cheap, since you're only checking the plan
that has already won, rather than lots and lots of potential plans.
On the other hand, your costing will be less accurate, which could
lead to bad decisions elsewhere.

> Incidentally a similar optimization is possible for materialize or
> even sorts. They may not come up nearly so often since you would
> normally want to go around adding indexes if you're repeatedly sorting
> on the same columns. Biut it might not be any harder to get them all
> in one swoop.

At least in my experience, sort and materialize nodes are pretty rare,
so I'm not sure it would be worth the time it would take to search for
them.  In most cases it seems to be cheaper to hash the inner and
outer paths than to sort even one of them and then merge-join the
result.  When I do get these sorts of paths, they tend to be in
larger, more complex queries where there's less chance of useful
reuse.  But my experience might not be representative...

...Robert


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: reloptions with a "namespace"
Next
From: Tom Lane
Date:
Subject: Re: a few crazy ideas about hash joins