Thread: Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?

Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?

From

Scott Carey

Date:

13 April 2011, 17:08:00

Sorry for resurrecting this thread, but this has been in my outbox for months and I think it is important: On Oct 27, 2010, at 12:56 PM, Tom Lane wrote: > Scott Carey writes: > > Why does hashjoin behave poorly when the inner relation is not > > uniformly distributed and the outer is? > Because a poorly distributed inner relation leads to long hash chains. > In the very worst case, all the keys are on the same hash chain and it > degenerates to a nested-loop join. (There is an assumption in the > costing code that the longer hash chains also tend to get searched more > often, which maybe doesn't apply if the outer rel is flat, but it's not > obvious that it's safe to not assume that.) I disagree. Either 1: The estimator is wrong or 2: The hash data structure is flawed. A pathological skew case (all relations with the same key), should be _cheaper_ to probe. There should be only _one_ entry in the hash (for the one key), and that entry will be a list of all relations matching the key. Therefore, hash probes will either instantly fail to match on an empty bucket, fail to match the one key with one compare, or match the one key and join on the matching list. In particular for anti-join, high skew should be the best case scenario. A hash structure that allows multiple entries per key is inappropriate for skewed data, because it is not O(n). One that has one entry per key remains O(n) for all skew. Furthermore, the hash buckets and # of entries is proportional to n_distinct in this case, and smaller and more cache and memory friendly to probe. > Not really. It's still searching a long hash chain; maybe it will find > an exact match early in the chain, or maybe not. It's certainly not > *better* than antijoin with a well-distributed inner rel. There shouldn't be long hash chains. A good hash function + proper bucket count + one entry per key = no long chains. > Although the > point is moot, anyway, since if it's an antijoin there is only one > candidate for which rel to put on the outside. You can put either relation on the outside with an anti-join, but would need a different algorithm and cost estimator if done the other way around. Construct a hash on the join key, that keeps a list of relations per key, iterate over the other relation, and remove the key and corresponding list from the hash when there is a match, when complete the remaining items in the hash are the result of the join (also already grouped by the key). It could be terminated early if all entries are removed. This would be useful if the hash was small, the other side of the hash too large to fit in memory, and alternative was a massive sort on the other relation. Does the hash cost estimator bias towards smaller hashes due to hash probe cost increasing with hash size due to processor caching effects? Its not quite O(n) due to caching effects. > regards, tom lane

Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?

From

Scott Carey

Date:

13 April 2011, 17:22:48

New email-client nightmares!  Fixed below.  I think.
-------------

Sorry for resurrecting this thread, but this has been in my outbox for
months and I think it is important:

>On Oct 27, 2010, at 12:56 PM, Tom Lane wrote:
>
>
>> Scott Carey  writes:
>>Why does hashjoin behave poorly when the inner relation is not
>
>>uniformly distributed and the outer is?
>
>
>
> Because a poorly distributed inner relation leads to long hash chains.
> In the very worst case, all the keys are on the same hash chain and it
> degenerates to a nested-loop join.  (There is an assumption in the
> costing code that the longer hash chains also tend to get searched more
> often, which maybe doesn't apply if the outer rel is flat, but it's not
> obvious that it's safe to not assume that.)
>

I disagree.  Either
1:  The estimator is wrong
or
2:  The hash data structure is flawed.

A pathological skew case (all relations with the same key), should be
_cheaper_ to probe.   There should be only _one_ entry in the hash (for
the one key), and that entry will be a list of all relations matching the
key.  Therefore, hash probes will either instantly fail to match on an
empty bucket, fail to match the one key with one compare, or match the one
key and join on the matching list.

In particular for anti-join, high skew should be the best case scenario.

A hash structure that allows multiple entries per key is inappropriate for
skewed data, because it is not O(n).  One that has one entry per key
remains O(n) for all skew.  Furthermore, the hash buckets and # of entries
is proportional to n_distinct in this case, and smaller and more cache and
memory friendly to probe.

>Not really.  It's still searching a long hash chain; maybe it will find
> an exact match early in the chain, or maybe not.  It's certainly not
> *better* than antijoin with a well-distributed inner rel.

There shouldn't be long hash chains.  A good hash function + proper bucket
count + one entry per key = no long chains.

> Although the
> point is moot, anyway, since if it's an antijoin there is only one
> candidate for which rel to put on the outside.

You can put either relation on the outside with an anti-join, but would
need a different algorithm and cost estimator if done the other way
around.  Construct a hash on the join key, that keeps a list of relations
per key, iterate over the other relation, and remove the key and
corresponding list from the hash when there is a match, when complete the
remaining items in the hash are the result of the join (also already
grouped by the key).  It could be terminated early if all entries are
removed.
This would be useful if the hash was small, the other side of the hash too
large to fit in memory, and alternative was a massive sort on the other
relation.

Does the hash cost estimator bias towards smaller hashes due to hash probe
cost increasing with hash size due to processor caching effects?  Its not
quite O(n) due to caching effects.

>
>>             regards, tom lane

Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?

From

Tom Lane

Date:

13 April 2011, 17:35:51

Scott Carey <scott@richrelevance.com> writes:
>> On Oct 27, 2010, at 12:56 PM, Tom Lane wrote:
>> Because a poorly distributed inner relation leads to long hash chains.
>> In the very worst case, all the keys are on the same hash chain and it
>> degenerates to a nested-loop join.

> A pathological skew case (all relations with the same key), should be
> _cheaper_ to probe.

I think you're missing the point, which is that all the hash work is
just pure overhead in such a case (and it is most definitely not
zero-cost overhead).  You might as well just do a nestloop join.
Hashing is only beneficial to the extent that it allows a smaller subset
of the inner relation to be compared to each outer-relation tuple.
So I think biasing against skew-distributed inner relations is entirely
appropriate.

            regards, tom lane

Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?

From

Scott Carey

Date:

13 April 2011, 18:31:20

On 4/13/11 10:35 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

>Scott Carey <scott@richrelevance.com> writes:
>>> On Oct 27, 2010, at 12:56 PM, Tom Lane wrote:
>>> Because a poorly distributed inner relation leads to long hash chains.
>>> In the very worst case, all the keys are on the same hash chain and it
>>> degenerates to a nested-loop join.
>
>> A pathological skew case (all relations with the same key), should be
>> _cheaper_ to probe.
>
>I think you're missing the point, which is that all the hash work is
>just pure overhead in such a case (and it is most definitely not
>zero-cost overhead).  You might as well just do a nestloop join.
>Hashing is only beneficial to the extent that it allows a smaller subset
>of the inner relation to be compared to each outer-relation tuple.
>So I think biasing against skew-distributed inner relations is entirely
>appropriate.

No it is not pure overhead, and nested loops is far slower.  The only way
it is the same is if there is only _one_ hash bucket!  And that would be a
bug...
In the pathological skew case:

Example:  1,000,000 outer relations.  10,000 inner relations, all with one
key.

Nested loops join:
10 billion compares.

Hashjoin with small inner relation hashed with poor hash data structure:
1. 10,000 hash functions to build the hash (10,000 'puts').
2. 1,000,000 hash functions to probe (1,000,000 'gets').
3. Only keys that fall in the same bucket trigger a compare.  Assume 100
hash buckets (any less is a bug, IMO) and skew such that the bucket is 10x
more likely than average to be hit. 100,000 hit the bucket.  Those that
match are just like nested loops -- this results in 1 billion compares.
All other probes hit an empty bucket and terminate without a compare.
Total: 1.01 million hash functions and bucket seeks, 0.01 of which are
hash 'puts', + 1 billion compares

Hashjoin with 'one entry per key; entry value is list of matching
relations' data structure:
1. 10,000 hash functions to build the hash (10,000 'puts').
2. 1,000,000 hash functions to probe (1,000,000 'gets').
3. Only keys that fall in the same bucket trigger a compare.  Assume 100
hash buckets and enough skew so that the bucket is 10x as likely to be
hit. 100,000 hit bucket.  Those that match only do a compare against one
key -- this results in 100,000 compares.
Total: 1.01 million hash functions and bucket seeks, 0.01 of which are
slightly more expensive hash 'puts', + 0.1 million compares

10 billion compares is much more expensive than either hash scenario.  If
a hash function is 5x the cost of a compare, and a hash 'put' 2x a 'get'
then the costs are about:

10 billion,
1.006 billion,
~6 million

The cost of the actual join output is significant (pairing relations and
creating new relations for output) but close to constant in all three.

In both the 'hash the big one' and 'hash the small one' case you have to
calculate the hash and seek into the hash table the same number of times.
10,000 hash calculations and 'puts' + 1,000,000 hash calculations and
'gets', versus 1,000,000 hash 'puts' and 10,000 'gets'.
But in one case that table is smaller and more cache efficient, making
those gets and puts cheaper.

Which is inner versus outer changes the number of buckets, which can alter
the number of expected compares, but that can be controlled for benefit --
the ratio of keys to buckets can be controlled.  If you choose the smaller
relation, you can afford to overcompensate with more buckets, resulting in
more probes on empty buckets and thus fewer compares.

Additionally, a hash structure that only has one entry per key can greatly
reduce the number of compares and make hashjoin immune to skew from the
cost perspective.  It also makes it so that choosing the smaller relation
over the big one to hash is always better provided the number of buckets
is chosen well.

>
>            regards, tom lane

Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?

From

Robert Haas

Date:

12 May 2011, 03:21:25

On Wed, Apr 13, 2011 at 1:22 PM, Scott Carey <scott@richrelevance.com> wrote:
> A pathological skew case (all relations with the same key), should be
> _cheaper_ to probe.   There should be only _one_ entry in the hash (for
> the one key), and that entry will be a list of all relations matching the
> key.  Therefore, hash probes will either instantly fail to match on an
> empty bucket, fail to match the one key with one compare, or match the one
> key and join on the matching list.
>
> In particular for anti-join, high skew should be the best case scenario.

I think this argument may hold some water for an anti-join, and maybe
for a semi-join, but it sure doesn't seem right for any kind of join
that has to iterate over all matches (rather than just the first one);
that is, inner, left, right, or full.

> A hash structure that allows multiple entries per key is inappropriate for
> skewed data, because it is not O(n).  One that has one entry per key
> remains O(n) for all skew.  Furthermore, the hash buckets and # of entries
> is proportional to n_distinct in this case, and smaller and more cache and
> memory friendly to probe.

I don't think this argument is right.  The hash table is sized for a
load factor significantly less than one, so if there are multiple
entries in a bucket, it is fairly likely that they are all for the
same key.  Granted, we have to double-check the keys to figure that
out; but I believe that the data structure you are proposing would
require similar comparisons.  The only difference is that they'd be
required when building the hash table, rather than when probing it.

> You can put either relation on the outside with an anti-join, but would
> need a different algorithm and cost estimator if done the other way
> around.  Construct a hash on the join key, that keeps a list of relations
> per key, iterate over the other relation, and remove the key and
> corresponding list from the hash when there is a match, when complete the
> remaining items in the hash are the result of the join (also already
> grouped by the key).  It could be terminated early if all entries are
> removed.
> This would be useful if the hash was small, the other side of the hash too
> large to fit in memory, and alternative was a massive sort on the other
> relation.

This would be a nice extension of commit
f4e4b3274317d9ce30de7e7e5b04dece7c4e1791.

> Does the hash cost estimator bias towards smaller hashes due to hash probe
> cost increasing with hash size due to processor caching effects?  Its not
> quite O(n) due to caching effects.

I don't think we account for that (and I'm not convinced we need to).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company