Thread: new hashing function

new hashing function

From

Neil Conway

Date:

03 March 2002, 01:03:37

Hi,

I've implemented Bob Jenkin's hash function for PostgreSQL; more
information on the hash function can be found at
http://burtleburtle.net/bob/hash/doobs.html

I'm posting this to -hackers because I'd like to get some feedback on
whether this actually improves performance. I've tested 2 situations
locally:

    (1) pgbench, btree indexes (to test typical performance)

    (2) pgbench, hash indexes

I haven't looked at the implementation of hash joins; if they happen to
use this hash function as well, that would be another informative
situation to benchmark.

In my local tests, the performance in (1) is basically the same, while
the performance in (2) is increased by 4% to 8%. If you could let me
know the results on your local setup, it should become clear whether
this patch is a performance win overall.

Note that to test case (2) properly, you'll need to drop and re-create
your pgbench indexes after you apply the patch (so that when the new
indexes are created, they use the new hash function). Also, it would be
a good idea to use concurrency level 1; at higher concurrency levels,
hash indexes have a tendancy to deadlock (yes, I'm currently trying to
fix that too ;-) ).

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC

Attachment

jenkins_hash-3.patch

Re: new hashing function

From

Tom Lane

Date:

03 March 2002, 12:32:20

Neil Conway <nconway@klamath.dyndns.org> writes:
> I haven't looked at the implementation of hash joins; if they happen to
> use this hash function as well, that would be another informative
> situation to benchmark.

Hash joins use some chosen-at-random hashing code of their own; see
hashFunc() in src/backend/executor/nodeHash.c.  One of the things on my
to-do list has been to replace that with the datatype-specific hash
functions used for indexing/caching, since the latter seem better
engineered (even before your improvements).

BTW, I don't particularly approve of the parts of this patch that
simply remove unused arguments from various routines.  You aren't
going to save very many cycles that way, and you are reducing
flexibility (eg, the changes to remove nkeys would interfere with
trying to make hash indexes support multiple columns).
        regards, tom lane

Re: new hashing function

From

Tom Lane

Date:

03 March 2002, 13:48:56

Neil Conway <nconway@klamath.dyndns.org> writes:
> I've implemented Bob Jenkin's hash function for PostgreSQL; more
> information on the hash function can be found at
> http://burtleburtle.net/bob/hash/doobs.html

One other thought --- presently, catcache.c is careful to use a prime
size (257) for its hash tables, so that reducing the raw hash value
mod 257 will allow all bits of the hash to contribute to determining
the hash bucket number.  This is necessary because of the relatively
poor randomness of the hash functions.  Perhaps with Jenkins' function
we could dispense with that, and use a fixed power-of-2 size so that the
division becomes a simple bit masking.  On machines with slow integer
division, this could be a nice speedup.  Wouldn't help for hash indexes
or joins though, since they don't use constant hashtable sizes.
        regards, tom lane

Re: new hashing function

From

Neil Conway

Date:

03 March 2002, 14:17:59

On Sun, 2002-03-03 at 12:31, Tom Lane wrote:
> Neil Conway <nconway@klamath.dyndns.org> writes:
> > I haven't looked at the implementation of hash joins; if they happen to
> > use this hash function as well, that would be another informative
> > situation to benchmark.
> 
> Hash joins use some chosen-at-random hashing code of their own; see
> hashFunc() in src/backend/executor/nodeHash.c.  One of the things on my
> to-do list has been to replace that with the datatype-specific hash
> functions used for indexing/caching, since the latter seem better
> engineered (even before your improvements).

Okay, I'll implement this.

> BTW, I don't particularly approve of the parts of this patch that
> simply remove unused arguments from various routines.  You aren't
> going to save very many cycles that way, and you are reducing
> flexibility (eg, the changes to remove nkeys would interfere with
> trying to make hash indexes support multiple columns).

Hmmm... I had viewed removing extra, unused functions to arguments as
basically code cleanup. But I see your point -- although really, the
future purpose behind the extra arguments should probably be
documented... I'll review my changes and remove the ones that seem to
have some future benefit.

Thanks for the feedback Tom.

Cheers,

Neil

-- 
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC