Thread: new hashing function
Hi, I've implemented Bob Jenkin's hash function for PostgreSQL; more information on the hash function can be found at http://burtleburtle.net/bob/hash/doobs.html I'm posting this to -hackers because I'd like to get some feedback on whether this actually improves performance. I've tested 2 situations locally: (1) pgbench, btree indexes (to test typical performance) (2) pgbench, hash indexes I haven't looked at the implementation of hash joins; if they happen to use this hash function as well, that would be another informative situation to benchmark. In my local tests, the performance in (1) is basically the same, while the performance in (2) is increased by 4% to 8%. If you could let me know the results on your local setup, it should become clear whether this patch is a performance win overall. Note that to test case (2) properly, you'll need to drop and re-create your pgbench indexes after you apply the patch (so that when the new indexes are created, they use the new hash function). Also, it would be a good idea to use concurrency level 1; at higher concurrency levels, hash indexes have a tendancy to deadlock (yes, I'm currently trying to fix that too ;-) ). Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC
Attachment
Neil Conway <nconway@klamath.dyndns.org> writes: > I haven't looked at the implementation of hash joins; if they happen to > use this hash function as well, that would be another informative > situation to benchmark. Hash joins use some chosen-at-random hashing code of their own; see hashFunc() in src/backend/executor/nodeHash.c. One of the things on my to-do list has been to replace that with the datatype-specific hash functions used for indexing/caching, since the latter seem better engineered (even before your improvements). BTW, I don't particularly approve of the parts of this patch that simply remove unused arguments from various routines. You aren't going to save very many cycles that way, and you are reducing flexibility (eg, the changes to remove nkeys would interfere with trying to make hash indexes support multiple columns). regards, tom lane
Neil Conway <nconway@klamath.dyndns.org> writes: > I've implemented Bob Jenkin's hash function for PostgreSQL; more > information on the hash function can be found at > http://burtleburtle.net/bob/hash/doobs.html One other thought --- presently, catcache.c is careful to use a prime size (257) for its hash tables, so that reducing the raw hash value mod 257 will allow all bits of the hash to contribute to determining the hash bucket number. This is necessary because of the relatively poor randomness of the hash functions. Perhaps with Jenkins' function we could dispense with that, and use a fixed power-of-2 size so that the division becomes a simple bit masking. On machines with slow integer division, this could be a nice speedup. Wouldn't help for hash indexes or joins though, since they don't use constant hashtable sizes. regards, tom lane
On Sun, 2002-03-03 at 12:31, Tom Lane wrote: > Neil Conway <nconway@klamath.dyndns.org> writes: > > I haven't looked at the implementation of hash joins; if they happen to > > use this hash function as well, that would be another informative > > situation to benchmark. > > Hash joins use some chosen-at-random hashing code of their own; see > hashFunc() in src/backend/executor/nodeHash.c. One of the things on my > to-do list has been to replace that with the datatype-specific hash > functions used for indexing/caching, since the latter seem better > engineered (even before your improvements). Okay, I'll implement this. > BTW, I don't particularly approve of the parts of this patch that > simply remove unused arguments from various routines. You aren't > going to save very many cycles that way, and you are reducing > flexibility (eg, the changes to remove nkeys would interfere with > trying to make hash indexes support multiple columns). Hmmm... I had viewed removing extra, unused functions to arguments as basically code cleanup. But I see your point -- although really, the future purpose behind the extra arguments should probably be documented... I'll review my changes and remove the ones that seem to have some future benefit. Thanks for the feedback Tom. Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC