Re: hash_create API changes (was Re: speedup tidbitmap patch: hash BlockNumber) - Mailing list pgsql-hackers

From Jim Nasby
Subject Re: hash_create API changes (was Re: speedup tidbitmap patch: hash BlockNumber)
Date
Msg-id 549A5CDB.50802@BlueTreble.com
Whole thread Raw
In response to Re: hash_create API changes (was Re: speedup tidbitmap patch: hash BlockNumber)  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
Responses Re: hash_create API changes (was Re: speedup tidbitmap patch: hash BlockNumber)
Re: hash_create API changes (was Re: speedup tidbitmap patch: hash BlockNumber)
List pgsql-hackers
On 12/20/14, 2:13 PM, Jim Nasby wrote:
> On 12/20/14, 11:51 AM, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>>> On 2014-12-19 22:03:55 -0600, Jim Nasby wrote:
>>>> What I am thinking is not using all of those fields in their raw form to calculate the hash value. IE: something
analogousto: 
>>>> hash_any(SharedBufHash, (rot(forkNum, 2) | dbNode) ^ relNode) << 32 | blockNum)
>>>>
>>>> perhaps that actual code wouldn't work, but I don't see why we couldn't do something similar... am I missing
something?
>>
>>> I don't think that'd improve anything. Jenkin's hash does have a quite
>>> mixing properties, I don't believe that the above would improve the
>>> quality of the hash.
>>
>> I think what Jim is suggesting is to intentionally degrade the quality of
>> the hash in order to let it be calculated a tad faster.  We could do that
>> but I doubt it would be a win, especially in systems with lots of buffers.
>> IIRC, when we put in Jenkins hashing to replace the older homebrew hash
>> function, it improved performance even though the hash itself was slower.
>
> Right. Now that you mention it, I vaguely recall the discussions about changing the hash function to reduce
collisions.
>
> I'll still take a look at fash-hash, but it's looking like there may not be anything we can do here unless we change
howwe identify relation files (combining dbid, tablespace id, fork number and file id, at least for searching). If we
had64bit hash support then maybe that'd be a significant win, since you wouldn't need to hash at all. But that
certainlydoesn't seem to be low-hanging fruit to me... 

I've taken a look at a number of different hash algorithms, testing them with initially with SMHasher [1] and then with
pgbench.

It's worth noting that a lot of work has been done in the area of hash algo's since we last updated the hash_any
algorithmin 2009 [2]. It's probably worth revisiting this every other release or so. 

Most of my SMHasher results are at [3]. I suspect SpookyHash is close to what we currently have, so that's what I used
fora baseline. I determined that fash-hash (called superfast in results) was faster than Spooky, but not as fast as
CityHash64[4]or xxhash[5]. CityHash and xxhash had similar performance, but xxhash seems to have slightly better
characteristicsaccording to SMHasher, and more important, it's in C, not C++. However, CityHash has been replaced by
farmhash[7],which might be faster than xxhash. 

I did a quick hack at using xxhash ONLY for shared buffer lookup [6]. I've attached the full patch, as well as a
versionthat omits the new files. Note that currently xxhash is setup in such a way that we'd get different results
dependingon endian-ness, so we couldn't just drop this in as-is across the board. Of course, there's also the question
ofif we'd want to force a REINDEX on hash indexes. 

pgbench results are below. Select-only testing was done first, then read-write. There were several select-only runs on
bothto warm shared_buffers (set to 512MB for this test, and fsync is off). Database initialized with bin/pgbench -i -F
100-s 10. 

pgbench -S -T10 -c 4 -j 4
master:
tps = 9556.356145 (excluding connections establishing)
tps = 9897.324917 (excluding connections establishing)
tps = 9287.286907 (excluding connections establishing)
tps = 10210.130833 (excluding connections establishing)

XXH32:
tps = 32462.754437 (excluding connections establishing)
tps = 33232.144511 (excluding connections establishing)
tps = 33082.436760 (excluding connections establishing)
tps = 33597.904532 (excluding connections establishing)

pgbench -T10 -c 4 -j 4
master:
tps = 2299.314145 (excluding connections establishing)
tps = 2029.956749 (excluding connections establishing)
tps = 2067.462362 (excluding connections establishing)

XXH32:
tps = 2653.919321 (excluding connections establishing)
tps = 2615.764370 (excluding connections establishing)
tps = 2952.144461 (excluding connections establishing)

Questions:
Do we want to do what we've previously done and cut/paste/modify this code into our repo? Given how rapidly hash
algorithmsseem to be changing I'm inclined to minimize the amount of effort required to pull a new one in... 

Can someone with a big-endian CPU see what the impact of XXH_FORCE_NATIVE_FORMAT 1 vs 0? If there's a notable
differencewe might want to do different things for on-disk vs in-memory hashes. 

For that matter, assuming we adopt this, do we want to replace the index hash functions too? SMHasher shows that
CityHashis ~50% faster than XXHash for 262144 byte keys. Even SpookyHash is about 17% faster on that key size. So if we
wantto do something with hash indexes, we'd probably be better off with City or Farm hash than XXHash. 

[1] https://code.google.com/p/smhasher/
[2] https://github.com/postgres/postgres/commit/8205258fa675115439017b626c4932d5fefe2ea8
[3] https://github.com/decibel/hash_testing/tree/master/results
[4] https://code.google.com/p/cityhash/
[5] https://code.google.com/p/xxhash/
[6] https://github.com/decibel/postgres/commit/b82e7a3b936b3950b296514f6ee0542132f11e4a
[7] https://code.google.com/p/farmhash/
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: install libpq.dll in bin directory on Windows / Cygwin
Next
From: Jim Nasby
Date:
Subject: Re: hash_create API changes (was Re: speedup tidbitmap patch: hash BlockNumber)