Re: Do we want a hashset type? - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Do we want a hashset type?
Msg-id: d6bf1a75-e86b-6563-1478-6345d804cdda@enterprisedb.com
In response to: Re: Do we want a hashset type? ("Joel Jacobson" <joel@compiler.org>)
Responses: Re: Do we want a hashset type?
List: pgsql-hackers
On 6/12/23 19:34, Joel Jacobson wrote:
> On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
>>>> 1) better hash table implementation
>
> I noticed src/include/common/hashfn.h provides an implementation
> of the Jenkins/lookup3 hash function, and thought maybe
> we could simply use it in hashset?
>
> However, I noticed that according to SMHasher [1],
> Jenkins/lookup3 has some quality problems:
> "UB, 28% bias, collisions, 30% distr, BIC"
>
> Not sure if that's true or maybe not a problem in the PostgreSQL
> implementation?
>
> According to SMHasher, the two fastest 32/64-bit hash functions
> for non-cryptographic purposes without any quality problems
> that are also portable seem to be these two:
>
> wyhash v4.1 (64-bit) [2]
> MiB/sec: 22513.04
> cycl./hash: 29.01
> size: 474
>
> xxh3low (xxHash v3, 64-bit, low 32-bits part) [3]
> MiB/sec: 20722.94
> cycl./hash: 30.26
> size: 756
>
> [1] https://github.com/rurban/smhasher
> [2] https://github.com/wangyi-fudan/wyhash
> [3] https://github.com/Cyan4973/xxHash

But those are numbers for large keys - if you restrict the input to
4B-16B (which is what we planned to do by focusing on int, bigint and
uuid), there's no significant difference:

lookup3:
Small key speed test - 4-byte keys - 30.17 cycles/hash
Small key speed test - 8-byte keys - 31.00 cycles/hash
Small key speed test - 16-byte keys - 49.00 cycles/hash

xxh3low:
Small key speed test - 4-byte keys - 29.00 cycles/hash
Small key speed test - 8-byte keys - 29.58 cycles/hash
Small key speed test - 16-byte keys - 37.00 cycles/hash

But you can try doing some measurements, of course. Or just do
profiling to see how much time we spend in the hash function - I'd bet
it's a pretty tiny fraction of the total time (see the perf sketch
below).

As for the "quality" issues - it's the same algorithm in Postgres, so
it has the same issues. I don't know if that has a measurable impact,
though. I'd guess it does not, particularly for "reasonably small" sets.

>>>> 5) more efficient storage format, with versioning etc.
>> I think the main question is whether to serialize the hash table as
>> is, or compact it in some way. The current code just uses the same
>> thing for both cases - on-disk format and in-memory representation
>> (aggstate). That's simple, but it also means the on-disk value is
>> likely not well compressible (because it's ~50% random data).
>>
>> I've thought about serializing just the values (as a simple array),
>> but that defeats the whole purpose of fast membership checks. I have
>> two ideas:
>>
>> a) sort the data, and use binary search for this compact variant (and
>> then expand it into a "full" hash table if needed)
>>
>> b) use a more compact hash table (with load factor much closer to 1.0)
>>
>> Not sure which of these options is the right one, each has a cost for
>> converting between formats (but a large on-disk value is not free
>> either).
>>
>> That's roughly what I did for the tdigest extension.
>
> Is the choice of hash function (and its in-memory representation)
> independent of the on-disk format question, i.e. could we work
> on experimenting and evaluating different hash functions first,
> to optimise for speed and quality, and then when done, proceed
> and optimise for space, or are the two intertwined somehow?

Not sure what you mean by "optimizing for space" - I imagined the
on-disk format would just use the same hash table with a tiny amount of
free space (say 10% and not ~50%).
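To illustrate option (a) above - the compact variant could store just
the sorted values, with membership checks done by plain binary search.
A rough sketch for the bigint case (the layout and function name are
made up, not actual hashset code):

static bool
hashset_compact_contains(const int64 *values, int nvalues, int64 key)
{
    int     lo = 0;
    int     hi = nvalues - 1;

    /* values[] is assumed to be sorted ascending, no duplicates */
    while (lo <= hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (values[mid] == key)
            return true;
        if (values[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }

    return false;
}

That's O(log n) per lookup instead of O(1), but the serialized value
has no holes at all, which should also make it compress much better.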
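And regarding the profiling I mentioned earlier - something like this
should be enough to see how much of the aggregate is actually spent
hashing (assuming Linux perf; this is just one way to do it):

-- in the session that will run the hashset aggregate
SELECT pg_backend_pid();

$ perf record -g -p <backend pid>
(run the aggregate query in that session, then stop perf)
$ perf report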
My suggestion is to be lazy: just use the lookup3 we already have in
hashfn.c (through hash_bytes or something), and at the same time make
it possible to switch to a different function in the future. I'd store
an ID of the hash function in the set, so that we can support a
different algorithm in the future, if we choose to.
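To make that a bit more concrete, a rough sketch of what I mean by
storing the ID (all the field and constant names are made up, not
actual code):

#include "postgres.h"
#include "common/hashfn.h"

#define HASHSET_HASHFN_LOOKUP3  1       /* hash_bytes() from hashfn.c */
/* room for HASHSET_HASHFN_XXH3 = 2, ... in the future */

typedef struct hashset_t
{
    int32       vl_len_;    /* varlena header (do not touch directly) */
    int32       version;    /* on-disk format version */
    int32       hashfn_id;  /* which hash function built the table */
    int32       nelements;  /* number of elements in the set */
    /* buckets / bitmap / data follow */
} hashset_t;

static uint32
hashset_hash_element(const hashset_t *set, int64 value)
{
    switch (set->hashfn_id)
    {
        case HASHSET_HASHFN_LOOKUP3:
            return hash_bytes((const unsigned char *) &value,
                              sizeof(value));
        default:
            elog(ERROR, "unknown hash function ID: %d", set->hashfn_id);
    }

    return 0;                   /* keep compiler quiet */
}

Lookups and inserts would all go through hashset_hash_element(), so a
set built with one algorithm keeps using it, while newly created sets
can default to whatever we prefer at that point.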
regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company