Home > mailing lists

Re: Do we want a hashset type? - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: Do we want a hashset type?
Date	June 12, 2023 22:58:56
Msg-id	d6bf1a75-e86b-6563-1478-6345d804cdda@enterprisedb.com Whole thread Raw
In response to	Re: Do we want a hashset type? ("Joel Jacobson" <joel@compiler.org>)
Responses	Re: Do we want a hashset type?
List	pgsql-hackers

Tree view


On 6/12/23 19:34, Joel Jacobson wrote:
> On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
>>>> 1) better hash table implementation
> 
> I noticed src/include/common/hashfn.h provides implementation
> of the Jenkins/lookup3 hash function, and thought maybe
> we could simply use it in hashset?
> 
> However, I noticed that according to SMHasher [1],
> Jenkins/lookup3 has some quality problems:
> "UB, 28% bias, collisions, 30% distr, BIC"
> 
> Not sure if that's true or maybe not a problem in the PostgreSQL implementation?
> > According to SHHasher, the two fastest 32/64-bit hash functions
> for non-cryptographic purposes without any quality problems
> that are also portable seems to be these two:
> 
> wyhash v4.1 (64-bit) [2]
> MiB/sec: 22513.04
> cycl./hash: 29.01
> size: 474
> 
> xxh3low (xxHash v3, 64-bit, low 32-bits part) [3]
> MiB/sec: 20722.94
> cycl./hash: 30.26
> size: 756
> 
> [1] https://github.com/rurban/smhasher
> [2] https://github.com/wangyi-fudan/wyhash
> [3] https://github.com/Cyan4973/xxHash
> 

But those are numbers for large keys - if you restrict the input to
4B-16B (which is what we planned to do by focusing on int, bigint and
uuid), there's no significant difference:

lookup3:

Small key speed test -    4-byte keys -    30.17 cycles/hash
Small key speed test -    8-byte keys -    31.00 cycles/hash
Small key speed test -   16-byte keys -    49.00 cycles/hash

xxh3low:

Small key speed test -    4-byte keys -    29.00 cycles/hash
Small key speed test -    8-byte keys -    29.58 cycles/hash
Small key speed test -   16-byte keys -    37.00 cycles/hash

But you can try doing some measurements, of course. Or just do profiling
to see how much time we spend in the hash function - I'd bet it's pretty
tiny fraction of the total time.

As for the "quality" issues - it's the same algorithm in Postgres, so it
has the same issues. I don't if that has measurable impact, though. I'd
guess it does not, particularly for "reasonably small" sets.

>>>> 5) more efficient storage format, with versioning etc.
>> I think the main question is whether to serialize the hash table as is,
>> or compact it in some way. The current code just uses the same thing for
>> both cases - on-disk format and in-memory representation (aggstate).
>> That's simple, but it also means the on-disk value is likely not well
>> compressible (because it's ~50% random data.
>>
>> I've thought about serializing just the values (as a simple array), but
>> that defeats the whole purpose of fast membership checks. I have two ideas:
>>
>> a) sort the data, and use binary search for this compact variant (and
>> then expand it into "full" hash table if needed)
>>
>> b) use a more compact hash table (with load factor much closer to 1.0)
>>
>> Not sure which if this option is the right one, each has cost for
>> converting between formats (but large on-disk value is not free either).
>>
>> That's roughly what I did for the tdigest extension.
> 
> Is the choice of hash function (and it's in-memory representation)
> independent of the on-disk format question, i.e. could we work
> on experimenting and evaluating different hash functions first,
> to optimise for speed and quality, and then when done, proceed
> and optimise for space, or are the two intertwined somehow?
> 

Not sure what you mean by "optimizing for space" - I imagined the
on-disk format would just use the same hash table with tiny amount of
free space (say 10% and not ~%50).


My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
(through hash_bytes or something), and at the same time make it possible
to switch to a different function in the future. I'd store and ID of the
hash function in the set, so that we can support a different algorithm
in the future, if we choose to.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

pgsql-hackers by date:

From: Andres Freund
Date: 12 June 2023, 22:24:30
Subject: Re: Let's make PostgreSQL multi-threaded

From: Peter Eisentraut
Date: 12 June 2023, 23:33:16
Subject: Re: Documentation for building with meson

Re: Do we want a hashset type? - Mailing list pgsql-hackers

Previous

Next