Re: Re: custom hash-based COUNT(DISTINCT) aggregate - unexpectedly high memory consumption - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Re: custom hash-based COUNT(DISTINCT) aggregate - unexpectedly high memory consumption
Date
Msg-id 525311F1.5020408@fuzzy.cz
Whole thread Raw
In response to Re: Re: custom hash-based COUNT(DISTINCT) aggregate - unexpectedly high memory consumption  ("ktm@rice.edu" <ktm@rice.edu>)
Responses Re: custom hash-based COUNT(DISTINCT) aggregate - unexpectedly high memory consumption  (Huchev <hugochevrain@gmail.com>)
List pgsql-hackers
On 7.10.2013 14:50, ktm@rice.edu wrote:
> On Mon, Oct 07, 2013 at 12:41:58AM +0200, Tomas Vondra wrote:
>>> 2. Consider using a simpler/faster hash function, like FNV[1] or Jenkins[2].
>>>    For fun, try not hashing those ints at all and see how that performs (that,
>>>    I think, is what you get from HashSet<int> in Java/C#).
>>
>> I've used crc32 mostly because it's easily available in the code (i.e.
>> I'm lazy), but I've done some quick research / primitive benchmarking
>> too. For example hashing 2e9 integers takes this much time:
>>
>> FNV-1   = 11.9
>> FNV-1a  = 11.9
>> jenkins = 38.8
>> crc32   = 32.0
>>
>> So it's not really "slow" and the uniformity seems to be rather good.
>>
>> I'll try FNV in the future, however I don't think that's the main issue
>> right now.
>>
> Hi Tomas,
> 
> If you are going to use a function that is not currently in the code,
> please consider xxhash:
> 
> http://code.google.com/p/xxhash/
> 
> Here are some benchmarks for some of the faster hash functions:
> 
> Name            Speed       Q.Score   Author
> xxHash          5.4 GB/s     10
> MumurHash 3a    2.7 GB/s     10       Austin Appleby
> SpookyHash      2.0 GB/s     10       Bob Jenkins
> SBox            1.4 GB/s      9       Bret Mulvey
> Lookup3         1.2 GB/s      9       Bob Jenkins
> CityHash64      1.05 GB/s    10       Pike & Alakuijala
> FNV             0.55 GB/s     5       Fowler, Noll, Vo
> CRC32           0.43 GB/s     9
> MD5-32          0.33 GB/s    10       Ronald L. Rivest
> SHA1-32         0.28 GB/s    10

Hi,

thanks for the link. I repeated the simple benchmark (code is here:
http://pastebin.com/e9BS03MA) with these results:
 FNV  duration    = 9.821 FNVa duration    = 9.753 jenkins duration = 37.658 crc32 duration   = 7.127 xxhash duration
=68.814
 

The only difference is that this time I added -O3 (which I forgot last
time). That's probably the reason why CRC32 even faster than FNV.

Anyway, xxhash seems to be much slower than the other functions, at
least in this particular case.

I don't think the poor results necessarily contradict the table of
results you've posted, because that's on "a large block of random data"
while I'm hashing very short amounts of data (4B / 8B). So my guess is
xxhash init takes much more time than the other algorithms. Chances are
it'd be faster on large amounts of data, but that's not really useful.

Maybe I'll revisit it in the future if I'll decide to add support for
varlena types. Until then I'll stick with crc32 which is fast and has
much better Quality score than FNV.

Tomas



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: plpgsql.print_strict_params
Next
From: Robert Haas
Date:
Subject: Re: [COMMITTERS] pgsql: Add DISCARD SEQUENCES command.