Thread: custom hash-based COUNT(DISTINCT) aggregate - unexpectedly high memory consumption

Hi,

I've been playing with a custom COUNT(DISTINCT) aggregate based on hash
tables - it seems to be working as planned, except for memory
consumption which is much higher than I expected.

The code is available at https://github.com/tvondra/count_distinct and
in short it works like this:

1) There's a simple hash table for each group (so with HashAggregate it's
   effectively a hash table of hash tables). The hash table contains only
   the distinct values for that particular group, so the result is simply
   the number of items in the table.

2) The table starts very small (4 buckets), and when it grows too large
   (more than 20 items per bucket on average) it gets resized on the fly
   to 4x the size. So the number of buckets grows 4 -> 16 -> 64 -> 256 -> ...
   (and the max number of items in the table grows 80 -> 320 -> 1280 ...).
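
To make the growth rule concrete, it boils down to something like this
(an illustration only, with simplified names, not the actual extension code):

   /* illustration only, not the actual extension code: decide whether the
    * table needs to grow after an insert, and to how many buckets */
   static uint32
   next_nbuckets(uint32 nitems, uint32 nbuckets)
   {
       /* threshold: 20 items per bucket on average, i.e. 80, 320, 1280, ... */
       if (nitems > 20 * nbuckets)
           return nbuckets * 4;    /* 4 -> 16 -> 64 -> 256 -> ... */

       return 0;                   /* 0 means no resize needed */
   }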
 

Let's assume I have a table like this
  CREATE TABLE test_table (a int, b int);
  INSERT INTO test_table SELECT mod(i,4000), (100000*random())::int  FROM generate_series(1,80000000) s(i);

And let's run a query like this:
  SELECT a, count_distinct(b) FROM test_table GROUP BY a;

I.e. there are ~4k groups with ~20k distinct values for each group. This
query, however, consumes >5GB of memory, which is much more than I
expected, and I've spent a fair amount of time looking for memory leaks
in the code, so I'm wondering if I'm missing something important.

The custom hash tables are built from these very simple structures:
   typedef struct hash_element_t {
       uint32  hash;
       uint32  length;
       char   *value;          /* the actual value (say 4B for int4) */
   } hash_element_t;

   typedef struct hash_bucket_t {
       uint32  nitems;
       hash_element_t *items;  /* array of hash_element_t items */
   } hash_bucket_t;

   typedef struct hash_table_t {
       uint16  nbits;          /* number of significant bits of the hash */
       uint32  nbuckets;       /* number of buckets (2^nbits) */
       uint32  nitems;         /* current number of elements in the table */
       hash_bucket_t *buckets; /* array of hash_bucket_t */
   } hash_table_t;

I'm on 64-bit architecture and the example works with int32, which means
the sizes should be about this:
   hash_element_t => 20B
   hash_bucket_t  => 4B + (20B * items in the bucket [in steps of 5])
   hash_table_t   => 4B + space for buckets
 

In the example above, there's ~20k unique values in each group. The
threshold is 20 items per bucket on average, so that's 1024 buckets, and
the buckets are almost full.

So for single group, the hash table size is about
  4B + 1024 * (4B + 20 * 20B) = 413700B = ~ 400 kB

There are 4000 groups, so the total estimate is ~1.6GB in total.

However when executed (9.2, 9.3 and HEAD behave exactly the same), the
query consumes almost ~5GB of RAM (excluding shared buffers).

Here are the full memory context stats: http://pastebin.com/JHjnehvC

The interesting part is

ExecutorState: 24576 total in 2 blocks; 16368 free (6 chunks); 8208 used
  ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
  AggContext: 5106224640 total in 4612 blocks; 565779368 free
              (1072086 chunks); 4540445272 used
    TupleHashTable: 253952 total in 5 blocks; 55360 free
                    (16 chunks); 198592 used
  ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
  ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
 

So there's ~5GB in the AggContext, ~500MB of that is free.

I'm aware that the HashAggregate itself will allocate some more memory,
and although I'm not sure how much that should be, I would expect a
"small amount" compared to the 1.6GB.

So I'm wondering what uses those 3.5GB? Is this the overhead of
HashAggregate, or am I missing something (e.g. a memory leak in the code)?


Tomas



Re: custom hash-based COUNT(DISTINCT) aggregate - unexpectedly high memory consumption

From: Tomáš Janoušek

Hi,

On Sat, Oct 05, 2013 at 08:22:54PM +0200, Tomas Vondra wrote:
> I'm on 64-bit architecture and the example works with int32, which means
> the sizes should be about this:
>
>     hash_element_t => 20B
>     hash_bucket_t  => 4B + (20B * items in the bucket [in steps of 5])
>     hash_table_t   => 4B + space for buckets
>
> In the example above, there's ~20k unique values in each group. The
> threshold is 20 items per bucket on average, so that's 1024 buckets, and
> the buckets are almost full.
>
> So for single group, the hash table size is about
>
>    4B + 1024 * (4B + 20 * 20B) = 413700B = ~ 400 kB
>
> There are 4000 groups, so the total estimate is ~1.6GB in total.
>
> However when executed (9.2, 9.3 and HEAD behave exactly the same), the
> query consumes almost ~5GB of RAM (excluding shared buffers).

I think the missing thing is the memory allocator bookkeeping overhead.
You're assuming that hash_element_t.value takes 8B for the pointer and 4B for
the value itself, but using malloc it takes at least another 20 bytes, and
from a quick glance at backend/utils/mmgr/aset.c it seems that palloc is
certainly not without overhead either.
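
Roughly, and this is just my reading of aset.c (the exact numbers depend
on version and build options), a small palloc behaves like:

    /* rough cost model of a small palloc() chunk, as I read aset.c
     * (approximate, and version/build dependent) */
    static size_t
    palloc_chunk_footprint(size_t request)
    {
        size_t  size = 8;           /* minimum chunk size */

        while (size < request)      /* small requests rounded up to a power of 2 */
            size <<= 1;

        return size + 16;           /* plus roughly 16 bytes of chunk header */
    }

So a separately palloc'd 4-byte int4 costs something like 24 bytes, not 4,
before you even count the 8-byte pointer stored in hash_element_t.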

Also, each additional level of pointers adds execution overhead and increases
the likelihood of cache misses.  I'd suggest a few improvements, if I may:

1. Drop hash_element_t altogether, store the length in hash_bucket_t and
   alloc hash_bucket_t.items of size nitems * length bytes (see the sketch
   after this list). I doubt that storing the hash values has a benefit
   worth the storage and code complexity overhead (you're storing
   fixed-size ints, not large blobs that are expensive to compare and hash).

2. Consider using a simpler/faster hash function, like FNV[1] or Jenkins[2].
   For fun, try not hashing those ints at all and see how that performs
   (that, I think, is what you get from HashSet<int> in Java/C#).

3. Consider dropping buckets in favor of open addressing (linear probing,
   quadratic, whatever). This avoids another level of pointer indirection.
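
To make (1) concrete, the layout could look roughly like this (just a
sketch, the names are made up, not taken from your extension):

   /* rough sketch of the flattened layout from (1) */
   typedef struct flat_bucket_t {
       uint32  nitems;
       char   *items;              /* nitems values of item_len bytes, back to back */
   } flat_bucket_t;

   typedef struct flat_table_t {
       uint16          nbits;      /* number of significant bits of the hash */
       uint32          nbuckets;   /* number of buckets (2^nbits) */
       uint32          nitems;     /* current number of elements */
       uint32          item_len;   /* fixed value length, e.g. 4 for int4 */
       flat_bucket_t  *buckets;
   } flat_table_t;

   /* the k-th value in a bucket is then bucket->items + k * table->item_len,
    * i.e. one allocation per bucket instead of one per value */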

It's been a few years since I've done stuff this low level, so I won't go into
suggesting a different data structure -- I have honestly no idea what's the
best way to count the number of distinct integers in a list.

[1] http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
[2] http://en.wikipedia.org/wiki/Jenkins_hash_function

Best regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/



On 6.10.2013 20:37, Tomáš Janoušek wrote:
> Hi,
> 
> On Sat, Oct 05, 2013 at 08:22:54PM +0200, Tomas Vondra wrote:
>> I'm on 64-bit architecture and the example works with int32, which means
>> the sizes should be about this:
>>
>>     hash_element_t => 20B
>>     hash_bucket_t  => 4B + (20B * items in the bucket [in steps of 5])
>>     hash_table_t   => 4B + space for buckets
>>
>> In the example above, there's ~20k unique values in each group. The
>> threshold is 20 items per bucket on average, so that's 1024 buckets, and
>> the buckets are almost full.
>>
>> So for single group, the hash table size is about
>>
>>    4B + 1024 * (4B + 20 * 20B) = 413700B = ~ 400 kB
>>
>> There are 4000 groups, so the total estimate is ~1.6GB in total.
>>
>> However when executed (9.2, 9.3 and HEAD behave exactly the same), the
>> query consumes almost ~5GB of RAM (excluding shared buffers).
> 
> I think the missing thing is the memory allocator bookkeeping overhead.
> You're assuming that hash_element_t.value takes 8B for the pointer and 4B for
> the value itself, but using malloc it takes another at least 20 bytes, and
> from a quick glance at backend/utils/mmgr/aset.c it seems that palloc is
> certainly not without its overhead either.
> 
> Also, each additional level of pointers adds execution overhead and increases
> the likelihood of cache misses.  I'd suggest a few improvements, if I may:
> 
> 1. Drop hash_element_t altogether, store length in hash_bucket_t and alloc
>    hash_bucket_t.items of size nitems * length bytes.  I doubt that storing
>    the hash values has a benefit worth the storage and code complexity
>    overhead (you're storing fixed-size ints, not large blobs that are
>    expensive to compare and hash).

Good idea - I'll move the length to the hash table.

You're right that keeping the hash for int32/64 values is probably
useless, however I planned to eventually extend the code to support
larger values (possibly even varlena types, i.e. text/bytea). So I'll
keep the hashes for now and note this as a possible optimization.

> 2. Consider using a simpler/faster hash function, like FNV[1] or Jenkins[2].
>    For fun, try not hashing those ints at all and see how that performs (that,
>    I think, is what you get from HashSet<int> in Java/C#).

I've used crc32 mostly because it's easily available in the code (i.e.
I'm lazy), but I've done some quick research / primitive benchmarking
too. For example hashing 2e9 integers takes this much time:

FNV-1   = 11.9
FNV-1a  = 11.9
jenkins = 38.8
crc32   = 32.0

So it's not really "slow" and the uniformity seems to be rather good.

I'll try FNV in the future, however I don't think that's the main issue
right now.
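
For the record, a 32-bit FNV-1a over a 4-byte key is only a handful of
lines anyway; the constants below are the standard FNV offset basis and
prime, and the function name is just for illustration:

   static uint32
   fnv1a_uint32(uint32 key)
   {
       uint32  hash = 2166136261u;             /* FNV-1a 32-bit offset basis */
       int     i;

       for (i = 0; i < 4; i++)
       {
           hash ^= (key >> (8 * i)) & 0xff;    /* mix in one byte */
           hash *= 16777619u;                  /* FNV 32-bit prime */
       }

       return hash;
   }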

> 3. Consider dropping buckets in favor of open addressing (linear probing,
>    quadratic, whatever).  This avoids another level of pointer indirection.

OK, this sounds really interesting. It should be fairly straightforward
for fixed-length data types (in that case I can get rid of the pointers
altogether).
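
Something like this is what I imagine for the int32 case (just a sketch
with made-up names; it assumes the caller grows the table well before it
fills up):

   typedef struct oa_table_t {
       uint32  nbuckets;           /* power of two */
       uint32  nitems;
       bool   *used;               /* slot occupied? */
       int32  *values;             /* one value per slot, no per-value pointers */
   } oa_table_t;

   /* returns true if the value was not in the table yet */
   static bool
   oa_insert(oa_table_t *t, int32 value, uint32 hash)
   {
       uint32  idx = hash & (t->nbuckets - 1);

       while (t->used[idx])
       {
           if (t->values[idx] == value)
               return false;                       /* already counted */
           idx = (idx + 1) & (t->nbuckets - 1);    /* linear probing */
       }

       t->used[idx] = true;
       t->values[idx] = value;
       t->nitems++;
       return true;
   }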

> It's been a few years since I've done stuff this low level, so I won't go into
> suggesting a different data structure -- I have honestly no idea what's the
> best way to count the number of distinct integers in a list.

Thanks for the hints!

Tomas



On Mon, Oct 07, 2013 at 12:41:58AM +0200, Tomas Vondra wrote:
> > 2. Consider using a simpler/faster hash function, like FNV[1] or Jenkins[2].
> >    For fun, try not hashing those ints at all and see how that performs (that,
> >    I think, is what you get from HashSet<int> in Java/C#).
> 
> I've used crc32 mostly because it's easily available in the code (i.e.
> I'm lazy), but I've done some quick research / primitive benchmarking
> too. For example hashing 2e9 integers takes this much time:
> 
> FNV-1   = 11.9
> FNV-1a  = 11.9
> jenkins = 38.8
> crc32   = 32.0
> 
> So it's not really "slow" and the uniformity seems to be rather good.
> 
> I'll try FNV in the future, however I don't think that's the main issue
> right now.
> 
Hi Tomas,

If you are going to use a function that is not currently in the code,
please consider xxhash:

http://code.google.com/p/xxhash/

Here are some benchmarks for some of the faster hash functions:

Name            Speed       Q.Score   Author
xxHash          5.4 GB/s     10
MurmurHash 3a   2.7 GB/s     10       Austin Appleby
SpookyHash      2.0 GB/s     10       Bob Jenkins
SBox            1.4 GB/s      9       Bret Mulvey
Lookup3         1.2 GB/s      9       Bob Jenkins
CityHash64      1.05 GB/s    10       Pike & Alakuijala
FNV             0.55 GB/s     5       Fowler, Noll, Vo
CRC32           0.43 GB/s     9
MD5-32          0.33 GB/s    10       Ronald L. Rivest
SHA1-32         0.28 GB/s    10

Regards,
Ken




Sent from my iPad

> On 07-Oct-2013, at 4:11, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>> On 6.10.2013 20:37, Tomáš Janoušek wrote:
>> Hi,
>>
>>> On Sat, Oct 05, 2013 at 08:22:54PM +0200, Tomas Vondra wrote:
>>> I'm on 64-bit architecture and the example works with int32, which means
>>> the sizes should be about this:
>>>
>>>    hash_element_t => 20B
>>>    hash_bucket_t  => 4B + (20B * items in the bucket [in steps of 5])
>>>    hash_table_t   => 4B + space for buckets
>>>
>>> In the example above, there's ~20k unique values in each group. The
>>> threshold is 20 items per bucket on average, so that's 1024 buckets, and
>>> the buckets are almost full.
>>>
>>> So for single group, the hash table size is about
>>>
>>>   4B + 1024 * (4B + 20 * 20B) = 413700B = ~ 400 kB
>>>
>>> There are 4000 groups, so the total estimate is ~1.6GB in total.
>>>
>>> However when executed (9.2, 9.3 and HEAD behave exactly the same), the
>>> query consumes almost ~5GB of RAM (excluding shared buffers).
>>
>> I think the missing thing is the memory allocator bookkeeping overhead.
>> You're assuming that hash_element_t.value takes 8B for the pointer and 4B for
>> the value itself, but using malloc it takes another at least 20 bytes, and
>> from a quick glance at backend/utils/mmgr/aset.c it seems that palloc is
>> certainly not without its overhead either.
>>
>> Also, each additional level of pointers adds execution overhead and increases
>> the likelihood of cache misses.  I'd suggest a few improvements, if I may:
>>
>> 1. Drop hash_element_t altogether, store length in hash_bucket_t and alloc
>>   hash_bucket_t.items of size nitems * length bytes.  I doubt that storing
>>   the hash values has a benefit worth the storage and code complexity
>>   overhead (you're storing fixed-size ints, not large blobs that are
>>   expensive to compare and hash).
>
> Good idea - I'll move the length to the hash table.
>
> You're right that keeping the hash for int32/64 values is probably
> useless, however I planned to eventually extend the code to support
> larger values (possibly even varlena types, i.e. text/bytea). So I'll
> keep this for now, but I'll keep this as a possible optimization.
>
>> 2. Consider using a simpler/faster hash function, like FNV[1] or Jenkins[2].
>>   For fun, try not hashing those ints at all and see how that performs (that,
>>   I think, is what you get from HashSet<int> in Java/C#).
>
> I've used crc32 mostly because it's easily available in the code (i.e.
> I'm lazy), but I've done some quick research / primitive benchmarking
> too. For example hashing 2e9 integers takes this much time:
>
> FNV-1   = 11.9
> FNV-1a  = 11.9
> jenkins = 38.8
> crc32   = 32.0
>
> So it's not really "slow" and the uniformity seems to be rather good.
>
> I'll try FNV in the future, however I don't think that's the main issue
> right now.
>
>> 3. Consider dropping buckets in favor of open addressing (linear probing,
>>   quadratic, whatever).  This avoids another level of pointer indirection.
>
> OK, this sounds really interesting. It should be fairly straightforward
> for fixed-length data types (in that case I can get rid of the pointers
> altogether).
>
Consider the aspects associated with open addressing. Open addressing can
quickly lead to growth in the main table. Also, chaining is a much cleaner
way of collision resolution, IMHO.

>> It's been a few years since I've done stuff this low level, so I won't go into
>> suggesting a different data structure -- I have honestly no idea what's the
>> best way to count the number of distinct integers in a list.
>
> Thanks for the hints!
>
> Tomas
>
>



On 7.10.2013 14:50, ktm@rice.edu wrote:
> On Mon, Oct 07, 2013 at 12:41:58AM +0200, Tomas Vondra wrote:
>>> 2. Consider using a simpler/faster hash function, like FNV[1] or Jenkins[2].
>>>    For fun, try not hashing those ints at all and see how that performs (that,
>>>    I think, is what you get from HashSet<int> in Java/C#).
>>
>> I've used crc32 mostly because it's easily available in the code (i.e.
>> I'm lazy), but I've done some quick research / primitive benchmarking
>> too. For example hashing 2e9 integers takes this much time:
>>
>> FNV-1   = 11.9
>> FNV-1a  = 11.9
>> jenkins = 38.8
>> crc32   = 32.0
>>
>> So it's not really "slow" and the uniformity seems to be rather good.
>>
>> I'll try FNV in the future, however I don't think that's the main issue
>> right now.
>>
> Hi Tomas,
> 
> If you are going to use a function that is not currently in the code,
> please consider xxhash:
> 
> http://code.google.com/p/xxhash/
> 
> Here are some benchmarks for some of the faster hash functions:
> 
> Name            Speed       Q.Score   Author
> xxHash          5.4 GB/s     10
> MurmurHash 3a   2.7 GB/s     10       Austin Appleby
> SpookyHash      2.0 GB/s     10       Bob Jenkins
> SBox            1.4 GB/s      9       Bret Mulvey
> Lookup3         1.2 GB/s      9       Bob Jenkins
> CityHash64      1.05 GB/s    10       Pike & Alakuijala
> FNV             0.55 GB/s     5       Fowler, Noll, Vo
> CRC32           0.43 GB/s     9
> MD5-32          0.33 GB/s    10       Ronald L. Rivest
> SHA1-32         0.28 GB/s    10

Hi,

thanks for the link. I repeated the simple benchmark (code is here:
http://pastebin.com/e9BS03MA) with these results:
  FNV  duration    = 9.821
  FNVa duration    = 9.753
  jenkins duration = 37.658
  crc32 duration   = 7.127
  xxhash duration  = 68.814
 

The only difference is that this time I added -O3 (which I forgot last
time). That's probably why CRC32 is now even faster than FNV.

Anyway, xxhash seems to be much slower than the other functions, at
least in this particular case.

I don't think the poor results necessarily contradict the table of
results you've posted, because that's on "a large block of random data"
while I'm hashing very short amounts of data (4B / 8B). So my guess is
xxhash init takes much more time than the other algorithms. Chances are
it'd be faster on large amounts of data, but that's not really useful.

Maybe I'll revisit it in the future if I decide to add support for
varlena types. Until then I'll stick with crc32, which is fast and has a
much better quality score than FNV.

Tomas



Hi Atri!

On 7.10.2013 16:56, Atri Sharma wrote:
>>> 3. Consider dropping buckets in favor of open addressing (linear
>>> probing, quadratic, whatever).  This avoids another level of
>>> pointer indirection.
>> 
>> OK, this sounds really interesting. It should be fairly
>> straightforward for fixed-length data types (in that case I can get
>> rid of the pointers altogether).
>> 
> Consider the aspects associated with open addressing. Open addressing
> can quickly lead to growth in the main table. Also, chaining is a much
> cleaner way of collision resolution, IMHO.

What do you mean by "growth in the main table"?

Tomas



Hi Tomas,


>> Consider the aspects associated with open addressing. Open addressing
>> can quickly lead to growth in the main table. Also, chaining is a much
>> cleaner way of collision resolution, IMHO.
>
> What do you mean by "growth in the main table"?

Sorry, I should have been more verbose.

AFAIK, open addressing can be slower with a load factor approaching 1
as compared to chaining. Also, I feel that the implementation of open
addressing can be more complicated, as we have to deal with deletes
etc.

I feel we can redesign our current chaining mechanism to have skip
lists instead of singly linked lists. I experimented with it sometime
back and I feel that it gives a stable performance with higher loads.

Regards,

Atri



-- 
Regards,

Atri
l'apprenant



On Tue, Oct 8, 2013 at 1:23 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>>> Consider the aspects associated with open addressing. Open addressing
>>> can quickly lead to growth in the main table. Also, chaining is a much
>>> cleaner way of collision resolution, IMHO.
>>
>> What do you mean by "growth in the main table"?
>
> Sorry, I should have been more verbose.
>
> AFAIK, Open addressing can be slower with a load factor approaching 1
> as compared to chaining. Also, I feel that implementation of open
> addressing can be more complicated as we have to deal with deletes
> etc.


Deletes for a hash aggregate?




Sent from my iPad

> On 08-Oct-2013, at 10:41, Claudio Freire <klaussfreire@gmail.com> wrote:
>
> On Tue, Oct 8, 2013 at 1:23 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>>>> Consider the aspects associated with open addressing. Open addressing
>>>> can quickly lead to growth in the main table. Also, chaining is a much
>>>> cleaner way of collision resolution, IMHO.
>>>
>>> What do you mean by "growth in the main table"?
>>
>> Sorry, I should have been more verbose.
>>
>> AFAIK, Open addressing can be slower with a load factor approaching 1
>> as compared to chaining. Also, I feel that implementation of open
>> addressing can be more complicated as we have to deal with deletes
>> etc.
>
>
> Deletes for a hash aggregate?

Yeah, that doesn't apply here. I was just listing out the demerits of open addressing :)

My point is, it is not wise to unnecessarily complicate matters by shifting
to open addressing. If we want, we could look at changing the data structure
used for chaining, but chaining is better for our requirements IMHO.

Regards,

Atri


On 8 October 2013, 6:23, Atri Sharma wrote:
> Hi Tomas,
>
>
>>> Consider the aspects associated with open addressing. Open addressing
>>> can quickly lead to growth in the main table. Also, chaining is a much
>>> cleaner way of collision resolution, IMHO.
>>
>> What do you mean by "growth in the main table"?
>
> Sorry, I should have been more verbose.
>
> AFAIK, Open addressing can be slower with a load factor approaching 1
> as compared to chaining. Also, I feel that implementation of open
> addressing can be more complicated as we have to deal with deletes
> etc.
>
> I feel we can redesign our current chaining mechanism to have skip
> lists instead of singly linked lists. I experimented with it sometime
> back and I feel that it gives a stable performance with higher loads.
>
> Regards,
>
> Atri

OK, thanks for the explanation.

I've spent some time hacking on this yesterday; the results are available
in a separate branch:
 https://github.com/tvondra/count_distinct/tree/open-addressing

The complexity of the implementation is pretty much comparable to
chaining. I assume it would get much more complex with handling deletes
etc., but that's not really necessary here.

However I'm unable to make it at least comparable to chaining. The trick
is that the hash table performs reasonably only until it's ~65-70% full;
it gets really slow after that. So to maintain performance comparable to
chaining, I'd have to keep the table below this threshold, effectively
wasting ~30% of memory.
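
(Rough numbers, assuming power-of-two sizing: ~20k distinct values per
group at a 70% load factor needs at least ~28.6k slots, so the table jumps
to 32768 slots and sits at ~61% full, i.e. roughly 12-13k empty slots per
group.)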

And the topic of this thread was decreasing memory consumption, so it
seems to me open addressing is not a good match here. I'll try a few
more things but I don't think it's going to fly.

I've made some significant improvements in the chaining version (in the
master branch), now getting about the memory consumption I've estimated.

Tomas





Sent from my iPad

>
> However I'm unable to make it at least comparable to chaining. The trick
> is that the hash table performs reasonably only until it's ~ 65-70% full,
> it gets really slow after that. So to maintain performance comparable to
> chaining, I'd have to keep the table below this threshold, effectively
> wasting ~30% of memory.
I expected that. AFAIK, open addressing gets slow when the load factor approaches 1. I feel this is what is happening
here.
>
> And the topic of this thread was about decreasing the memory consumptions,
> so it seems to me open-addressing is not a good match here. I'll try a few
> more things but I don't think it's going to fly.
>
Yeah, I also feel that open addressing isn't the way to go for the problem here.

> I've made some significant improvements in the chaining version (in the
> master branch), now getting about the memory consumption I've estimated.
>
I agree, we can hope to reduce the memory consumption by making changes in
the current chaining implementation. I would like to look into changing the
data structure used for chaining from a singly linked list to maybe a skip
list or something else.

Regards,

Atri


On 8 October 2013, 11:42, Atri Sharma wrote:
>>
>> I've made some significant improvements in the chaining version (in the
>> master branch), now getting about the memory consumption I've estimated.
>>
> I agree, we can hope to reduce the memory consumption by making changes in
> the current chaining implementation. I would like to look into changing
> the data structure used for chaining from singly linked list to maybe skip
> list or something else.

Just to be sure - I haven't been messing with the HashAggregate
implementation directly, but with a custom aggregate. But feel free to
tweak the built-in hash table ;-)

Tomas




On Tue, Oct 8, 2013 at 4:15 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 8 October 2013, 11:42, Atri Sharma wrote:
>>>
>>> I've made some significant improvements in the chaining version (in the
>>> master branch), now getting about the memory consumption I've estimated.
>>>
>> I agree, we can hope to reduce the memory consumption by making changes in
>> the current chaining implementation. I would like to look into changing
>> the data structure used for chaining from singly linked list to maybe skip
>> list or something else.
>
> Just to be sure - I haven't been messing with the HashAggregate
> implementation directly, but with a custom aggregate. But feel free to
> tweak the built-in hash table ;-)
>
> Tomas
>

Heh.

Do you mind if I try it out on the custom agg you built? I assume it
is on the github repo link you shared?

Regards,

Atri


--
Regards,

Atri
l'apprenant



On 8 October 2013, 13:52, Atri Sharma wrote:
> On Tue, Oct 8, 2013 at 4:15 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 8 October 2013, 11:42, Atri Sharma wrote:
>>>>
>>>> I've made some significant improvements in the chaining version (in
>>>> the
>>>> master branch), now getting about the memory consumption I've
>>>> estimated.
>>>>
>>> I agree, we can hope to reduce the memory consumption by making changes
>>> in
>>> the current chaining implementation. I would like to look into changing
>>> the data structure used for chaining from singly linked list to maybe
>>> skip
>>> list or something else.
>>
>> Just to be sure - I haven't been messing with the HashAggregate
>> implementation directly, but with a custom aggregate. But feel free to
>> tweak the built-in hash table ;-)
>>
>> Tomas
>>
>
> Heh.
>
> Do you mind if I try it out on the custom agg you built? I assume it
> is on the github repo link you shared?

Not at all, that's why I pushed it into a public repo. The "master"
branch contains the regular chained hash table; the open addressing is in
a separate branch (also in the repo).

Tomas




gettimeofday(&start, NULL);
    for (i = 0; i < VALUES; i++) {
        state = XXH32_init(result);
        XXH32_update(state, &i, 4);
        XXH32_digest(state);
    }
gettimeofday(&end, NULL);


This code is using the "update" variant, which is only useful when dealing
with very large amounts of data that can't fit into a single block of
memory. This is obviously overkill for a 4-bytes-only test: 3 function
calls, a malloc, intermediate data bookkeeping, etc.

To hash a single block of data, it's better to use the simpler (and faster)
variant XXH32() :

gettimeofday(&start, NULL);
    for (i = 0; i < VALUES; i++) { XXH32(&i, 4, result); }
gettimeofday(&end, NULL);

You'll probably get better results by an order of magnitude. For better
results still, you could even inline it (yes, for such a short loop with
almost no work to do, it makes a very noticeable difference).


That being said, it's true that these advanced hash algorithms only shine
with a "big enough" amount of data to hash. Hashing a 4-byte value into a
4-byte hash is a bit of a limited exercise; there is no "pigeon hole" issue.
A simple multiplication by a 32-bit prime would fare well enough and result
in zero collisions.
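
For example, something like the classic multiplicative hash (2654435761 is
the well-known prime close to 2^32 / golden ratio; multiplication by an odd
constant is a bijection on 32-bit values, hence no collisions; this is just
an illustration, not code from any particular library):

    static uint32_t
    mult_hash32(uint32_t key)
    {
        return key * 2654435761u;   /* Knuth's multiplicative constant */
    }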







On 11.10.2013 13:42, Huchev wrote:
> 
> gettimeofday(&start, NULL);
>     for (i = 0; i < VALUES; i++) {
>         state = XXH32_init(result);
>         XXH32_update(state, &i, 4);
>         XXH32_digest(state);
>     }
> gettimeofday(&end, NULL);
> 
> 
> This code is using the "update" variant, which is only useful when dealing
> with very large amount of data which can't fit into a single block of
> memory. This is obviously overkill for a 4-bytes-only test. 3 functions
> calls, a malloc, intermediate data book keeping, etc.
> 
> To hash a single block of data, it's better to use the simpler (and faster)
> variant XXH32() :
> 
> gettimeofday(&start, NULL);
>     for (i = 0; i < VALUES; i++) { XXH32(&i, 4, result); }
> gettimeofday(&end, NULL);
> 
> You'll probably get better results by an order of magnitude. For better
> results, you could even inline it (yes, for such short loop with almost no
> work to do, it makes a very sensible difference).

Not really. Even with this change it's slightly slower than crc32, at
least with the 32-bit integers. With 64-bit integers it's about 2x as
fast. But even then it's like ~1% of the total runtime, so any
improvements here are not really changing anything.

The inlining is not a good idea IMHO, because that'd be very different
from the actual usage (there won't be such a tight loop). OTOH I'm not
sure the compiler doesn't already inline it as an optimization.

> That being said, it's true that these advanced hash algorithms only
> shine with "big enough" amount of data to hash. Hashing a 4-bytes
> value into a 4-bytes hash is a bit limited exercise. There is no
> "pigeon hole" issue. A simple multiplication by a 32-bits prime would
> fare good enough and result in zero collision.

Agreed. I'll revisit this if/when I need to support larger data types
in this aggregate.

Tomas