Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Date
Msg-id 29a04fd9-abe6-7c5f-dadc-a80459890f53@2ndquadrant.com
In response to Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop  ("Todd A. Cook" <tcook@blackducksoftware.com>)
List pgsql-bugs

On 12/07/2017 12:45 AM, Tomas Vondra wrote:
> 
> 
> On 12/06/2017 11:55 PM, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2017-12-06 16:38:18 -0500, Todd A. Cook wrote:
>>>> I found this problem when I dropped 10.1 into a test environment to see
>>>> what would happen.  There was no deliberate attempt to break anything.
>>
>>> Read Thomas' message at:
http://archives.postgresql.org/message-id/263b03b1-3e1c-49ca-165a-8ac6751419c4%402ndquadrant.com
>>
>> I'm confused by Tomas' claim that
>>
>>>> (essentially hashint8 only ever produces 60% of
>>>> values from [0,1000000], which likely increases collision rate).
>>
>> This is directly contradicted by the simple experiments I've done, eg
>>
>> regression=# select count(distinct hashint8(v)) from generate_series(0,1000000::int8) v;
>>  count  
>> --------
>>  999879
>> (1 row)
>>
>> regression=# select count(distinct hashint8(v) & (1024*1024-1)) from generate_series(0,1000000::int8) v;
>>  count  
>> --------
>>  644157
>> (1 row)
>>
>> regression=# select count(distinct hashint8(v) & (1024*1024-1)) from generate_series(0,10000000::int8) v;
>>   count  
>> ---------
>>  1048514
>> (1 row)
>>
>> It's certainly not perfect, but I'm not observing any major failure to
>> span the output space.
>>
> 
> That is not the claim I was making, though. When generating data sets
> I've been considering only values where hashint8(v) is between 0 and
> 1000000 *without* the masking. Otherwise growing the hash table would
> break the "chain" of values in the hash table, which would resolve the
> issue with SH_GROW_MAX_MOVE.
> 
> So you'd have to do something like this:
> 
>     select count(distinct hashint8(v))
>       from generate_series(0,1000000::int8) v
>      where hashint8(v) between 0 and 1024*1024;
> 
> but to get some numbers you'd have to also increase the value passed to
> generate_series (because it'll filter most of the values).
> 
> Which is why I generated the data by an ugly C program, similar to the
> attached one. It keeps a small bitmap for generated hashes in the [0,1M]
> range, and prints the number of bits set after every 1e9 values processed.
> 
> What I do see is this:
> 
>     i=1000000000 nset=217666
>     i=2000000000 nset=389526
>     i=3000000000 nset=525135
>     i=4000000000 nset=632305
>     i=5000000000 nset=659574
>     i=6000000000 nset=659574
>     i=7000000000 nset=659574
>     i=8000000000 nset=659574
>     i=9000000000 nset=659574
>     ...
> 
> So somewhere between 4e9 and 5e9 we generate 659574 hashes in the range,
> and then never increase it further.
> 
> I don't know if this is an issue in practice, but it seems that for
> large hash tables (on bigint), growing the hash table may not be that
> great an improvement, because we're not really doubling the number of
> slots we can hit directly. We'll probably find an empty slot nearby,
> though.
> 
> It was just an observation - I only noticed that because I tried to
> construct the adversarial data set on bigint, and couldn't make it work
> no matter what.
> 

I've done a simple experiment today - computing a hash for every uint32
value, and checking how many distinct hash values that produces. For the
4.2 billion values the results look like this:

    seconds  keys          ndistinct           ndistinct/nkeys
    3.380 i=100000000 nset=98829159 0.99
    6.240 i=200000000 nset=195351181 0.98
    9.106 i=300000000 nset=289623103 0.97
    11.932 i=400000000 nset=381695003 0.95
    14.814 i=500000000 nset=471621355 0.94
    17.706 i=600000000 nset=559452278 0.93
    20.577 i=700000000 nset=645242762 0.92
    23.496 i=800000000 nset=729036217 0.91
    26.460 i=900000000 nset=810879821 0.90
    29.380 i=1000000000 nset=890818778 0.89
    32.331 i=1100000000 nset=968898478 0.88
    35.282 i=1200000000 nset=1045164189 0.87
    38.262 i=1300000000 nset=1119654810 0.86
    41.251 i=1400000000 nset=1192424766 0.85
    44.258 i=1500000000 nset=1263482593 0.84
    47.268 i=1600000000 nset=1332897449 0.83
    50.305 i=1700000000 nset=1400717923 0.82
    53.356 i=1800000000 nset=1466962823 0.81
    56.425 i=1900000000 nset=1531660191 0.81
    59.489 i=2000000000 nset=1594856205 0.80
    62.593 i=2100000000 nset=1656588855 0.79
    65.706 i=2200000000 nset=1716895530 0.78
    68.829 i=2300000000 nset=1775809288 0.77
    71.966 i=2400000000 nset=1833353377 0.76
    75.123 i=2500000000 nset=1889558599 0.76
    78.271 i=2600000000 nset=1944463039 0.75
    81.418 i=2700000000 nset=1998096496 0.74
    84.574 i=2800000000 nset=2050490745 0.73
    87.711 i=2900000000 nset=2101666393 0.72
    90.852 i=3000000000 nset=2151669155 0.72
    93.986 i=3100000000 nset=2200517580 0.71
    97.084 i=3200000000 nset=2248232772 0.70
    100.172 i=3300000000 nset=2294849331 0.70
    103.232 i=3400000000 nset=2340389131 0.69
    106.273 i=3500000000 nset=2384875319 0.68
    109.272 i=3600000000 nset=2428339639 0.67
    112.260 i=3700000000 nset=2470798655 0.67
    115.199 i=3800000000 nset=2512271730 0.66
    118.125 i=3900000000 nset=2552788321 0.65
    121.029 i=4000000000 nset=2592379529 0.65
    123.895 i=4100000000 nset=2631056297 0.64
    126.726 i=4200000000 nset=2668843329 0.64
    129.397 loops 4294967295 set 2703915179 (0.63)

That means we only ever generate about 64% of the possible hash keys.
That doesn't seem particularly healthy, I guess, but for small hash
tables (with fewer than 2^32 buckets) it may not be an issue, as
multiple distinct hash values get mapped onto the same bucket by the
modulo anyway.

For comparison, murmurhash3 and xxhash (both fairly popular hash
functions) end up with a 1:1 ratio of values to hashes, i.e. without
any collisions at all. Of course, both are somewhat slower than the
simple hash functions we use for int/bigint values.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

