Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop - Mailing list pgsql-bugs

From Todd A. Cook
Subject Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop
Date
Msg-id 0f552dee-2b87-2a74-7862-7b0ff095fc82@blackducksoftware.com
Whole thread Raw
In response to Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-bugs
On 12/10/17 18:14, Tomas Vondra wrote:
> 
> I've done a simple experiment today - computing a hash for every uint32
> value, and checking how many distinct hash values that produces. For the
> 4.2 billion values the results look like this:
> 
>      second key        ndistinct     ndistinct/nkeys
>      3.380 i=100000000 nset=98829159 0.99
>      6.240 i=200000000 nset=195351181 0.98
>      9.106 i=300000000 nset=289623103 0.97
>      11.932 i=400000000 nset=381695003 0.95
>      14.814 i=500000000 nset=471621355 0.94
>      17.706 i=600000000 nset=559452278 0.93
>      20.577 i=700000000 nset=645242762 0.92
>      23.496 i=800000000 nset=729036217 0.91
>      26.460 i=900000000 nset=810879821 0.90
>      29.380 i=1000000000 nset=890818778 0.89
>      32.331 i=1100000000 nset=968898478 0.88
>      35.282 i=1200000000 nset=1045164189 0.87
>      38.262 i=1300000000 nset=1119654810 0.86
>      41.251 i=1400000000 nset=1192424766 0.85
>      44.258 i=1500000000 nset=1263482593 0.84
>      47.268 i=1600000000 nset=1332897449 0.83
>      50.305 i=1700000000 nset=1400717923 0.82
>      53.356 i=1800000000 nset=1466962823 0.81
>      56.425 i=1900000000 nset=1531660191 0.81
>      59.489 i=2000000000 nset=1594856205 0.80
>      62.593 i=2100000000 nset=1656588855 0.79
>      65.706 i=2200000000 nset=1716895530 0.78
>      68.829 i=2300000000 nset=1775809288 0.77
>      71.966 i=2400000000 nset=1833353377 0.76
>      75.123 i=2500000000 nset=1889558599 0.76
>      78.271 i=2600000000 nset=1944463039 0.75
>      81.418 i=2700000000 nset=1998096496 0.74
>      84.574 i=2800000000 nset=2050490745 0.73
>      87.711 i=2900000000 nset=2101666393 0.72
>      90.852 i=3000000000 nset=2151669155 0.72
>      93.986 i=3100000000 nset=2200517580 0.71
>      97.084 i=3200000000 nset=2248232772 0.70
>      100.172 i=3300000000 nset=2294849331 0.70
>      103.232 i=3400000000 nset=2340389131 0.69
>      106.273 i=3500000000 nset=2384875319 0.68
>      109.272 i=3600000000 nset=2428339639 0.67
>      112.260 i=3700000000 nset=2470798655 0.67
>      115.199 i=3800000000 nset=2512271730 0.66
>      118.125 i=3900000000 nset=2552788321 0.65
>      121.029 i=4000000000 nset=2592379529 0.65
>      123.895 i=4100000000 nset=2631056297 0.64
>      126.726 i=4200000000 nset=2668843329 0.64
>      129.397 loops 4294967295 set 2703915179 (0.63)
> 
> That means we only ever generate about 64% of the possible hash keys.
> That doesn't seem particularly healthy I guess, but for small hash
> tables (with fewer than 2^32 buckets) that may not be an issue, as some
> of the values will wrap because of the modulo.

FWIW, changing hashint8() to

    int64   val = PG_GETARG_INT64(0);

    if (val <= UINT32_MAX && val >= 0)
        return hash_uint32((uint32) val);
    else
        return hash_any((unsigned char *) &val, sizeof(val));

fixes the problem for me on all of my data sets.  My conjecture is that the
existing implementation

    uint32    lohalf = (uint32) val;
    uint32    hihalf = (uint32) (val >> 32);

    lohalf ^= (val >= 0) ? hihalf : ~hihalf;

biases lohalf when hashing collections of large-magnitude negative numbers
(like my data sets) because the individual bits of lohalf do not have equal
probability of being toggled by the exclusive-or.  For example, given a
value of -9223261146507558080 (0x8000770b48a68340), 15 of the 16 most
significant bits in lohalf will toggle, but only half of the least significant
will toggle.

-- todd


pgsql-bugs by date:

Previous
From: Andres Freund
Date:
Subject: Re: vacuum vs heap_update_tuple() and multixactids
Next
From: Robert Haas
Date:
Subject: Re: vacuum vs heap_update_tuple() and multixactids