Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop
Date
Msg-id 0e91ba4e-5afc-220e-b1f7-56a838aee845@2ndquadrant.com
Whole thread Raw
In response to Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-bugs

On 12/06/2017 11:55 PM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2017-12-06 16:38:18 -0500, Todd A. Cook wrote:
>>> I found this problem when I dropped 10.1 into a test environment to see
>>> what would happen.  There was no deliberate attempt to break anything.
> 
>> Read Thomas' message at:
http://archives.postgresql.org/message-id/263b03b1-3e1c-49ca-165a-8ac6751419c4%402ndquadrant.com
> 
> I'm confused by Tomas' claim that
> 
>>> (essentially hashint8 only ever produces 60% of
>>> values from [0,1000000], which likely increases collision rate).
> 
> This is directly contradicted by the simple experiments I've done, eg
> 
> regression=# select count(distinct hashint8(v)) from generate_series(0,1000000::int8) v;
>  count  
> --------
>  999879
> (1 row)
> 
> regression=# select count(distinct hashint8(v) & (1024*1024-1)) from generate_series(0,1000000::int8) v;
>  count  
> --------
>  644157
> (1 row)
> 
> regression=# select count(distinct hashint8(v) & (1024*1024-1)) from generate_series(0,10000000::int8) v;
>   count  
> ---------
>  1048514
> (1 row)
> 
> It's certainly not perfect, but I'm not observing any major failure to
> span the output space.
> 

That is not the claim I was making, though. When generating data sets
I've been considering only values where hashint8(v) is between 0 and
1000000 *without* the masking. Otherwise growing the hash table would
break the "chain" of values in the hash table, which would resolve the
issue with SH_GROW_MAX_MOVE.

So you'd have to do something like this:

    select count(distinct hashint8(v))
      from generate_series(0,1000000::int8) v
     where hashint8(v) between 0 and 1024*1024;

but to get some numbers you'd have to also increase the value passed to
generate_series (because it'll filter most of the values).

Which is why I generated the data by an ugly C program, similar to the
attached one. It keeps a small bitmap for generated hashes in the [0,1M]
range, and prints the number of bits set after every 1e9 values processed.

What I do see is this:

    i=1000000000 nset=217666
    i=2000000000 nset=389526
    i=3000000000 nset=525135
    i=4000000000 nset=632305
    i=5000000000 nset=659574
    i=6000000000 nset=659574
    i=7000000000 nset=659574
    i=8000000000 nset=659574
    i=9000000000 nset=659574
    ...

So somewhere between 4e9 and 5e9 we generate 659574 hashes in the range,
and then never increase it further.

I don't know if this an issue in practice, but it seems that for large
hash tables (on bigint), growing the hash table may not be that great
improvement because we're not really doubling the number of slots we can
hit directly. We'll probably find an empty slot nearby, though.

It was just an observation - I only noticed that because I tried to
construct the adversarial data set on bigint, and couldn't make it work
no matter what.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Next
From: Jan Schulz
Date:
Subject: Re: BUG #14948: cost overflow