Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Date
Msg-id 263b03b1-3e1c-49ca-165a-8ac6751419c4@2ndquadrant.com
In response to Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop  (Andres Freund <andres@anarazel.de>)
List pgsql-bugs

On 12/05/2017 11:01 PM, Tomas Vondra wrote:
> 
> Based on the initial discussion, the problem here is twofold.
> 
> Firstly, the code assumes that if it gets certain number of bucket
> collisions (essentially, the initial bucket and certain number of
> neighboring buckets already being full), making the table larger is
> guaranteed to reduce the number of collisions.
> 
> Which is obviously untrue for duplicate hash values, but it may also
> happen for distinct hash values that form a chain (think a sequence of
> hash values 1,2,3,4,5,6,7,...,K - that is never gonna get fixed).
> 

I've managed to construct (by sheer brute force) an example that breaks
simplehash in this way. It triggers the growth by inserting a sequence of
distinct but consecutive hash values, e.g. 1, 2, 3, 4, ..., 1000, and
then trying to insert a value with a duplicate hash (say, 4).

This triggers the "emptydist" growth condition (as opposed to
"insertdist", which is what the other example hit).

This means there are very few collisions (no hash value is shared by more
than two distinct values), and all the hash values sit "at the beginning
of the range", so growing the hash table can't possibly break up the
chains of consecutive hash values.

The generated datasets are quite large (16MB each), so I've placed them
here (along with the ugly C code to generate them):

    https://github.com/tvondra/simplehash-break
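
The brute-force idea behind the generator is roughly the following (a
simplified sketch, not the actual code from the repository above, which
works against PostgreSQL's hashtext(); FNV-1a is used here only as a
self-contained stand-in): enumerate candidate strings, keep the first one
found for each bucket, and stop once almost all of the 2^20 range is
covered.

    /*
     * Simplified sketch of the brute-force generator.  The hash function is
     * a stand-in; the real datasets were built against PostgreSQL's own
     * text hashing.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define RANGE (1 << 20)         /* 1048576 hash buckets */

    static uint32_t
    fnv1a(const char *s)
    {
        uint32_t    h = 2166136261u;

        for (; *s; s++)
        {
            h ^= (unsigned char) *s;
            h *= 16777619u;
        }
        return h;
    }

    int
    main(void)
    {
        static char found[RANGE];   /* found[b] set once bucket b is covered */
        uint64_t    covered = 0;
        char        buf[32];

        for (uint64_t i = 0; covered < (uint64_t) (RANGE * 0.975); i++)
        {
            uint32_t    bucket;

            snprintf(buf, sizeof(buf), "val-%llu", (unsigned long long) i);
            bucket = fnv1a(buf) & (RANGE - 1);

            if (!found[bucket])
            {
                found[bucket] = 1;
                covered++;
                puts(buf);          /* one string per bucket, CSV-style */
            }
        }

        fprintf(stderr, "covered %llu of %d buckets\n",
                (unsigned long long) covered, RANGE);
        return 0;
    }

At ~97.5% bucket coverage, runs of a couple hundred consecutive occupied
hash values fall out naturally, which is where the long chains come from.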

Each CSV file contains ~1M strings with distinct hashes between 0 and
1048576 (covering about 97.5% of that range). There are "chains" of up
to ~300 consecutive hash values. This works:

    CREATE TABLE hash_text (val text);
    COPY hash_text FROM '/tmp/data1.csv';
    SET work_mem = '1GB';
    SELECT DISTINCT val FROM hash_text;

but this does not:

    CREATE TABLE hash_text (val text);
    COPY hash_text FROM '/tmp/data1.csv';
    COPY hash_text FROM '/tmp/data2.csv';
    SET work_mem = '1GB';
    SELECT DISTINCT val FROM hash_text;

In fact, only a small subset of the second data set should be needed.

FWIW, I first tried to do the same thing with bigint values, but no
matter what I did I was unable to generate a dataset covering more
than ~60% of the range (essentially, hashint8 only ever produces about 60%
of the values in [0, 1000000], which likely increases the collision rate).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

