Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop - Mailing list pgsql-bugs

From Thomas Munro
Subject Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop
Date
Msg-id CAEepm=10VqZm7JF2YSC9CV2uYzVvVeVUB2yGtXYf5TZGmCsSmw@mail.gmail.com
Whole thread Raw
In response to Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On Tue, Nov 28, 2017 at 5:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> You beat me to it --- after looking at simplehash.h I'd guessed that
> either the SH_GROW_MAX_DIB or SH_GROW_MAX_MOVE code path was causing
> an infinite loop, but I'd not gotten to determining which one yet.
>
> I'd ask what's the theory behind SH_GROW_MAX_MOVE, as well.  Neither
> of them are obviously loop-proof.
>
> Note that the sample data has a lot of collisions:
>
> regression=# select hashint8(val), count(*) from reproducer group by 1 order by 2 desc;
>   hashint8   | count
> -------------+-------
>    441526644 |  2337
>  -1117776826 |  1221
>  -1202007016 |   935
>  -2068831050 |   620
>   1156644653 |   538
>    553783815 |   510
>    259780770 |   444
>    371047036 |   394
>    915722575 |   359
>  ... etc etc ...
>
> It's evidently more complicated than just that the code fails with
> more than SH_GROW_MAX_DIB duplicate hashcodes, but I suspect not
> by a lot.  There needs to be a safety valve that prevents letting
> the hash fill factor approach zero, which is what's happening in
> this test case.

Yeah.  If you count distinct hashint8(val) of *distinct* values, you get:

postgres=# select hashint8(val), count(*) from (select distinct val
from reproducer) ss group by 1 order by 2 desc limit 1; hashint8  | count
------------+-------1288181237 |    26
(1 row)

It doesn't matter how many bits of that you use, you'll get at least
26 collisions, so our loop can't terminate just by increasing the mask
size.  My guess is that you'd either need a defence based on something
like a secondary hash function, or at least a dynamic SH_GROW_MAX_DIB
that wouldn't allow the fill factor to plummet as you say.  A dataset
could be found that would exceed any static value of SH_GROW_MAX_DIB
by brute force.  In this case, considering the definition of hashint8,
there are more straightforward ways to find distinct bigint values
that hash to the same value... just swap some bits between the high
and low words, or something like that.

-- 
Thomas Munro
http://www.enterprisedb.com


pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Next
From: Tomas Vondra
Date:
Subject: Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop