Re: 9.1 support for hashing arrays - Mailing list pgsql-hackers

From Robert Haas
Subject Re: 9.1 support for hashing arrays
Date
Msg-id BANLkTikx8zcek855j=t8tv-pS1xyq6KFiQ@mail.gmail.com
Whole thread Raw
In response to 9.1 support for hashing arrays  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: 9.1 support for hashing arrays  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-hackers
On Tue, May 17, 2011 at 2:44 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> The algorithm for this was discussed in the original thread
> (http://archives.postgresql.org/pgsql-hackers/2010-10/msg02050.php)
> but I don't that think a satisfactory conclusion was really reached.
> In particular, it is way too easy to come up with pathological cases
> that defeat the hashing algorithm, for example:
>
> CREATE TABLE foo(a int[][]);
> INSERT INTO foo SELECT array_fill(i, ARRAY[8,8])
>  FROM generate_series(1,10000) g(i);
>
> All 10000 arrays are different, but they all have the same hash value
> (0), so if the query optimiser chooses to hash the arrays, the
> performance will be very poor.
>
> A few people on that thread (myself included -
> http://archives.postgresql.org/pgsql-hackers/2010-11/msg00123.php)
> suggested using the multiply-by-31 algorithm but I think I failed to
> properly make the case for it. Having given it some further thought, I
> think there are some very sound mathematical reasons why that
> algorithm performs well:
>
> The algorithm is to take the current hash total, multiply it by 31 and
> then add on the hash of the next element. The final result is a
> polynomial sum, where each element's hash value is multiplied by a
> different power of 31.
>
> Since this is all modulo 2^32 arithmetic, the powers of 31 will
> eventually start repeating, and at that point the hashing algorithm
> could be defeated by transpositions. However, the number 31 has the
> property that its powers don't repeat for a long time - the powers of
> 31 modulo 2^32 form a cyclic group with a multiplicative order of 2^27
> (134217728). In other words 31^134217728 = 1 mod 2^32, and there are
> no smaller (strictly positive) powers of 31 for which this is the
> case.
>
> So the multiply-by-31 algorithm is only vulnerable to transpositions
> once the arrays reach 134217728 elements.
>
> For all smaller arrays, each array element's hash value is multiplied
> by a number different number from all the other elements, and since
> all the multipliers are odd numbers, *all* the individual bits from
> each element's hash value are distributed (differently) in the final
> value.
>
> Of course there are still going to be pathological cases, but they are
> very difficult to construct deliberately, and extremely unlikely to
> occur randomly. ISTM that this has all the properties of a good
> hashing algorithm (possibly the Java folks did a similar analysis and
> came to the same conclusion).

Yes, I never was very happy with the way that we were doing this, and
I think you make a coherent argument for why we should do it
differently.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Cannot build docs of 9.1 on Windows
Next
From: Selena Deckelmann
Date:
Subject: Re: Adding an example for replication configuration to pg_hba.conf