Re: 9.1 support for hashing arrays - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: 9.1 support for hashing arrays |
Date | |
Msg-id | BANLkTikx8zcek855j=t8tv-pS1xyq6KFiQ@mail.gmail.com Whole thread Raw |
In response to | 9.1 support for hashing arrays (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: 9.1 support for hashing arrays
(Dean Rasheed <dean.a.rasheed@gmail.com>)
|
List | pgsql-hackers |
On Tue, May 17, 2011 at 2:44 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > The algorithm for this was discussed in the original thread > (http://archives.postgresql.org/pgsql-hackers/2010-10/msg02050.php) > but I don't that think a satisfactory conclusion was really reached. > In particular, it is way too easy to come up with pathological cases > that defeat the hashing algorithm, for example: > > CREATE TABLE foo(a int[][]); > INSERT INTO foo SELECT array_fill(i, ARRAY[8,8]) > FROM generate_series(1,10000) g(i); > > All 10000 arrays are different, but they all have the same hash value > (0), so if the query optimiser chooses to hash the arrays, the > performance will be very poor. > > A few people on that thread (myself included - > http://archives.postgresql.org/pgsql-hackers/2010-11/msg00123.php) > suggested using the multiply-by-31 algorithm but I think I failed to > properly make the case for it. Having given it some further thought, I > think there are some very sound mathematical reasons why that > algorithm performs well: > > The algorithm is to take the current hash total, multiply it by 31 and > then add on the hash of the next element. The final result is a > polynomial sum, where each element's hash value is multiplied by a > different power of 31. > > Since this is all modulo 2^32 arithmetic, the powers of 31 will > eventually start repeating, and at that point the hashing algorithm > could be defeated by transpositions. However, the number 31 has the > property that its powers don't repeat for a long time - the powers of > 31 modulo 2^32 form a cyclic group with a multiplicative order of 2^27 > (134217728). In other words 31^134217728 = 1 mod 2^32, and there are > no smaller (strictly positive) powers of 31 for which this is the > case. > > So the multiply-by-31 algorithm is only vulnerable to transpositions > once the arrays reach 134217728 elements. > > For all smaller arrays, each array element's hash value is multiplied > by a number different number from all the other elements, and since > all the multipliers are odd numbers, *all* the individual bits from > each element's hash value are distributed (differently) in the final > value. > > Of course there are still going to be pathological cases, but they are > very difficult to construct deliberately, and extremely unlikely to > occur randomly. ISTM that this has all the properties of a good > hashing algorithm (possibly the Java folks did a similar analysis and > came to the same conclusion). Yes, I never was very happy with the way that we were doing this, and I think you make a coherent argument for why we should do it differently. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: