9.1 support for hashing arrays - Mailing list pgsql-hackers

From Dean Rasheed
Subject 9.1 support for hashing arrays
Date
Msg-id BANLkTingTMjG=ycaOU8YZqzfhA9YG2DdEA@mail.gmail.com
Whole thread Raw
Responses Re: 9.1 support for hashing arrays  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
The algorithm for this was discussed in the original thread
(http://archives.postgresql.org/pgsql-hackers/2010-10/msg02050.php)
but I don't that think a satisfactory conclusion was really reached.
In particular, it is way too easy to come up with pathological cases
that defeat the hashing algorithm, for example:

CREATE TABLE foo(a int[][]);
INSERT INTO foo SELECT array_fill(i, ARRAY[8,8])
  FROM generate_series(1,10000) g(i);

All 10000 arrays are different, but they all have the same hash value
(0), so if the query optimiser chooses to hash the arrays, the
performance will be very poor.

A few people on that thread (myself included -
http://archives.postgresql.org/pgsql-hackers/2010-11/msg00123.php)
suggested using the multiply-by-31 algorithm but I think I failed to
properly make the case for it. Having given it some further thought, I
think there are some very sound mathematical reasons why that
algorithm performs well:

The algorithm is to take the current hash total, multiply it by 31 and
then add on the hash of the next element. The final result is a
polynomial sum, where each element's hash value is multiplied by a
different power of 31.

Since this is all modulo 2^32 arithmetic, the powers of 31 will
eventually start repeating, and at that point the hashing algorithm
could be defeated by transpositions. However, the number 31 has the
property that its powers don't repeat for a long time - the powers of
31 modulo 2^32 form a cyclic group with a multiplicative order of 2^27
(134217728). In other words 31^134217728 = 1 mod 2^32, and there are
no smaller (strictly positive) powers of 31 for which this is the
case.

So the multiply-by-31 algorithm is only vulnerable to transpositions
once the arrays reach 134217728 elements.

For all smaller arrays, each array element's hash value is multiplied
by a number different number from all the other elements, and since
all the multipliers are odd numbers, *all* the individual bits from
each element's hash value are distributed (differently) in the final
value.

Of course there are still going to be pathological cases, but they are
very difficult to construct deliberately, and extremely unlikely to
occur randomly. ISTM that this has all the properties of a good
hashing algorithm (possibly the Java folks did a similar analysis and
came to the same conclusion).

Regards,
Dean

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Review: psql include file using relative path
Next
From: Jaime Casanova
Date:
Subject: Re: DOMAINs and CASTs