Thread: 9.1 support for hashing arrays

9.1 support for hashing arrays

From

Dean Rasheed

Date:

17 May 2011, 18:44:46

The algorithm for this was discussed in the original thread
(http://archives.postgresql.org/pgsql-hackers/2010-10/msg02050.php)
but I don't that think a satisfactory conclusion was really reached.
In particular, it is way too easy to come up with pathological cases
that defeat the hashing algorithm, for example:

CREATE TABLE foo(a int[][]);
INSERT INTO foo SELECT array_fill(i, ARRAY[8,8])
  FROM generate_series(1,10000) g(i);

All 10000 arrays are different, but they all have the same hash value
(0), so if the query optimiser chooses to hash the arrays, the
performance will be very poor.

A few people on that thread (myself included -
http://archives.postgresql.org/pgsql-hackers/2010-11/msg00123.php)
suggested using the multiply-by-31 algorithm but I think I failed to
properly make the case for it. Having given it some further thought, I
think there are some very sound mathematical reasons why that
algorithm performs well:

The algorithm is to take the current hash total, multiply it by 31 and
then add on the hash of the next element. The final result is a
polynomial sum, where each element's hash value is multiplied by a
different power of 31.

Since this is all modulo 2^32 arithmetic, the powers of 31 will
eventually start repeating, and at that point the hashing algorithm
could be defeated by transpositions. However, the number 31 has the
property that its powers don't repeat for a long time - the powers of
31 modulo 2^32 form a cyclic group with a multiplicative order of 2^27
(134217728). In other words 31^134217728 = 1 mod 2^32, and there are
no smaller (strictly positive) powers of 31 for which this is the
case.

So the multiply-by-31 algorithm is only vulnerable to transpositions
once the arrays reach 134217728 elements.

For all smaller arrays, each array element's hash value is multiplied
by a number different number from all the other elements, and since
all the multipliers are odd numbers, *all* the individual bits from
each element's hash value are distributed (differently) in the final
value.

Of course there are still going to be pathological cases, but they are
very difficult to construct deliberately, and extremely unlikely to
occur randomly. ISTM that this has all the properties of a good
hashing algorithm (possibly the Java folks did a similar analysis and
came to the same conclusion).

Regards,
Dean

Attachment

array-hashing.patch

Re: 9.1 support for hashing arrays

From

Robert Haas

Date:

19 May 2011, 14:33:35

On Tue, May 17, 2011 at 2:44 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> The algorithm for this was discussed in the original thread
> (http://archives.postgresql.org/pgsql-hackers/2010-10/msg02050.php)
> but I don't that think a satisfactory conclusion was really reached.
> In particular, it is way too easy to come up with pathological cases
> that defeat the hashing algorithm, for example:
>
> CREATE TABLE foo(a int[][]);
> INSERT INTO foo SELECT array_fill(i, ARRAY[8,8])
>  FROM generate_series(1,10000) g(i);
>
> All 10000 arrays are different, but they all have the same hash value
> (0), so if the query optimiser chooses to hash the arrays, the
> performance will be very poor.
>
> A few people on that thread (myself included -
> http://archives.postgresql.org/pgsql-hackers/2010-11/msg00123.php)
> suggested using the multiply-by-31 algorithm but I think I failed to
> properly make the case for it. Having given it some further thought, I
> think there are some very sound mathematical reasons why that
> algorithm performs well:
>
> The algorithm is to take the current hash total, multiply it by 31 and
> then add on the hash of the next element. The final result is a
> polynomial sum, where each element's hash value is multiplied by a
> different power of 31.
>
> Since this is all modulo 2^32 arithmetic, the powers of 31 will
> eventually start repeating, and at that point the hashing algorithm
> could be defeated by transpositions. However, the number 31 has the
> property that its powers don't repeat for a long time - the powers of
> 31 modulo 2^32 form a cyclic group with a multiplicative order of 2^27
> (134217728). In other words 31^134217728 = 1 mod 2^32, and there are
> no smaller (strictly positive) powers of 31 for which this is the
> case.
>
> So the multiply-by-31 algorithm is only vulnerable to transpositions
> once the arrays reach 134217728 elements.
>
> For all smaller arrays, each array element's hash value is multiplied
> by a number different number from all the other elements, and since
> all the multipliers are odd numbers, *all* the individual bits from
> each element's hash value are distributed (differently) in the final
> value.
>
> Of course there are still going to be pathological cases, but they are
> very difficult to construct deliberately, and extremely unlikely to
> occur randomly. ISTM that this has all the properties of a good
> hashing algorithm (possibly the Java folks did a similar analysis and
> came to the same conclusion).

Yes, I never was very happy with the way that we were doing this, and
I think you make a coherent argument for why we should do it
differently.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: 9.1 support for hashing arrays

From

Dean Rasheed

Date:

20 May 2011, 05:43:49

On 19 May 2011 15:33, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, May 17, 2011 at 2:44 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>> The algorithm for this was discussed in the original thread
>> (http://archives.postgresql.org/pgsql-hackers/2010-10/msg02050.php)
>> but I don't that think a satisfactory conclusion was really reached.
>> In particular, it is way too easy to come up with pathological cases
>> that defeat the hashing algorithm, for example:
>>
>> CREATE TABLE foo(a int[][]);
>> INSERT INTO foo SELECT array_fill(i, ARRAY[8,8])
>>  FROM generate_series(1,10000) g(i);
>>
>> All 10000 arrays are different, but they all have the same hash value
>> (0), so if the query optimiser chooses to hash the arrays, the
>> performance will be very poor.
>>
>> A few people on that thread (myself included -
>> http://archives.postgresql.org/pgsql-hackers/2010-11/msg00123.php)
>> suggested using the multiply-by-31 algorithm but I think I failed to
>> properly make the case for it. Having given it some further thought, I
>> think there are some very sound mathematical reasons why that
>> algorithm performs well:
>>
>> The algorithm is to take the current hash total, multiply it by 31 and
>> then add on the hash of the next element. The final result is a
>> polynomial sum, where each element's hash value is multiplied by a
>> different power of 31.
>>
>> Since this is all modulo 2^32 arithmetic, the powers of 31 will
>> eventually start repeating, and at that point the hashing algorithm
>> could be defeated by transpositions. However, the number 31 has the
>> property that its powers don't repeat for a long time - the powers of
>> 31 modulo 2^32 form a cyclic group with a multiplicative order of 2^27
>> (134217728). In other words 31^134217728 = 1 mod 2^32, and there are
>> no smaller (strictly positive) powers of 31 for which this is the
>> case.
>>
>> So the multiply-by-31 algorithm is only vulnerable to transpositions
>> once the arrays reach 134217728 elements.
>>
>> For all smaller arrays, each array element's hash value is multiplied
>> by a number different number from all the other elements, and since
>> all the multipliers are odd numbers, *all* the individual bits from
>> each element's hash value are distributed (differently) in the final
>> value.
>>
>> Of course there are still going to be pathological cases, but they are
>> very difficult to construct deliberately, and extremely unlikely to
>> occur randomly. ISTM that this has all the properties of a good
>> hashing algorithm (possibly the Java folks did a similar analysis and
>> came to the same conclusion).
>
> Yes, I never was very happy with the way that we were doing this, and
> I think you make a coherent argument for why we should do it
> differently.
>

Doh! I forgot one important piece of this algorithm - it is necessary
to initialise the result to something non-zero at the start so that
adding leading nulls to an array changes the final result.

Regards,
Dean

Attachment

array-hashing2.patch

Re: 9.1 support for hashing arrays

From

Robert Haas

Date:

23 May 2011, 03:16:59

On Fri, May 20, 2011 at 1:43 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> Doh! I forgot one important piece of this algorithm - it is necessary
> to initialise the result to something non-zero at the start so that
> adding leading nulls to an array changes the final result.

Looks reasonable.

I believe, however, that applying this will invalidate the contents of
any hash indexes on array types that anyone has built using 9.1beta1.
Do we need to do something about that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: 9.1 support for hashing arrays

From

Tom Lane

Date:

23 May 2011, 03:49:25

Robert Haas <robertmhaas@gmail.com> writes:
> I believe, however, that applying this will invalidate the contents of
> any hash indexes on array types that anyone has built using 9.1beta1.
> Do we need to do something about that?

Like bumping catversion?

I would probably complain about that, except you already did it post-beta1:
http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=9bb6d9795253bb521f81c626fea49a704a369ca9

Possibly Bruce will feel like adding a check to pg_upgrade for the case.
I wouldn't bother myself though.  It seems quite unlikely that anyone's
depending on the feature yet.
        regards, tom lane

Re: 9.1 support for hashing arrays

From

Robert Haas

Date:

23 May 2011, 14:34:45

On Sun, May 22, 2011 at 11:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I believe, however, that applying this will invalidate the contents of
>> any hash indexes on array types that anyone has built using 9.1beta1.
>> Do we need to do something about that?
>
> Like bumping catversion?

Sure.  Although note that the system catalogs are not actually
changing, which goes to someone else's recent point about catversion
getting bumped for things other than changes in the things for which
the "cat" in "catversion" is an abbreviation.

> I would probably complain about that, except you already did it post-beta1:
> http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=9bb6d9795253bb521f81c626fea49a704a369ca9

Unfortunately, I was unable to make that omelet without breaking some eggs.  :-(

> Possibly Bruce will feel like adding a check to pg_upgrade for the case.
> I wouldn't bother myself though.  It seems quite unlikely that anyone's
> depending on the feature yet.

I'll leave that to you, Bruce, and whoever else wants to weigh in to
hammer that one out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: 9.1 support for hashing arrays

From

Bruce Momjian

Date:

24 May 2011, 12:33:16

Robert Haas wrote:
> On Sun, May 22, 2011 at 11:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> I believe, however, that applying this will invalidate the contents of
> >> any hash indexes on array types that anyone has built using 9.1beta1.
> >> Do we need to do something about that?
> >
> > Like bumping catversion?
> 
> Sure.  Although note that the system catalogs are not actually
> changing, which goes to someone else's recent point about catversion
> getting bumped for things other than changes in the things for which
> the "cat" in "catversion" is an abbreviation.
> 
> > I would probably complain about that, except you already did it post-beta1:
> > http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=9bb6d9795253bb521f81c626fea49a704a369ca9
> 
> Unfortunately, I was unable to make that omelet without breaking some eggs.  :-(
> 
> > Possibly Bruce will feel like adding a check to pg_upgrade for the case.
> > I wouldn't bother myself though. ?It seems quite unlikely that anyone's
> > depending on the feature yet.
> 
> I'll leave that to you, Bruce, and whoever else wants to weigh in to
> hammer that one out.

Oh, you are worried someone might have stored hash indexes with the old
catalog format?  Seems like something we might mention in the next beta
release announcement, but nothing more.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +