Re: Hash support for arrays - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: Hash support for arrays
Date
Msg-id 4CD0217202000025000371A2@gw.wicourts.gov
Whole thread Raw
In response to Re: Hash support for arrays  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Hash support for arrays
List pgsql-hackers
Robert Haas <robertmhaas@gmail.com> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> The goal is to make those hard to predict, though.
>>
>> Really?  I think "I don't understand when this fails" isn't
>> obviously better than being able to predict when it fails ...
> 
> Isn't that the whole point of hash functions?  Collisions are
> inevitable, but you want them to be unpredictable.
Are you sure you're not confusing attributes of a good random number
generator with hash generation?  (They have much in common, but I've
only seen concern with "hard to predict" when working on random
number algorithms, as for financial audits or jury selection.)
> If you want a hash function with predictable collision behavior,
> just has the first element.  It'll be highly predictable AND
> wicked fast.
You're confusing "unpredictable" with "widely distributed", I think.
There's no reason that the hash value of an integer of the same size
as the hash shouldn't be the integer itself, for example.  It's hard
to get more predictable than that, yet it works well for hash
lookups.
I know that for a hash of a character string, the total = (total *
31) + nextcharacter has been shown to be effective.  That's hardly
random or hard to predict, but it does tend to spread out the hash
values well.  Whether it is as good for arrays, I don't know.  It
seems like a reasonable place to start, though, since a character
string is rather like an array of characters.
>> What concerns me about that is that it tends to push the bits to
>> the left --- I think the effects of the earlier inputs are going
>> to overflow out as you incorporate a lot of newer inputs.  What
>> you want is a scheme where every input item affects all the bits
>> of the final result about equally, and my gut feeling is this
>> doesn't provide that.
> 
> I don't think so.  That would be a problem if you multiplied by an
> even number.  You can see that you don't get dups if you just add
> in the same value over and over:
Yeah, that 1 in the low order position ensures that the impact of
any value which is ever added into the total is never entirely lost.
-Kevin


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Hash support for arrays
Next
From: Robert Haas
Date:
Subject: Re: ALTER TYPE recursion to typed tables