Re: Values larger than 1/3 of a buffer page cannot be indexed. - Mailing list pgsql-general

From Merlin Moncure
Subject Re: Values larger than 1/3 of a buffer page cannot be indexed.
Date
Msg-id AANLkTi=hQuxQpdcBumAh6FWiFZ+KwoFQ9bhAPjFoH7-s@mail.gmail.com
Whole thread Raw
In response to Re: Values larger than 1/3 of a buffer page cannot be indexed.  (Craig Ringer <craig@postnewspapers.com.au>)
List pgsql-general
On Mon, Mar 28, 2011 at 10:28 PM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
> On 03/14/2011 09:25 PM, Merlin Moncure wrote:
>
>>> Unless the point is to guarantee uniqueness of the "long-long value"s.
>>
>> md5 will do that too: the main thing you lose going to hash indexing
>> is ordering.
>
> MD5 will *probably* guarantee the uniqueness of the values. Personally I'm
> not a big fan of betting on that, and tend to like to test for equality
> against the hash first and then, if the hashes are equal, against the
> values.
>
> I have several md5 collisions in one of my tables, so it's far from
> impossible. In fact, it's much more likely than you'd expect thanks to the
> birthday paradox.

I'll agree that 128 bit key is just on the cusp of discomfort if you
have a large data set and/or want to be really safe from random
collision, but any larger key and it's just worry for the sake of
worrying.

You mentioned the birthday attack -- let's consult the chart.  For MD5
(128 bits), if you are ok with 10^-15 chance of *one* collision (odds
1 in 1,000,000,000,000,000), you can safely hash up to 100,000,000,000
unique values.  Of course, md5 is not perfectly distributing so the
numbers are not in fact that good,  but those are pretty tall odds.
If you are concerned about safety, you are much better off jumping to
sha-1 (or sha-256 if you're really nervous)  than comparing the base
string, since your chance of random bump is now well below a number of
other low probability events, such as a drive bit error rate (meaning,
your base string comparison is useless) or getting hit by a comet.

Now, if you're worried about intentionally forged data, that's
different, but it doesn't sound like that's the case here.  When NIST
hash competition resolves (my money is on skein), you'll likely be
able to to just pick an output size that matches your requirements.

merlin

pgsql-general by date:

Previous
From: Craig Ringer
Date:
Subject: Re: Values larger than 1/3 of a buffer page cannot be indexed.
Next
From: Jeremy Palmer
Date:
Subject: Re: Out of memory