Re: Hash index todo list item - Mailing list pgsql-hackers

From Kenneth Marshall
Subject Re: Hash index todo list item
Date
Msg-id 20070908202122.GA5679@it.is.rice.edu
Whole thread Raw
In response to Re: Hash index todo list item  (Brian Hurt <bhurt@janestcapital.com>)
Responses Re: Hash index todo list item
List pgsql-hackers
On Fri, Sep 07, 2007 at 10:36:41AM -0400, Brian Hurt wrote:
> Kenneth Marshall wrote:
>
>> I understand that a hash value is a many-to-one mapping. That is the
>> point of the flag in the index. The flag means that there is only one
>> item in the heap corresponding to that hash value. In this case we
>> know that the value in the heap is the correct one and a possibly
>> very expensive string comparison can be skipped. Given that the hash
>> function is doing its job, almost every string comparison can be skipped.
>> How long would it take to compare 1-32K of data? How much CPU usage?
>> With this field in place, you only need to check tuple visibility.
>>  
>
> How likely is it that you will get a hash collision, two strings that are 
> different that will hash to the same value?  To avoid this requires a very 
> large hash key (128 bits, minimum)- otherwise you get into birthday attack 
> problems.  With a 32-bit hash, the likelyhood is greater than 50% that two 
> strings in a collection of 100,000 will hash to the same value.  With a 
> 64-bit hash, the likelyhood is greater than 50% that two strings in a 
> collection of 10 billion will has to same value.  10 billion is a large 
> number, but not an unreasonable number, of strings to want to put into a 
> hash table- and it's exactly this case where the O(1) cost of hashtables 
> starts being a real win.
>
> Brian
>
Continuing this train of thought.... While it would make sense for larger
keys to store the hash in the index, if the key is smaller, particularly
if it is of fixed size, it would make sense to store the key in the index
instead. This would have the benefit of allowing use of the hash index
in non-lossy mode albeit with a slight increase in complexity.

Ken


pgsql-hackers by date:

Previous
From: Hannu Krosing
Date:
Subject: Re: [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Next
From: Tom Lane
Date:
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax