Thread: hashtext () and collisions
Hello, Okay, I have some troubles trying to determine how to most efficiently store a database which will contain a couple of huge tables (think 5bil+ rows). These tables each have a bigint id and a character varying value. Now, I'm currently partitioning these tables based on the hashtext (value) % 1000, to determine which subtable a certain value should be stored in. However, I often also need to find a value for an id; instead of using the sequential numbering that a BIGSERIAL would provide, I am thinking: wouldn't it make some kind of sense if I used the value of hashtext('value') to determine the id ? Then, if I need to determine the value that belongs to a certain id, I can just % 1000 the value and know which subtable the value is stored in, reducing the amount of tables to search with a factor 500. Now, my question is: how big is the chance that a collision happens between hashes ? I noticed that the function only returns a 32 bit number, so I figure it must be at least once in the 4 billion values. If this approach is not recommended (using hashes as keys), any other suggestions on how to make the subtable name derivable from an identification number ? -- Leon Mergen http://www.solatis.com
On 2007-04-11, "Leon Mergen" <leon@solatis.com> wrote: > Now, my question is: how big is the chance that a collision happens > between hashes ? I noticed that the function only returns a 32 bit > number, so I figure it must be at least once in the 4 billion values. Assuming it's a uniform random hash, 32 bits long, then if you have 65536 values, you have a ~40% chance of at least one collision. Any defects in the hash function only increase that probability. This is a result of what's known as the "birthday paradox" (so-called because in a group of 23 people, there is a better than even chance that two of them share a birthday). The number of rows needed to have an approximately even chance of at least one collision grows as the _square root_ of the number of hash buckets; or to put it another way, you always need _more than twice as many bits_ in your hash value than you think you do. (e.g. using md5(), which is a 128-bit hash) -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services
On 2007-04-11, "Leon Mergen" <leon@solatis.com> wrote: > Now, my question is: how big is the chance that a collision happens > between hashes ? I noticed that the function only returns a 32 bit > number, so I figure it must be at least once in the 4 billion values. Assuming it's a uniform random hash, 32 bits long, then if you have 65536 values, you have a ~40% chance of at least one collision. Any defects in the hash function only increase that probability. This is a result of what's known as the "birthday paradox" (so-called because in a group of 23 people, there is a better than even chance that two of them share a birthday). The number of rows needed to have an approximately even chance of at least one collision grows as the _square root_ of the number of hash buckets; or to put it another way, you always need _more than twice as many bits_ in your hash value than you think you do. (e.g. using md5(), which is a 128-bit hash) -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services