Re: Hash Function: MD5 or other? - Mailing list pgsql-general
From:           Peter Fein
Subject:        Re: Hash Function: MD5 or other?
Date:
Msg-id:         42AEDCAE.2070608@pobox.com
In response to: Re: Hash Function: MD5 or other? (Alex Stapleton <alexs@advfn.com>)
Responses:      Re: Hash Function: MD5 or other?
                Re: Hash Function: MD5 or other?
List:           pgsql-general
Alex Stapleton wrote:
> On 13 Jun 2005, at 23:49, Peter Fein wrote:
>
>> Hi-
>>
>> I wanted to use a partially unique index (dependent on a flag) on a
>> TEXT column, but the index row size was too big for btrees. See the
>> thread "index row size 2728 exceeds btree maximum, 2713" from the
>> beginning of this month for someone with a similar problem. In it,
>> someone suggests indexing on a hash of the text. I'm fine with this,
>> as the texts in question are similar enough to each other to make
>> collisions unlikely, and a collision won't really cause any serious
>> problems.
>>
>> My question is: is the builtin MD5 appropriate for this use, or
>> should I be using a function from pl/something? Figures on collision
>> rates would be nice as well - the typical chunk of text is probably
>> 1k-8k.
>>
>> Thanks!
>
> As others have said, MD5 isn't the fastest one out there, but no
> cryptographically secure hash is really that fast. You could probably
> get away with using a CRC hash that is long enough to reduce your
> chances of collision a lot. However, PostgreSQL doesn't have a
> built-in CRC function, which is a bit of a pain unless you're
> prepared to implement one or use pl/* to do it, which sounds like
> overkill. I suggest you run some benchmarks on MD5 and see if it's
> fast enough to meet your current (and perhaps future) needs.
>
> You could, of course, just use a hash index on your text field! I
> think that would probably cope with larger data sets OK. It has the
> advantage of handling collisions for you as well :) However, it means
> you have to transfer large amounts of data around, so if network
> speed ever becomes a limitation, MD5 hashing (or enabling compression
> on your PgSQL connection) may help.

Let me clarify my problem: I've got a bunch of texts, stored in column
sometext. These texts are assigned to groups, identified by group_id.
Each group may contain rows with identical sometext. Finding the
appropriate group when inserting is very expensive and is calculated on
the client side. I want to mark *one* row of each set of exact
duplicates in a group with a flag, group_representative. Basically, a
partial index on:

(group_id, sometext) WHERE group_representative=TRUE

(see the sketch in the P.S. below). The motivation is then to be able
to do:

SELECT * FROM mytable
WHERE <some other conditions> AND group_representative=TRUE

instead of:

SELECT DISTINCT ON (group_id, sometext) * FROM mytable
WHERE <some other conditions>

since there are more of these queries than inserts (and they return a
lot more rows than belong to a single group_id). I suspect that running
a DISTINCT in this case would be more expensive than updating the index
when inserting. Knowing the specifics of the data I'm putting in
sometext, a halfway decent hash function would make collisions rare
enough that the chance is insignificant (and collisions wouldn't break
anything anyway).

Is this approach reasonable, or should I use a hash index on (group_id,
sometext) - does that suffer from the same size limitation as btrees? I
thought hash indexes were slow...

Hope that's clearer...

--Pete

--
Peter Fein                 pfein@pobox.com                 773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman
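P.S. For concreteness, here's an untested sketch of the table and the
md5-based partial index I have in mind - the table and column names
just mirror the description above, and md5() is the builtin:

-- Hypothetical table matching the description above
CREATE TABLE mytable (
    id                   SERIAL PRIMARY KEY,
    group_id             INTEGER NOT NULL,
    sometext             TEXT NOT NULL,
    group_representative BOOLEAN NOT NULL DEFAULT FALSE
);

-- Index the 32-char md5 hex digest instead of sometext itself, so the
-- index entry stays well under the btree row-size limit no matter how
-- big the text gets. UNIQUE enforces one representative per
-- (group_id, text) pair among flagged rows.
CREATE UNIQUE INDEX mytable_group_rep_idx
    ON mytable (group_id, md5(sometext))
    WHERE group_representative = TRUE;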
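P.P.S. And a sketch of the hash-index route Alex suggests (also
untested). As far as I know, a hash index covers a single column, so
this indexes sometext alone:

CREATE INDEX mytable_sometext_hash_idx
    ON mytable USING hash (sometext);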