On 13 Jun 2005, at 23:49, Peter Fein wrote:
> Hi-
>
> I wanted to use a partially unique index (dependent on a flag) on a
> TEXT
> column, but the index row size was too big for btrees. See the thread
> "index row size 2728 exceeds btree maximum, 2713" from the
> beginning of
> this month for someone with a similar problem. In it, someone
> suggests
> indexing on a hash of the text. I'm fine with this, as the texts in
> question are similar enough to each other to make collisions unlikely
> and a collision won't really cause any serious problems.
>
> My question is: is the builtin MD5 appropriate for this use or
> should I
> be using a function from pl/something? Figures on collision rates
> would
> be nice as well - the typical chunk of text is probably 1k-8k.
>
> Thanks!
>
As others have said MD5 isn't the fastest one out there. However no
cryptographically secure hashes are really that fast. However you
can probably get away with using a CRC hash which is long enough to
reduce your chances of collision a lot. However, PostgreSQL doesn't
have a built in CRC function, which is a bit of a pain unless your
prepared to implement one, or use pl/* to do it, which sounds like
overkill. I suggest you run some benchmarks on MD5 and see if it's
fast enough to meet your current (and perhaps future) needs.
You could of course, just use a hash index on your text field! I
think that would probably cope with larger data sets OK. It has the
advantage of handling collisions for you as well :) However it means
you have to transfer large amounts of data around, so if network
speed ever becomes a limitation, MD5 hashing (or enabling compression
on your PgSQL connection) may help.
> --
> Peter Fein pfein@pobox.com
> 773-575-0694
>
> Basically, if you're not a utopianist, you're a schmuck. -J. Feldman
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>
>
>