what about to use a CRC32 checksum of the text, computed by client
application an then make index of crc32 data ?
ie
add column crc int4,
add column md5 varchar(255 )
create index blabla on mytable (crc)
or even create index blabla2 on mytable (crc,md5)
and query like ......... where crc='crc32 of your searched text' AND
md5='md5 hash'
i am not shure, but this should work pretty fast.
----- Original Message -----
From: "Jon Lapham" <lapham@jandr.org>
To: <pgsql-general@postgresql.org>
Sent: Saturday, September 04, 2004 3:04 PM
Subject: [GENERAL] Index on TEXT versus CHAR(32)... fast exact TEXT matching
> I have a table that stores TEXT information. I need query this table to
> find *exact* matches to the TEXT... no regular expressions, no LIKE
> queries, etc. The TEXT could be from 1 to 10000+ characters in length,
> quite variable. If it matters, the TEXT may contain UNICODE characters...
>
> Example:
> CREATE TABLE a (id SERIAL, thetext TEXT);
> SELECT id FROM a WHERE thetext='Some other text';
>
> One way I thought to optimize this process would be to store an MD5 hash
> of the "thetext" column and query on that:
>
> CREATE TABLE a (id SERIAL, thetext TEXT, thetext_md5 CHAR(32));
> SELECT id FROM a WHERE thetext_m5d=md5('Some other text');
>
> Now, obviously I would want to build an INDEX on either "thetext" or
> "thetext_md5", depending on which way I decide to make the table.
>
> My question is, what is the absolute fastest way to find the exact match
> to a TEXT column? Any amount of pre-processing is fine (such as
> calculating the MD5's of all the TEXT tuples), but the query must be
> extremely fast.
>
> Has anyone compared (theoretical or practical) the performance of
> querying a TEXT-based INDEX versus a CHAR(32)-based INDEX? Is my MD5
> idea a waste of time? Is there something better than MD5? Would it be
> better to store the "fingerprint" of the TEXT as an integer somehow, so
> that the INDEX could be based on a INT* column?
>
> Thanks for any help!
>
> -Jon
>
> --
> -**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*---
> Jon Lapham <lapham@jandr.org> Rio de Janeiro, Brasil
> Personal: http://www.jandr.org/
> ***-*--*----*-------*------------*--------------------*---------------
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faqs/FAQ.html
>
>