Re: Index on TEXT versus CHAR(32)... fast exact TEXT matching - Mailing list pgsql-general

From NTPT
Subject Re: Index on TEXT versus CHAR(32)... fast exact TEXT matching
Date
Msg-id 006301c49460$3399df70$74beebd5@wbp1
Whole thread Raw
In response to Index on TEXT versus CHAR(32)... fast exact TEXT matching  (Jon Lapham <lapham@jandr.org>)
List pgsql-general
what about to use a CRC32 checksum of the text, computed by client
application an then  make index of crc32 data ?

ie

add column crc int4,
add column md5 varchar(255 )
 create index blabla on mytable (crc)
or even create index blabla2 on mytable (crc,md5)

and  query like ......... where crc='crc32 of your searched text' AND
md5='md5 hash'

i am not shure, but this should work pretty fast.


----- Original Message -----
From: "Jon Lapham" <lapham@jandr.org>
To: <pgsql-general@postgresql.org>
Sent: Saturday, September 04, 2004 3:04 PM
Subject: [GENERAL] Index on TEXT versus CHAR(32)... fast exact TEXT matching


> I have a table that stores TEXT information.  I need query this table to
> find *exact* matches to the TEXT... no regular expressions, no LIKE
> queries, etc.  The TEXT could be from 1 to 10000+ characters in length,
> quite variable.  If it matters, the TEXT may contain UNICODE characters...
>
> Example:
> CREATE TABLE a (id SERIAL, thetext TEXT);
> SELECT id FROM a WHERE thetext='Some other text';
>
> One way I thought to optimize this process would be to store an MD5 hash
> of the "thetext" column and query on that:
>
> CREATE TABLE a (id SERIAL, thetext TEXT, thetext_md5 CHAR(32));
> SELECT id FROM a WHERE thetext_m5d=md5('Some other text');
>
> Now, obviously I would want to build an INDEX on either "thetext" or
> "thetext_md5", depending on which way I decide to make the table.
>
> My question is, what is the absolute fastest way to find the exact match
> to a TEXT column?  Any amount of pre-processing is fine (such as
> calculating the MD5's of all the TEXT tuples), but the query must be
> extremely fast.
>
> Has anyone compared (theoretical or practical) the performance of
> querying a TEXT-based INDEX versus a CHAR(32)-based INDEX?  Is my MD5
> idea a waste of time?  Is there something better than MD5?  Would it be
> better to store the "fingerprint" of the TEXT as an integer somehow, so
> that the INDEX could be based on a INT* column?
>
> Thanks for any help!
>
> -Jon
>
> --
> -**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*---
>   Jon Lapham  <lapham@jandr.org>                Rio de Janeiro, Brasil
>   Personal: http://www.jandr.org/
> ***-*--*----*-------*------------*--------------------*---------------
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faqs/FAQ.html
>
>


pgsql-general by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Passing RECORD variable from func1() to func2()
Next
From: Valerie Schneider DSI/DEV
Date:
Subject: Pb with ecpg and include file on PG 8.0.0