Thread: Index on TEXT versus CHAR(32)... fast exact TEXT matching
I have a table that stores TEXT information. I need query this table to find *exact* matches to the TEXT... no regular expressions, no LIKE queries, etc. The TEXT could be from 1 to 10000+ characters in length, quite variable. If it matters, the TEXT may contain UNICODE characters... Example: CREATE TABLE a (id SERIAL, thetext TEXT); SELECT id FROM a WHERE thetext='Some other text'; One way I thought to optimize this process would be to store an MD5 hash of the "thetext" column and query on that: CREATE TABLE a (id SERIAL, thetext TEXT, thetext_md5 CHAR(32)); SELECT id FROM a WHERE thetext_m5d=md5('Some other text'); Now, obviously I would want to build an INDEX on either "thetext" or "thetext_md5", depending on which way I decide to make the table. My question is, what is the absolute fastest way to find the exact match to a TEXT column? Any amount of pre-processing is fine (such as calculating the MD5's of all the TEXT tuples), but the query must be extremely fast. Has anyone compared (theoretical or practical) the performance of querying a TEXT-based INDEX versus a CHAR(32)-based INDEX? Is my MD5 idea a waste of time? Is there something better than MD5? Would it be better to store the "fingerprint" of the TEXT as an integer somehow, so that the INDEX could be based on a INT* column? Thanks for any help! -Jon -- -**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*--- Jon Lapham <lapham@jandr.org> Rio de Janeiro, Brasil Personal: http://www.jandr.org/ ***-*--*----*-------*------------*--------------------*---------------
> CREATE TABLE a (id SERIAL, thetext TEXT, thetext_md5 CHAR(32)); > SELECT id FROM a WHERE thetext_m5d=md5('Some other text'); From your explanations : - you need an exact match text->id - you already know the text Then, why bother with the TEXT column ? I have no idea on the performance of pg text indexes, but I do know that a table which fits in RAM, or an index which fits in RAM, is faster than when it does not. Try this : create table my_searches ( id serial primary key, text_md5 char[16] (or even a binary string) ); Then create a table my_text (id,full_text) with appropriate foreign key (to get a 1<->1 relationship between your tables) because you may still want to remember the text. Index it on id of course. My point is that your search table no longer has the big text field, it has only constant size fields, (thus constant rowsize) and it is much smaller. The index will also be smaller and maybe fit in RAM. You then : SELECT id FROM my_text WHERE id in (SELECT id FROM my_searches WHERE text_md5 = (your md5)) AND full_text=(your text); Unless a MD5 collision occurs, which should be rare, the subquery will return 1 result and the first SELECT will just do an index lookup on ID. Now, ok, I'm stupid, I reimplemented the Hash index... I'd be interested in your benchmarks of your real data with : - table with full text + id, with btree index on text - table with full text + id, with hash index on text - the scheme I just came up with
what about to use a CRC32 checksum of the text, computed by client application an then make index of crc32 data ? ie add column crc int4, add column md5 varchar(255 ) create index blabla on mytable (crc) or even create index blabla2 on mytable (crc,md5) and query like ......... where crc='crc32 of your searched text' AND md5='md5 hash' i am not shure, but this should work pretty fast. ----- Original Message ----- From: "Jon Lapham" <lapham@jandr.org> To: <pgsql-general@postgresql.org> Sent: Saturday, September 04, 2004 3:04 PM Subject: [GENERAL] Index on TEXT versus CHAR(32)... fast exact TEXT matching > I have a table that stores TEXT information. I need query this table to > find *exact* matches to the TEXT... no regular expressions, no LIKE > queries, etc. The TEXT could be from 1 to 10000+ characters in length, > quite variable. If it matters, the TEXT may contain UNICODE characters... > > Example: > CREATE TABLE a (id SERIAL, thetext TEXT); > SELECT id FROM a WHERE thetext='Some other text'; > > One way I thought to optimize this process would be to store an MD5 hash > of the "thetext" column and query on that: > > CREATE TABLE a (id SERIAL, thetext TEXT, thetext_md5 CHAR(32)); > SELECT id FROM a WHERE thetext_m5d=md5('Some other text'); > > Now, obviously I would want to build an INDEX on either "thetext" or > "thetext_md5", depending on which way I decide to make the table. > > My question is, what is the absolute fastest way to find the exact match > to a TEXT column? Any amount of pre-processing is fine (such as > calculating the MD5's of all the TEXT tuples), but the query must be > extremely fast. > > Has anyone compared (theoretical or practical) the performance of > querying a TEXT-based INDEX versus a CHAR(32)-based INDEX? Is my MD5 > idea a waste of time? Is there something better than MD5? Would it be > better to store the "fingerprint" of the TEXT as an integer somehow, so > that the INDEX could be based on a INT* column? > > Thanks for any help! > > -Jon > > -- > -**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*--- > Jon Lapham <lapham@jandr.org> Rio de Janeiro, Brasil > Personal: http://www.jandr.org/ > ***-*--*----*-------*------------*--------------------*--------------- > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faqs/FAQ.html > >