Techniques for quickly finding words in a phrase... - Mailing list pgsql-sql

From Saltsgaver, Scott
Subject Techniques for quickly finding words in a phrase...
Date
Msg-id 7283DE19D141D111AD0E00A0C95B1955024888FD@mail2.aiinet.com
Whole thread Raw
Responses Re: [SQL] Techniques for quickly finding words in a phrase...  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-sql
I have a large table where one of the columns is a text phrase (ex. "MARY
HAD A LITTLE LAMB").  I want to be able to search on that table with
the start of one or more words from that column ("LITT" and "LAMB").
I built a separate table whose foreign key maps back to my original table,
but also has a column with just the words from the phrase.

For example:

Table Phrase (indexed by id):   id | phrase   ---+-----------------------------   1  | MARY HAD A LITTLE LAMB   2  |
WHOSEFLEECE WAS WHITE AS SNOW
 

Table PhraseWords (indexed by word):   id | word   ---+-------   1  | MARY   1  | HAD   1  | A   1  | LITTLE   1  |
LAMB  2  | WHOSE   2  | FLEECE   2  | WAS   2  | WHITE   2  | AS   2  | SNOW
 

The number of rows in the Prase table is ~225K, and the number of rows
in the PhraseWords table is ~1M.  And there are also some high-frequency
words in the PhraseWords table, such as "THE" which recur many times.
If the user attempts to search for the words "WAS WHIT SNOW" I generate
the following SQL:

SELECT p.id, phrase FROM Phrase AS p, PhraseWords AS pw   WHERE       ((p.id = pw.id) AND word LIKE 'WAS%')       AND
EXISTS(SELECT id FROM PhraseWords AS pw               WHERE (p.id = pw.id) AND word LIKE 'WHIT%')       AND EXISTS
(SELECTid FROM PhraseWords AS pw               WHERE (p.id = pw.id) AND word LIKE 'SNOW%');
 

My EXPLAIN for this is:

Nested Loop  (cost=8033.62 rows=172 width=48) ->  Index Scan using idx_phraseword on phraseword pw     (cost=6256.27
rows=867width=4)
 
 ->  Index Scan using phrase_pkey on phrase p     (cost=2.05 rows=46669 width=44)
       SubPlan         ->  Index Scan using idx_phraseword on phraseword pw      (cost=6256.27 rows=1 width=4)
         ->  Index Scan using idx_phraseword on phraseword pw      (cost=6256.27 rows=1 width=4)


For some reason, the select still takes > 1 minute on a fairly decent
sized Linux box (500Mhz, 128MB ram).  I was using IN clauses instead of
the EXISTS clauses, and was noticing the EXPLAIN telling me that it was
doing sequential table lookups.  I then found in the FAQ that this is a
known issue.  I was hoping that by using EXISTS I could do better.

So I guess my two questions are:

1) Is there a way to make this more efficient?
2) Is there another technique in general that will allow me to let  my users search the table for words (or the start
ofwords, in  this case)?
 

Any help would be greatly appreciated!

Scott Saltsgaver




pgsql-sql by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [SQL] How to quietly increment a SEQUENCE?
Next
From: Tom Lane
Date:
Subject: Re: [SQL] Techniques for quickly finding words in a phrase...