Thread: Techniques for quickly finding words in a phrase...

Techniques for quickly finding words in a phrase...

From
"Saltsgaver, Scott"
Date:
I have a large table where one of the columns is a text phrase (ex. "MARY
HAD A LITTLE LAMB").  I want to be able to search on that table with
the start of one or more words from that column ("LITT" and "LAMB").
I built a separate table whose foreign key maps back to my original table,
but also has a column with just the words from the phrase.

For example:

Table Phrase (indexed by id):   id | phrase   ---+-----------------------------   1  | MARY HAD A LITTLE LAMB   2  |
WHOSEFLEECE WAS WHITE AS SNOW
 

Table PhraseWords (indexed by word):   id | word   ---+-------   1  | MARY   1  | HAD   1  | A   1  | LITTLE   1  |
LAMB  2  | WHOSE   2  | FLEECE   2  | WAS   2  | WHITE   2  | AS   2  | SNOW
 

The number of rows in the Prase table is ~225K, and the number of rows
in the PhraseWords table is ~1M.  And there are also some high-frequency
words in the PhraseWords table, such as "THE" which recur many times.
If the user attempts to search for the words "WAS WHIT SNOW" I generate
the following SQL:

SELECT p.id, phrase FROM Phrase AS p, PhraseWords AS pw   WHERE       ((p.id = pw.id) AND word LIKE 'WAS%')       AND
EXISTS(SELECT id FROM PhraseWords AS pw               WHERE (p.id = pw.id) AND word LIKE 'WHIT%')       AND EXISTS
(SELECTid FROM PhraseWords AS pw               WHERE (p.id = pw.id) AND word LIKE 'SNOW%');
 

My EXPLAIN for this is:

Nested Loop  (cost=8033.62 rows=172 width=48) ->  Index Scan using idx_phraseword on phraseword pw     (cost=6256.27
rows=867width=4)
 
 ->  Index Scan using phrase_pkey on phrase p     (cost=2.05 rows=46669 width=44)
       SubPlan         ->  Index Scan using idx_phraseword on phraseword pw      (cost=6256.27 rows=1 width=4)
         ->  Index Scan using idx_phraseword on phraseword pw      (cost=6256.27 rows=1 width=4)


For some reason, the select still takes > 1 minute on a fairly decent
sized Linux box (500Mhz, 128MB ram).  I was using IN clauses instead of
the EXISTS clauses, and was noticing the EXPLAIN telling me that it was
doing sequential table lookups.  I then found in the FAQ that this is a
known issue.  I was hoping that by using EXISTS I could do better.

So I guess my two questions are:

1) Is there a way to make this more efficient?
2) Is there another technique in general that will allow me to let  my users search the table for words (or the start
ofwords, in  this case)?
 

Any help would be greatly appreciated!

Scott Saltsgaver




Re: [SQL] Techniques for quickly finding words in a phrase...

From
Tom Lane
Date:
"Saltsgaver, Scott" <scottsa@aiinet.com> writes:
> SELECT p.id, phrase FROM Phrase AS p, PhraseWords AS pw
>     WHERE
>         ((p.id = pw.id) AND word LIKE 'WAS%')
>         AND EXISTS (SELECT id FROM PhraseWords AS pw
>                 WHERE (p.id = pw.id) AND word LIKE 'WHIT%')
>         AND EXISTS (SELECT id FROM PhraseWords AS pw
>                 WHERE (p.id = pw.id) AND word LIKE 'SNOW%');

> For some reason, the select still takes > 1 minute on a fairly decent
> sized Linux box (500Mhz, 128MB ram).

Subselects are pretty inefficient in Postgres at present.  Try rewriting
it as a join:

SELECT p.id, phrase FROM Phrase AS p, PhraseWords AS pw1,PhraseWords AS pw2, PhraseWords AS pw3
WHERE p.id = pw1.id AND pw1.word LIKE 'WAS%'AND p.id = pw2.id AND pw2.word LIKE 'WHIT%'AND p.id = pw3.id AND pw3.word
LIKE'SNOW%';
 

(with the obvious adjustments depending on how many words in your
search phrase).

If you are using search phrases with more than half a dozen words,
you will probably need to enable GEQO planning to avoid spending
an unreasonable amount of time in planning the query.  (If 'explain'
itself starts to take a long time, you are seeing excessive plan time.)
        regards, tom lane


Re: [SQL] Techniques for quickly finding words in a phrase...

From
om
Date:
On Fri, Feb 11, 2000 at 06:08:43PM -0500, Tom Lane wrote:
> "Saltsgaver, Scott" <scottsa@aiinet.com> writes:
> > SELECT p.id, phrase FROM Phrase AS p, PhraseWords AS pw
> >     WHERE
> >         ((p.id = pw.id) AND word LIKE 'WAS%')
> >         AND EXISTS (SELECT id FROM PhraseWords AS pw
> >                 WHERE (p.id = pw.id) AND word LIKE 'WHIT%')
> >         AND EXISTS (SELECT id FROM PhraseWords AS pw
> >                 WHERE (p.id = pw.id) AND word LIKE 'SNOW%');
> 
> > For some reason, the select still takes > 1 minute on a fairly decent
> > sized Linux box (500Mhz, 128MB ram).
> 
> Subselects are pretty inefficient in Postgres at present.  Try rewriting
> it as a join:
> 
> SELECT p.id, phrase FROM Phrase AS p, PhraseWords AS pw1,
>     PhraseWords AS pw2, PhraseWords AS pw3
> WHERE p.id = pw1.id AND pw1.word LIKE 'WAS%'
>     AND p.id = pw2.id AND pw2.word LIKE 'WHIT%'
>     AND p.id = pw3.id AND pw3.word LIKE 'SNOW%';

another approach would leave the PhraseWords table aside and use regular
expressions to find matches in table Phrase. of course, this couldn't take
advantage of indices, but maybe the fact that it avoids the join (or
subselect) helps performance.

SELECT id, phrase FROM Phrase
WHERE  phrase ~* '[[:<:]]was'  AND phrase ~* '[[:<:]]whit'  AND phrase ~* '[[:<:]]snow';


-- oliver