Re: Fastest Index/Algorithm to find similar sentences - Mailing list pgsql-general

From Kevin Grittner
Subject Re: Fastest Index/Algorithm to find similar sentences
Date
Msg-id 1375457112.43393.YahooMailNeo@web162905.mail.bf1.yahoo.com
Whole thread Raw
In response to Fastest Index/Algorithm to find similar sentences  ("Janek Sendrowski" <janek12@web.de>)
Responses Re: Fastest Index/Algorithm to find similar sentences  (Merlin Moncure <mmoncure@gmail.com>)
List pgsql-general
Janek Sendrowski <janek12@web.de> wrote:

> I also tried pg_trgm module, which works with tri-grams, but it's
> also very slow with 100.000+ rows.

Hmm.  I found the pg_trgm module very fast for name searches with
millions of rows *as long as I used KNN-GiST techniques*.  Were you
careful to do so?  Check out the "Index Support" section of this
page:

http://www.postgresql.org/docs/current/static/pgtrgm.html

While I have not tested this technique with a column containing
sentences, I would expect it to work well.  As a quick
confirmation, I imported the text form of War and Peace into a
table, with one row per *line* (because that was easier than
parsing sentence boundaries for a quick test).  That was over
65,000 rows.

test=# select * from war_and_peace order by linetext <-> 'young wealthy gay gentlemen' limit 3;
 lineno |                               linetext                                
--------+-----------------------------------------------------------------------
   9082 | The gentlemen assembled at Bilibin's were young, wealthy, gay society
  36575 | "Gentlemen, you are crushing me!..."
  55997 | * "Good day, gentlemen."
(3 rows)

test=# explain analyze select * from war_and_peace order by linetext <-> 'young wealthy gay gentlemen' limit 3;
                                                               QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.28..0.80 rows=3 width=53) (actual time=37.986..38.002 rows=3 loops=1)
   ->  Index Scan using wap_text on war_and_peace  (cost=0.28..11216.42 rows=65007 width=53) (actual
time=37.984..37.999rows=3 loops=1) 
         Order By: (linetext <-> 'young wealthy gay gentlemen'::text)
 Total runtime: 38.180 ms
(4 rows)

To me, 38 milliseconds to search War and Peace for best matches to
a text string seems reasonable; I'm not sure what you're looking
for, since you didn't give any numbers.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-general by date:

Previous
From: Lionel Elie Mamane
Date:
Subject: Identify primary key in simple/updatable view
Next
From: Melvin Call
Date:
Subject: Understanding database schemas