Thread: Similarity search for sentences

Similarity search for sentences

From
"Janek Sendrowski"
Date:
Hi,
 
I have tables with millions of sentences. Each row contains a sentence. It is natural language and every language is
possible,but the sentences of one table have the same language. 
I have to do a similarity search on them. It has to be very fast, because I have to search for a few hundert sentences
manytimes. 
The search shouldn't be context-based. It should just get sentences with similar words(maybe stemmed).
 
I already had a try with gist/gin-index-based trigramm search (pg_trgm extension), fulltextsearch (tsearch2 extension)
anda pivot-based indexing (Fixed Query Array), but it's all to slow or not suitable. 
Soundex and Metaphone aren't suitable, as well.
 
I'm already working on this project since a long time, but without any success.
Do any of you have an idea?
 
I would be very thankful for help.
 
Janek Sendrowski


Re: Similarity search for sentences

From
Rémi Cura
Date:
May be totally a bad idea :
explode your sentence into(sentence_number, one_word), n times , (makes a big table, you may want to partition)
then, classic index on sentence number, and on the one world (btree if you make = comparison , more subtel if you do "like 'word' ")

depending on perf, it could be wort it to regroup by words :
sentence_number[], on_word
Then you could try array or hstore on sentence_number[] ?

Cheers,

Rémi-C


2013/12/5 Janek Sendrowski <janek12@web.de>
Hi,
 
I have tables with millions of sentences. Each row contains a sentence. It is natural language and every language is possible, but the sentences of one table have the same language.
I have to do a similarity search on them. It has to be very fast, because I have to search for a few hundert sentences many times.
The search shouldn't be context-based. It should just get sentences with similar words(maybe stemmed).
 
I already had a try with gist/gin-index-based trigramm search (pg_trgm extension), fulltextsearch (tsearch2 extension) and a pivot-based indexing (Fixed Query Array), but it's all to slow or not suitable.
Soundex and Metaphone aren't suitable, as well.
 
I'm already working on this project since a long time, but without any success.
Do any of you have an idea?
 
I would be very thankful for help.
 
Janek Sendrowski


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Similarity search for sentences

From
Kevin Grittner
Date:
Janek Sendrowski <janek12@web.de> wrote:

> I already had a try with gist/gin-index-based trigramm search
> (pg_trgm extension), fulltextsearch (tsearch2 extension) and a
> pivot-based indexing (Fixed Query Array), but it's all to slow or
> not suitable.

When you tried tsearch2, did you use a trigger to store the
tsvector, or did you use a functional index?  I found the former to
be a couple orders of magnitude faster with an index on court
document text.  Likewise, the trigram searches I used in production
had wildly different performance depending on the similarity
threshold or the LIMIT for KNN searches.  You may want to get the
most promising technology to run as fast as you can, and then post
to the pgsql-performance list with the information suggested here:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

The problems may be solvable, but nobody is likely to know what to
suggest without more to work with.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Similarity search for sentences

From
"Janek Sendrowski"
Date:
Hi,
thanks for your Answers.
 
@Rémi Cura
You suggest a kind of Full Text Search.  I already had a try with the tsearch2 extension.
The issue is to realize the similarity search. I have to use many OR statements with a low set of arguments.
That significantly slows the FTS down.
 
@Kevin Grittner
I used my own trigger to store the tsvector of the sentences and I created a usual gist Index on them.
What kind of functional Index would you suggest. Like i already told Rémi, I have to to use many OR statements with a
lowset of arguments, which heavy damages the perfance. 
Do you have a better idea?
I usually used a query like this:
 
The tiger is the largest cat species[http://en.wikipedia.org/wiki/Felidae], reaching a total body length of up to 3.3 m
 andweighing up to 306 kg. 

--------------------------------------------------------------------------------------------------------------------------------------------------
totsvector:
'3.3':16 '306':22 'bodi':11 'cat':6 'kg':23 'largest':5 'length':12 'm':17 'reach':8 'speci':7 'tiger':2 'total':10
'weigh':19
(1 row)
 
SELECT * FROM tablename WHERE vector @@ to_tsquery('speci & tiger & total & weigh') AND vector @@
to_tsquery('largest & length & m & reach')  ANDvector @@ to_tsquery('3.3 & 306 & bodi & cat & kg'); 

And thats very slow
 
I didn't know that the pg_trgm Module provides KNN search.
 
Janek Sendrowski
 
 
 


Re: Similarity search for sentences

From
Kevin Grittner
Date:
Janek Sendrowski <janek12@web.de> wrote:

> I didn't know that the pg_trgm Module provides KNN search

It does, although my own experience shows that it tends to be more
appropriate for name searches or similar smaller columns than for
big text columns.  Using the war_and_peace table from another
thread:

test=# CREATE INDEX war_and_peace_linetext_trgm2 ON war_and_peace
using gist (linetext gist_trgm_ops);
CREATE INDEX
test=# VACUUM ANALYZE war_and_peace;
VACUUM
test=# -- Use a KNN search for the words.
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
  ORDER BY 'ladies gentlemen provinces distance' <-> linetext
  LIMIT 10;
                                                                          QUERY
PLAN                                                                          

---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.28..2.74 rows=10 width=116) (actual time=42.157..42.246 rows=10 loops=1)
   ->  Index Scan using war_and_peace_linetext_trgm2 on war_and_peace  (cost=0.28..16016.42 rows=65007 width=116)
(actualtime=42.155..42.243 rows=10 loops=1) 
         Order By: (linetext <-> 'ladies gentlemen provinces distance'::text)
 Total runtime: 42.716 ms
(4 rows)

As you can see, it pulls the requested number of rows from the
index in the order of "distance" (inverse similarity).  It is,
however, not as fast as a tsearch2 search (43 ms instead of a
fraction of a ms), and it doesn't give you options to use AND/OR as
you might like.  It is still a lot faster than scanning the whole
table and applying the test to each row.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company