Thread: Similarity search for sentences
Hi,

I have tables with millions of sentences. Each row contains a sentence. It is natural language and every language is possible, but the sentences of one table are all in the same language.

I have to do a similarity search on them. It has to be very fast, because I have to search for a few hundred sentences many times.

The search shouldn't be context-based. It should just get sentences with similar words (maybe stemmed).

I already had a try with gist/gin-index-based trigram search (pg_trgm extension), full-text search (tsearch2 extension) and pivot-based indexing (Fixed Query Array), but it's all too slow or not suitable. Soundex and Metaphone aren't suitable either.

I've been working on this project for a long time, but without any success. Do any of you have an idea? I would be very thankful for help.

Janek Sendrowski
Maybe a totally bad idea:
Explode your sentences into (sentence_number, one_word) rows, n times per sentence (makes a big table, you may want to partition).
Then a classic index on sentence_number and one on the one_word column (btree if you do = comparisons, something more subtle if you do "LIKE 'word'").
Depending on performance, it could be worth it to regroup by word:
(sentence_number[], one_word)
Then you could try array or hstore on sentence_number[]?
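Something like this, as a rough sketch only (the table sentences(id, sentence) and all other names here are placeholders, not from the original post):

  CREATE TABLE sentence_words (
      sentence_number bigint NOT NULL,
      one_word        text   NOT NULL
  );

  -- one row per (sentence, word) pair, splitting on non-word characters
  INSERT INTO sentence_words (sentence_number, one_word)
  SELECT sentence_number, one_word
  FROM (SELECT id AS sentence_number,
               regexp_split_to_table(lower(sentence), '\W+') AS one_word
        FROM sentences) AS exploded
  WHERE one_word <> '';

  CREATE INDEX sentence_words_word_idx     ON sentence_words (one_word);        -- for one_word = 'foo'
  CREATE INDEX sentence_words_sentence_idx ON sentence_words (sentence_number); -- to get back to the sentence

  -- optional regrouping by word:
  CREATE TABLE word_sentences AS
  SELECT one_word, array_agg(sentence_number) AS sentence_numbers
  FROM sentence_words
  GROUP BY one_word;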
Cheers,
Rémi-C
2013/12/5 Janek Sendrowski <janek12@web.de>
Hi,
I have tables with millions of sentences. Each row contains a sentence. It is natural language and every language is possible, but the sentences of one table are all in the same language.
I have to do a similarity search on them. It has to be very fast, because I have to search for a few hundred sentences many times.
The search shouldn't be context-based. It should just get sentences with similar words (maybe stemmed).
I already had a try with gist/gin-index-based trigram search (pg_trgm extension), full-text search (tsearch2 extension) and pivot-based indexing (Fixed Query Array), but it's all too slow or not suitable.
Soundex and Metaphone aren't suitable either.
I've been working on this project for a long time, but without any success.
Do any of you have an idea?
I would be very thankful for help.
Janek Sendrowski
Janek Sendrowski <janek12@web.de> wrote:

> I already had a try with gist/gin-index-based trigram search
> (pg_trgm extension), full-text search (tsearch2 extension) and
> pivot-based indexing (Fixed Query Array), but it's all too slow
> or not suitable.

When you tried tsearch2, did you use a trigger to store the tsvector, or did you use a functional index? I found the former to be a couple of orders of magnitude faster with an index on court document text. Likewise, the trigram searches I used in production had wildly different performance depending on the similarity threshold or the LIMIT for KNN searches.

You may want to get the most promising technology to run as fast as you can, and then post to the pgsql-performance list with the information suggested here:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

The problems may be solvable, but nobody is likely to know what to suggest without more to work with.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
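(For reference, a minimal sketch of the two alternatives mentioned above; the table sentences(sentence), the column and index names, and the 'english' configuration are placeholders, not taken from the thread:)

  -- Option 1: tsvector stored in a column, kept current by a trigger,
  -- with an index on the stored column:
  ALTER TABLE sentences ADD COLUMN vector tsvector;

  CREATE TRIGGER sentences_vector_update
      BEFORE INSERT OR UPDATE ON sentences
      FOR EACH ROW
      EXECUTE PROCEDURE tsvector_update_trigger(vector, 'pg_catalog.english', sentence);

  UPDATE sentences SET sentence = sentence;  -- fire the trigger for existing rows
  CREATE INDEX sentences_vector_idx ON sentences USING gin (vector);

  -- Option 2: no stored column, a functional (expression) index instead;
  -- queries must repeat the same expression to use it:
  CREATE INDEX sentences_fts_expr_idx
      ON sentences USING gin (to_tsvector('english', sentence));

  SELECT * FROM sentences
  WHERE to_tsvector('english', sentence) @@ to_tsquery('english', 'tiger & cat');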
Hi,

thanks for your answers.

@Rémi Cura
You suggest a kind of full-text search. I already had a try with the tsearch2 extension. The issue is how to realize the similarity search. I have to use many OR statements with a low set of arguments. That significantly slows the FTS down.

@Kevin Grittner
I used my own trigger to store the tsvector of the sentences and I created a usual gist index on them. What kind of functional index would you suggest? Like I already told Rémi, I have to use many OR statements with a low set of arguments, which heavily damages the performance. Do you have a better idea?

I usually used a query like this:

The tiger is the largest cat species[http://en.wikipedia.org/wiki/Felidae], reaching a total body length of up to 3.3 m and weighing up to 306 kg.
--------------------------------------------------------------------------------------------------
to_tsvector: '3.3':16 '306':22 'bodi':11 'cat':6 'kg':23 'largest':5 'length':12 'm':17 'reach':8 'speci':7 'tiger':2 'total':10 'weigh':19
(1 row)

SELECT * FROM tablename
WHERE vector @@ to_tsquery('speci & tiger & total & weigh')
  AND vector @@ to_tsquery('largest & length & m & reach')
  AND vector @@ to_tsquery('3.3 & 306 & bodi & cat & kg');

And that's very slow.

I didn't know that the pg_trgm module provides KNN search.

Janek Sendrowski
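(For illustration, the OR-heavy form described above might look roughly like this; the term groups are made up for this sketch and are not from the original message:)

  SELECT * FROM tablename
  WHERE vector @@ to_tsquery('speci & tiger')
     OR vector @@ to_tsquery('largest & length')
     OR vector @@ to_tsquery('bodi & cat & kg');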
Janek Sendrowski <janek12@web.de> wrote:

> I didn't know that the pg_trgm module provides KNN search

It does, although my own experience shows that it tends to be more appropriate for name searches or similar smaller columns than for big text columns. Using the war_and_peace table from another thread:

test=# CREATE INDEX war_and_peace_linetext_trgm2 ON war_and_peace using gist (linetext gist_trgm_ops);
CREATE INDEX
test=# VACUUM ANALYZE war_and_peace;
VACUUM
test=# -- Use a KNN search for the words.
test=# EXPLAIN ANALYZE
       SELECT * FROM war_and_peace
       ORDER BY 'ladies gentlemen provinces distance' <-> linetext
       LIMIT 10;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.28..2.74 rows=10 width=116) (actual time=42.157..42.246 rows=10 loops=1)
   ->  Index Scan using war_and_peace_linetext_trgm2 on war_and_peace  (cost=0.28..16016.42 rows=65007 width=116) (actual time=42.155..42.243 rows=10 loops=1)
         Order By: (linetext <-> 'ladies gentlemen provinces distance'::text)
 Total runtime: 42.716 ms
(4 rows)

As you can see, it pulls the requested number of rows from the index in order of "distance" (inverse similarity). It is, however, not as fast as a tsearch2 search (43 ms instead of a fraction of a ms), and it doesn't give you options to use AND/OR as you might like. It is still a lot faster than scanning the whole table and applying the test to each row.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
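(For comparison, a minimal sketch of the threshold-based trigram search mentioned earlier in the thread, reusing the same war_and_peace table; the 0.4 threshold is just an example value:)

  -- set the similarity threshold used by the % operator (pg_trgm default is 0.3)
  SELECT set_limit(0.4);

  -- boolean similarity match; this can use the same gist_trgm_ops index as the KNN search
  SELECT * FROM war_and_peace
  WHERE linetext % 'ladies gentlemen provinces distance';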