Re: Similarity search for sentences - Mailing list pgsql-general

From Rémi Cura
Subject Re: Similarity search for sentences
Date
Msg-id CAJvUf_tb_bdk4nCMHMfRB2XvFTUxzYW0ho6kL5ALGp2y3nKxvg@mail.gmail.com
In response to Similarity search for sentences  ("Janek Sendrowski" <janek12@web.de>)
List pgsql-general
This may be a totally bad idea:
explode each sentence into (sentence_number, one_word) rows, one row per word (this makes a big table, which you may want to partition),
then put a classic index on sentence_number and another on one_word (a btree if you compare with =, something more subtle if you use "like 'word'").
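
To make the idea concrete, here is a minimal sketch of the exploded table. It uses Python's built-in sqlite3 module only so that it runs anywhere; it is not PostgreSQL, and the names sentence_word, build_word_table and similar_sentences are made up for illustration. Splitting on whitespace stands in for whatever tokenizing or stemming you would really use.

```python
import sqlite3


def build_word_table(sentences):
    """Explode each sentence into (sentence_number, one_word) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sentence_word (sentence_number INTEGER, one_word TEXT)"
    )
    rows = [
        (number, word)
        for number, text in enumerate(sentences, start=1)
        # Assumption: lowercase + whitespace split; one row per distinct word.
        for word in set(text.lower().split())
    ]
    conn.executemany("INSERT INTO sentence_word VALUES (?, ?)", rows)
    # Plain (btree) indexes, one per column, as suggested above.
    conn.execute(
        "CREATE INDEX idx_sentence_number ON sentence_word (sentence_number)"
    )
    conn.execute("CREATE INDEX idx_one_word ON sentence_word (one_word)")
    return conn


def similar_sentences(conn, query, limit=5):
    """Rank sentences by how many distinct words they share with the query."""
    words = sorted(set(query.lower().split()))
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    sql = f"""
        SELECT sentence_number, COUNT(*) AS shared_words
        FROM sentence_word
        WHERE one_word IN ({placeholders})
        GROUP BY sentence_number
        ORDER BY shared_words DESC, sentence_number
        LIMIT ?
    """
    return conn.execute(sql, (*words, limit)).fetchall()
```

Calling similar_sentences(conn, "the cat sat") returns (sentence_number, shared_words) pairs, best match first. The equality lookup on one_word is the part that a btree index serves well.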

Depending on performance, it could be worth regrouping by word:
sentence_number[], one_word
Then you could try an array or hstore for sentence_number[]?
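
A rough sketch of the regrouped form, in plain Python rather than SQL, assuming the (sentence_number, one_word) rows already exist. The function names group_by_word and rank_by_shared_words are hypothetical, and the dict of lists only stands in for a one_word column with a sentence_number[] array next to it.

```python
from collections import Counter, defaultdict


def group_by_word(rows):
    """Turn (sentence_number, one_word) rows into one_word -> sentence numbers."""
    grouped = defaultdict(list)
    for sentence_number, one_word in rows:
        grouped[one_word].append(sentence_number)
    # Deduplicate and sort so each word carries a clean array of sentence numbers.
    return {word: sorted(set(numbers)) for word, numbers in grouped.items()}


def rank_by_shared_words(grouped, query_words):
    """Count, per sentence, how many distinct query words it contains."""
    counts = Counter()
    for word in set(query_words):
        counts.update(grouped.get(word, []))
    # Most shared words first; ties broken by sentence number.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

With this layout a search touches one entry per query word instead of one row per (sentence, word) pair, which is the reason the regrouping might pay off.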

Cheers,

Rémi-C


2013/12/5 Janek Sendrowski <janek12@web.de>
Hi,
 
I have tables with millions of sentences. Each row contains a sentence. It is natural language and every language is possible, but the sentences of one table have the same language.
I have to do a similarity search on them. It has to be very fast, because I have to search for a few hundred sentences many times.
The search shouldn't be context-based. It should just return sentences with similar words (maybe stemmed).
 
I have already tried GiST/GIN-index-based trigram search (pg_trgm extension), full-text search (tsearch2 extension) and pivot-based indexing (Fixed Query Array), but they are all too slow or not suitable.
Soundex and Metaphone aren't suitable either.
 
I have been working on this project for a long time, but without any success.
Do any of you have an idea?
 
I would be very thankful for help.
 
Janek Sendrowski


