My initial request was about a way to build a tsquery similar to
what plainto_tsquery produces, but using | instead of & as the
glue.
But at the end of the day I'd like to find similar tsvectors and
rank them.
I've a table containing several fields that contribute to build up a
weighted tsvector.
I'd like to pick a tsvector and find the N most similar ones.
I've found this:
http://domas.monkus.lt/document-similarity-postgresql
That's not really too far from what I was trying to do.
But I have precomputed tsvectors (I think turning text into a
tsvector should be a more expensive operation than string
replacement) and I'd like to preserve the weights.
I'm not really sure but I think a lexeme can actually contain a '
or a space (depending on stemmer/parser?), so I'd have to take care
of escaping etc...
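
To make the escaping concern concrete, here is a rough client-side
sketch in Python of joining already-normalized lexemes with |. It
assumes the quoting rule from the PostgreSQL text search docs
(lexemes go in single quotes, embedded quotes and backslashes are
doubled). The names quote_lexeme and or_tsquery are made up for
illustration and are not part of PostgreSQL.

```python
def quote_lexeme(lexeme):
    """Quote one lexeme for tsquery text input.

    Assumption: embedded single quotes and backslashes must be doubled,
    which also lets a lexeme containing a space survive as one token.
    """
    escaped = lexeme.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def or_tsquery(lexemes):
    """Join (lexeme, weights) pairs with | instead of &.

    `weights` is a string such as "A" or "AB", or "" for no restriction.
    """
    parts = []
    for lexeme, weights in lexemes:
        term = quote_lexeme(lexeme)
        if weights:
            term += ":" + weights
        parts.append(term)
    return " | ".join(parts)


if __name__ == "__main__":
    print(or_tsquery([("fat", "A"), ("o'neil", ""), ("new york", "B")]))
```

The resulting string would still have to be passed to the server as
a bound parameter and cast to tsquery there.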
Since there is no direct access to the elements of a tsvector... the
only "correct" way I see to build the query would be to manually
rebuild the tsvector and get the result back as a record using
ts_debug and ts_lexize... that looks like a bit of a PITA.
I don't even think that having direct access to the elements of a
tsvector would completely solve the problem, since tsvectors store
positions too, but it would be a step forward in making it easier to
compare documents to find similar ones.
An operator that checks the intersection of tsvectors would come in
handy.
Adding a ts_rank(tsvector, tsvector) would surely help too.
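
As a rough idea of what such an intersection-based ranking could
compute, here is a minimal Python sketch that scores two lexeme sets
by Jaccard overlap and picks the N best candidates. This is only an
illustration outside the database: it ignores weights and positions,
and the names lexeme_overlap and most_similar are hypothetical, not
existing PostgreSQL functions.

```python
def lexeme_overlap(a, b):
    """Jaccard overlap of two lexeme collections, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(target, candidates, n):
    """Return the ids of the n candidates that overlap most with target.

    `candidates` maps a document id to its collection of lexemes.
    Weights and positions are deliberately ignored in this sketch.
    """
    scored = [(lexeme_overlap(target, lexemes), doc_id)
              for doc_id, lexemes in candidates.items()]
    # Sort on the score only, highest first; ties keep insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _score, doc_id in scored[:n]]


if __name__ == "__main__":
    docs = {1: {"fat", "cat", "sat"}, 2: {"dog"}, 3: {"cat"}}
    print(most_similar({"fat", "cat"}, docs, 2))
```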
thanks
--
Ivan Sergio Borgonovo
http://www.webthatworks.it