ranking how "similar" are tsvectors was: OR tsquery - Mailing list pgsql-general

From Ivan Sergio Borgonovo
Subject ranking how "similar" are tsvectors was: OR tsquery
Date
Msg-id 20100117175624.315cfa55@dawn.webthatworks.it
In response to Re: OR tsquery  (Ivan Sergio Borgonovo <mail@webthatworks.it>)
Responses Re: ranking how "similar" are tsvectors was: OR tsquery  (Oleg Bartunov <oleg@sai.msu.su>)
List pgsql-general
My initial request was about a way to build a tsquery similar to
what plainto_tsquery produces, but using | instead of & as the
glue.

But at the end of the day I'd like to find similar tsvectors and
rank them.

I have a table with several fields that contribute to a weighted
tsvector.

I'd like to pick a tsvector and find the N most similar ones.

I've found this:

http://domas.monkus.lt/document-similarity-postgresql

That's not really too far from what I was trying to do.

But I have precomputed tsvectors (I think turning text into a
tsvector should be a more expensive operation than string
replacement) and I'd like to preserve the weights.

I'm not really sure, but I think a lexeme can actually contain a '
or a space (depending on the stemmer/parser?), so I'd have to take
care of escaping, etc.
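
To make the escaping concern concrete, here is a minimal client-side
sketch in Python. It assumes the lexemes are already normalised (for
example, taken from a precomputed tsvector) and that single quotes and
backslashes inside a quoted lexeme are escaped by doubling them, as in
the tsvector/tsquery text representation. The helper names
quote_lexeme and or_tsquery are hypothetical, not part of PostgreSQL.

```python
def quote_lexeme(lexeme: str) -> str:
    """Quote one lexeme for tsquery text input.

    Assumption: backslashes and single quotes are escaped by doubling.
    Quoting also keeps a lexeme containing a space in one piece.
    """
    escaped = lexeme.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def or_tsquery(lexemes) -> str:
    """Join already-normalised lexemes with | instead of &."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(lexemes))
    if not unique:
        raise ValueError("need at least one lexeme")
    return " | ".join(quote_lexeme(lex) for lex in unique)


print(or_tsquery(["fat", "cat", "o'neil"]))
# prints: 'fat' | 'cat' | 'o''neil'
```

The resulting string would presumably be passed as a bound parameter
and cast with ::tsquery rather than fed to to_tsquery, since the cast
does not normalise the lexemes a second time.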

Since there is no direct access to the elements of a tsvector, the
only "correct" way I see to build the query would be to manually
rebuild the tsvector, getting the result back as a record using
ts_debug and ts_lexize... which looks like a bit of a PITA.

I don't even think that having direct access to the elements of a
tsvector would completely solve the problem, since tsvectors store
positions too, but it would be a step forward in making it easier to
compare documents and find similar ones.
An operator that checks the intersection of tsvectors would come in
handy.
Adding a ts_rank(tsvector, tsvector) would surely help too.
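
As a rough illustration of what an intersection-based ranking could
look like, here is a Python sketch that scores two documents by a
weighted Jaccard measure over their lexemes. It is not how ts_rank
works. It assumes each tsvector has already been turned into a
mapping from lexeme to weight label on the client side, it ignores
positions entirely, and it borrows the default ts_rank weights for
the labels D, C, B and A. The names similarity and most_similar are
hypothetical.

```python
# Per-label weights, borrowed from ts_rank's defaults
# (D = 0.1, C = 0.2, B = 0.4, A = 1.0).
LABEL_WEIGHT = {"A": 1.0, "B": 0.4, "C": 0.2, "D": 0.1}


def similarity(doc_a: dict, doc_b: dict) -> float:
    """Weighted Jaccard score between two lexeme -> label mappings."""
    all_lexemes = doc_a.keys() | doc_b.keys()
    if not all_lexemes:
        return 0.0
    shared = doc_a.keys() & doc_b.keys()

    def weight(lexeme: str) -> float:
        # Use the strongest label the lexeme carries in either document.
        return max(
            LABEL_WEIGHT[doc[lexeme]]
            for doc in (doc_a, doc_b)
            if lexeme in doc
        )

    return sum(map(weight, shared)) / sum(map(weight, all_lexemes))


def most_similar(target: dict, candidates: dict, n: int) -> list:
    """Return the ids of the n candidates that score highest against target."""
    ranked = sorted(
        candidates,
        key=lambda doc_id: similarity(target, candidates[doc_id]),
        reverse=True,
    )
    return ranked[:n]
```

For example, most_similar({"cat": "A"}, {1: {"cat": "B"}, 2: {"dog": "A"}}, 1)
picks document 1, because it is the only candidate sharing a lexeme
with the target.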

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

