ranking how "similar" are tsvectors was: OR tsquery - Mailing list pgsql-general

From	Ivan Sergio Borgonovo
Subject	ranking how "similar" are tsvectors was: OR tsquery
Date	January 17, 2010 12:56:41
Msg-id	20100117175624.315cfa55@dawn.webthatworks.it
In response to	Re: OR tsquery (Ivan Sergio Borgonovo <mail@webthatworks.it>)
Responses	Re: ranking how "similar" are tsvectors was: OR tsquery
List	pgsql-general

My initial request was for a way to build a tsquery similar to what
plainto_tsquery produces, but using | instead of & to join the
terms.

But at the end of the day I'd like to find similar tsvectors and
rank them.

I have a table in which several fields contribute to a weighted
tsvector.

I'd like to pick one tsvector and find the N most similar ones.

I've found this:

http://domas.monkus.lt/document-similarity-postgresql

That's not really too far from what I was trying to do.

But I have precomputed tsvectors (I think turning text into a
tsvector should be a more expensive operation than string
replacement) and I'd like to preserve the weights.

I'm not really sure but I think a lexeme can actually contain a '
or a space (depending on stemmer/parser?), so I'd have to take care
of escaping etc...
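
As a rough illustration of the escaping concern, here is a minimal
Python sketch that joins lexemes with | and quotes each one. It
assumes the tsquery text syntax in which a quoted lexeme is wrapped
in single quotes, with embedded single quotes and backslashes
doubled. The function names quote_lexeme and or_tsquery are
hypothetical, the lexemes are assumed to be already normalized, and
weights are not handled.

```python
def quote_lexeme(lexeme):
    """Quote one lexeme for use inside a tsquery string.

    Assumption: doubling backslashes and single quotes inside a
    single-quoted lexeme is sufficient escaping.
    """
    escaped = lexeme.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def or_tsquery(lexemes):
    """Join lexemes with | (OR) instead of the & (AND) glue."""
    # Drop empty strings and duplicates while keeping first-seen order.
    unique = [lex for lex in dict.fromkeys(lexemes) if lex]
    return " | ".join(quote_lexeme(lex) for lex in unique)


if __name__ == "__main__":
    print(or_tsquery(["fat", "cat", "o'neil", "new york"]))
```

The resulting string would be passed to to_tsquery or cast to
tsquery as a bound parameter rather than pasted into the SQL text.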

Since there is no direct access to the elements of a tsvector, the
only "correct" way I see to build the query would be to rebuild the
tsvector manually and get the result back as a record using
ts_debug and ts_lexize, which looks like a bit of a PITA.

I don't even think that having direct access to the elements of a
tsvector would completely solve the problem, since tsvectors store
positions too, but it would be a step toward making it easier to
compare documents and find similar ones.
An operator that checks the intersection of tsvectors would come in
handy.
Adding a ts_rank(tsvector, tsvector) would surely help too.
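
To make the idea of comparing two tsvectors concrete, here is a
Python sketch of one possible scoring, a weighted Jaccard overlap.
This is not an existing PostgreSQL function: the dict representation
(lexeme mapped to its weight label), the names similarity and top_n,
and the choice of weighted Jaccard are all assumptions made for
illustration. The numeric weights mirror the ts_rank defaults for
D, C, B, A ({0.1, 0.2, 0.4, 1.0}). Positions are ignored.

```python
# Assumed numeric value per weight label, mirroring ts_rank's defaults.
WEIGHTS = {"A": 1.0, "B": 0.4, "C": 0.2, "D": 0.1}


def similarity(a, b):
    """Weighted Jaccard overlap of two {lexeme: weight_label} dicts."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    shared = a.keys() & b.keys()
    # Shared lexemes count with the smaller of their two weights.
    num = sum(min(WEIGHTS[a[lex]], WEIGHTS[b[lex]]) for lex in shared)
    # Every lexeme counts with its larger weight (0.0 where absent).
    den = sum(
        max(WEIGHTS.get(a.get(lex), 0.0), WEIGHTS.get(b.get(lex), 0.0))
        for lex in union
    )
    return num / den


def top_n(target, candidates, n):
    """Return the n (id, score) pairs most similar to target."""
    scored = [(key, similarity(target, vec)) for key, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


if __name__ == "__main__":
    doc = {"cat": "A", "fat": "D"}
    others = {1: {"cat": "A", "rat": "D"}, 2: {"dog": "B"}}
    print(top_n(doc, others, 1))
```

A lexeme that matches in a high-weight field on both sides dominates
the score, which is the reason to preserve the weights rather than
compare bare lexeme sets.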

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it