Re: ranking how "similar" are tsvectors was: OR tsquery - Mailing list pgsql-general

From Oleg Bartunov
Subject Re: ranking how "similar" are tsvectors was: OR tsquery
Msg-id Pine.LNX.4.64.1001172012500.16860@sn.sai.msu.ru
In response to ranking how "similar" are tsvectors was: OR tsquery  (Ivan Sergio Borgonovo <mail@webthatworks.it>)
Responses Re: ranking how "similar" are tsvectors was: OR tsquery  (Ivan Sergio Borgonovo <mail@webthatworks.it>)
List pgsql-general
Ivan,

You can write a function to get the lexemes from a tsvector:

CREATE OR REPLACE FUNCTION ts_stat(tsvector, weights text,
    OUT word text, OUT ndoc integer, OUT nentry integer)
RETURNS SETOF record AS
$$
     -- Wrap the tsvector in a one-row query for the built-in ts_stat(text, text).
     -- The weights string is passed through unquoted.
     SELECT ts_stat('SELECT ' || quote_literal( $1::text ) || '::tsvector', $2 );
$$ LANGUAGE SQL RETURNS NULL ON NULL INPUT IMMUTABLE;

Then you can build an array of lexemes like this:

select ARRAY ( select (ts_stat(fts,'*')).word from papers where id=2);

Do this for both documents and you will have two arrays, to which you are
free to apply any similarity function (cosine, Jaccard, ...) to calculate
whatever you need. If you want to take weights into account, pass a weight
label (for example 'a') instead of '*'.
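
As a rough illustration of the "two arrays" step, here is a minimal sketch
of a Jaccard similarity over two lexeme arrays. The name jaccard_similarity
is hypothetical (it is not part of PostgreSQL), the sketch assumes unnest()
is available (8.4 or later), and id = 3 is just an assumed second row in
the papers table.

```sql
-- Hypothetical helper, not a built-in: Jaccard similarity of two lexeme
-- arrays, i.e. (number of distinct shared lexemes) / (number of distinct
-- lexemes in either array). Returns 0 when both arrays are empty.
CREATE OR REPLACE FUNCTION jaccard_similarity(a text[], b text[])
RETURNS float8 AS
$$
    SELECT CASE WHEN u.n = 0 THEN 0::float8
                ELSE i.n::float8 / u.n
           END
    FROM (SELECT count(*) AS n
          FROM (SELECT unnest($1) INTERSECT SELECT unnest($2)) AS shared) AS i,
         (SELECT count(*) AS n
          FROM (SELECT unnest($1) UNION SELECT unnest($2)) AS combined) AS u;
$$ LANGUAGE SQL RETURNS NULL ON NULL INPUT IMMUTABLE;

-- Compare the lexemes of two documents (id = 3 is an assumed second row).
SELECT jaccard_similarity(
    ARRAY ( select (ts_stat(fts,'*')).word from papers where id=2 ),
    ARRAY ( select (ts_stat(fts,'*')).word from papers where id=3 ));
```

This version ignores weights and positions entirely; it only measures the
overlap between the two sets of lexemes. A cosine measure would need the
nentry counts as well, so it cannot work from the word arrays alone.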


Another idea is to use array_agg, but I'm not ready to discuss it.

Please keep in mind that document similarity is a hot topic in IR, and,
yes, Teodor and I have something on this, but the code isn't publicly
available. Unfortunately, we had no sponsor for full-text search last
year and I see no prospects this year, so we are postponing our text-search
development.

Oleg

On Sun, 17 Jan 2010, Ivan Sergio Borgonovo wrote:

> My initial request was about a way to build up a tsquery
> similar to what plainto_tsquery does but using | instead of &
> as a glue.
>
> But at the end of the day I'd like to find similar tsvectors and
> rank them.
>
> I've a table containing several fields that contribute to build up a
> weighted tsvector.
>
> I'd like to pick up a tsvector and find which are the N most similar
> ones.
>
> I've found this:
>
> http://domas.monkus.lt/document-similarity-postgresql
>
> That's not really too far from what I was trying to do.
>
> But I have precomputed tsvectors (I think turning text into a
> tsvector should be a more expensive operation than string
> replacement) and I'd like to conserve weights.
>
> I'm not really sure but I think a lexeme can actually contain a '
> or a space (depending on stemmer/parser?), so I'd have to take care
> of escaping etc...
>
> Since there is no direct access to the elements of a tsvector... the
> only "correct" way I see to build the query would be to manually
> rebuild the tsvector and get back the result as a record using
> ts_debug and ts_lexize... that looks like a bit of a PITA.
>
> I don't even think that having direct access to elements of a
> tsvector will completely solve the problem since tsvectors store
> positions too, but it will be a step forward in making it easier to
> compare documents to find similar ones.
> An operator that checks the intersection of tsvectors would come
> in handy.
> Adding a ts_rank(tsvector, tsvector) will surely help too.
>
> thanks
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
