Re: ranking how "similar" are tsvectors was: OR tsquery - Mailing list pgsql-general

From Ivan Sergio Borgonovo
Subject Re: ranking how "similar" are tsvectors was: OR tsquery
Date
Msg-id 20100117192759.4b063416@dawn.webthatworks.it
Whole thread Raw
In response to Re: ranking how "similar" are tsvectors was: OR tsquery  (Oleg Bartunov <oleg@sai.msu.su>)
List pgsql-general
On Sun, 17 Jan 2010 20:19:59 +0300 (MSK)
Oleg Bartunov <oleg@sai.msu.su> wrote:

> Ivan,
>
> You can write function to get lexemes from tsvector:

> CREATE OR REPLACE FUNCTION ts_stat(tsvector, weights text, OUT
> word text, OUT ndoc integer, OUT nentry integer)
> RETURNS SETOF record AS
> $$
>      SELECT ts_stat('SELECT ' || quote_literal( $1::text ) ||
> '::tsvector', quote_literal( $2::text) ); $$ LANGUAGE SQL RETURNS
> NULL ON NULL INPUT IMMUTABLE;

Thanks very much Oleg.

Still it is not really making the pain go away.
I've weights stored in my tsvector and I need to build the query
using them.

This means that if I have:
'aubergine':4A 'orange':1B 'banana':5A 'apple':3C
and
'coconut':3B 'bananas':1A 'tomatoes:2C
stored in a column (tsv) I really would like to build up the query:

to_tsquery('aubergine:A | orange:B | bananas:A | apple:C')

then

tsv
@@
to_tsquery('aubergine:A | orange:B | bananas:A | apple:C')

and relative ts_rank()

I'm aware that it is not symmetrical, but it looks as the cheapest
and fastest thing I can do right now.

I'm using pg_catalog.english. Am I supposing correctly that NO
lexeme will contain spaces?

If that is the case I could simply use string manipulation tools.
Not nice to see but it will work.

> Then, you can create ARRAY like:
>
> select ARRAY ( select (ts_stat(fts,'*')).word from papers where
> id=2);
>
> Then, you will have two arrays and you're free to apply any
> similarity function (cosine, jaccard,....) to calculate what do
> you want. If you want to preserve weights, then use weight label
> instead of '*'.

What ts_rank does is more than enough right now.

> Another idea is to use array_agg, but I'm not ready to discuss it.
>
> Please, keep in mind, that document similarity is a hot topic in

Not hard to imagine.

> IR, and, yes, I and Teodor have something about this, but code
> isn't available for public. Unfortunately, we had no sponsor for
> full-text search for last year and I see no perspectives this
> year, so we postpone our text-search development.

Good luck. Do you have anything like http://www.chipin.com/ for
small donations?

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


pgsql-general by date:

Previous
From: Andy Colson
Date:
Subject: Re: Data Generators
Next
From: Jeff Davis
Date:
Subject: Re: Constraint exclusion issue