Thread: OR tsquery

OR tsquery

From

Ivan Sergio Borgonovo

Date:

16 January 2010, 13:32:23

to_tsquery and plainto_tsquery produce AND tsquery
'apple banana orange' -> 'apple' & 'banana' & 'orange'
I can't see anything that will produce OR tsquery.
'apple banana orange' -> 'apple' | 'banana' | 'orange'

The only thing I can think of is looping on ts_lexize that looks not
very efficient in plpgsql.

Am I missing something?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

Re: OR tsquery

From

Tom Lane

Date:

16 January 2010, 15:06:08

Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> to_tsquery and plainto_tsquery produce AND tsquery
> 'apple banana orange' -> 'apple' & 'banana' & 'orange'

Only the latter.  Try to_tsquery again:

regression=# select to_tsquery('foo bar');
ERROR:  syntax error in tsquery: "foo bar"
regression=# select to_tsquery('foo & bar');
  to_tsquery
---------------
 'foo' & 'bar'
(1 row)

regression=# select to_tsquery('foo | bar');
  to_tsquery
---------------
 'foo' | 'bar'
(1 row)


            regards, tom lane

Re: OR tsquery

From

Oleg Bartunov

Date:

16 January 2010, 15:10:58

Ivan,

did you ever read official documentation ?
http://www.postgresql.org/docs/8.4/static/textsearch-controls.html

Oleg
On Sat, 16 Jan 2010, Ivan Sergio Borgonovo wrote:

> to_tsquery and plainto_tsquery produce AND tsquery
> 'apple banana orange' -> 'apple' & 'banana' & 'orange'
> I can't see anything that will produce OR tsquery.
> 'apple banana orange' -> 'apple' | 'banana' | 'orange'
>
> The only thing I can think of is looping on ts_lexize that looks not
> very efficient in plpgsql.
>
> Am I missing something?
>
> thanks
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: OR tsquery

From

Ivan Sergio Borgonovo

Date:

16 January 2010, 19:59:00

On Sat, 16 Jan 2010 19:10:45 +0300 (MSK)
Oleg Bartunov <oleg@sai.msu.su> wrote:

> Ivan,

> did you ever read official documentation ?
> http://www.postgresql.org/docs/8.4/static/textsearch-controls.html

Yes but I still can't find something that works like plainto_tsquery
but with | or any example that wouldn't make obtaining that result
convoluted.

plainto_tsquery do a couple of stuff that I find hard to replicate
with the available functions.
It split a string into lexemes.
It loops over the lexemes to build up the query with &.

Something like:

select (
  string_to_array(
    strip(
    to_tsvector('pg_catalog.english',
      'orange banana woods booking'))::text
   , ' ')
   )[i]
  from generate_series(0,3) s(i);

and then gluing up the pieces with |.

And the above example still miss to solve some of the details like
cleaning the '.

Another option would be to return the tsvector to the client and
then build the tsquery there and send it back to the server.

I'm on 8.3 but I don't think it makes any real difference for this.

Sorry if I'm still missing the obvious.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

ranking how "similar" are tsvectors was: OR tsquery

From

Ivan Sergio Borgonovo

Date:

17 January 2010, 15:56:41

My initial request was about a way to build up a tsquery that was
made similar to what plainto_tsquery does but using | inspite of &
as a glue.

But at the end of the day I'd like to find similar tsvectors and
rank them.

I've a table containing several fields that contribute to build up a
weighted tsvector.

I'd like to pick up a tsvector and find which are the N most similar
ones.

I've found this:

http://domas.monkus.lt/document-similarity-postgresql

That's not really too far from what I was trying to do.

But I have precomputed tsvectors (I think turning text into a
tsvector should be a more expensive operation than string
replacement) and I'd like to conserve weights.

I'm not really sure but I think a lexeme can actually contain a '
or a space (depending on stemmer/parser?), so I'd have to take care
of escaping etc...

Since there is no direct access to the elements of a tsvector... the
only "correct" way I see to build the query would be to manually
rebuild the tsvector and getting back the result as a record using
ts_debug and ts_lexize... that looks a bit a PITA.

I don't even think that having direct access to elements of a
tsvector will completely solve the problem since tsvectors store
positions too, but it will be a step forward in making easier to
compare documents to find similar ones.
An operator that check the intersection of tsvectors would come
handy.
Adding a ts_rank(tsvector, tsvector) will surely help too.

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

Re: ranking how "similar" are tsvectors was: OR tsquery

From

Oleg Bartunov

Date:

17 January 2010, 16:20:13

Ivan,

You can write function to get lexemes from tsvector:

CREATE OR REPLACE FUNCTION ts_stat(tsvector, weights text, OUT word text, OUT ndoc
integer, OUT nentry integer)
RETURNS SETOF record AS
$$
     SELECT ts_stat('SELECT ' || quote_literal( $1::text ) || '::tsvector', quote_literal( $2::text) );
$$ LANGUAGE SQL RETURNS NULL ON NULL INPUT IMMUTABLE;

Then, you can create ARRAY like:

select ARRAY ( select (ts_stat(fts,'*')).word from papers where id=2);

Then, you will have two arrays and you're free to apply any similarity
function (cosine, jaccard,....) to calculate what do you want.
If you want to preserve weights, then use weight label instead of '*'.


Another idea is to use array_agg, but I'm not ready to discuss it.

Please, keep in mind, that document similarity is a hot topic in IR,
and, yes, I and Teodor have something about this, but code isn't available
for public. Unfortunately, we had no sponsor for full-text search for last
year and I see no perspectives this year, so we postpone our text-search
development.

Oleg

On Sun, 17 Jan 2010, Ivan Sergio Borgonovo wrote:

> My initial request was about a way to build up a tsquery that was
> made similar to what plainto_tsquery does but using | inspite of &
> as a glue.
>
> But at the end of the day I'd like to find similar tsvectors and
> rank them.
>
> I've a table containing several fields that contribute to build up a
> weighted tsvector.
>
> I'd like to pick up a tsvector and find which are the N most similar
> ones.
>
> I've found this:
>
> http://domas.monkus.lt/document-similarity-postgresql
>
> That's not really too far from what I was trying to do.
>
> But I have precomputed tsvectors (I think turning text into a
> tsvector should be a more expensive operation than string
> replacement) and I'd like to conserve weights.
>
> I'm not really sure but I think a lexeme can actually contain a '
> or a space (depending on stemmer/parser?), so I'd have to take care
> of escaping etc...
>
> Since there is no direct access to the elements of a tsvector... the
> only "correct" way I see to build the query would be to manually
> rebuild the tsvector and getting back the result as a record using
> ts_debug and ts_lexize... that looks a bit a PITA.
>
> I don't even think that having direct access to elements of a
> tsvector will completely solve the problem since tsvectors store
> positions too, but it will be a step forward in making easier to
> compare documents to find similar ones.
> An operator that check the intersection of tsvectors would come
> handy.
> Adding a ts_rank(tsvector, tsvector) will surely help too.
>
> thanks
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: ranking how "similar" are tsvectors was: OR tsquery

From

Ivan Sergio Borgonovo

Date:

17 January 2010, 17:28:16

On Sun, 17 Jan 2010 20:19:59 +0300 (MSK)
Oleg Bartunov <oleg@sai.msu.su> wrote:

> Ivan,
>
> You can write function to get lexemes from tsvector:

> CREATE OR REPLACE FUNCTION ts_stat(tsvector, weights text, OUT
> word text, OUT ndoc integer, OUT nentry integer)
> RETURNS SETOF record AS
> $$
>      SELECT ts_stat('SELECT ' || quote_literal( $1::text ) ||
> '::tsvector', quote_literal( $2::text) ); $$ LANGUAGE SQL RETURNS
> NULL ON NULL INPUT IMMUTABLE;

Thanks very much Oleg.

Still it is not really making the pain go away.
I've weights stored in my tsvector and I need to build the query
using them.

This means that if I have:
'aubergine':4A 'orange':1B 'banana':5A 'apple':3C
and
'coconut':3B 'bananas':1A 'tomatoes:2C
stored in a column (tsv) I really would like to build up the query:

to_tsquery('aubergine:A | orange:B | bananas:A | apple:C')

then

tsv
@@
to_tsquery('aubergine:A | orange:B | bananas:A | apple:C')

and relative ts_rank()

I'm aware that it is not symmetrical, but it looks as the cheapest
and fastest thing I can do right now.

I'm using pg_catalog.english. Am I supposing correctly that NO
lexeme will contain spaces?

If that is the case I could simply use string manipulation tools.
Not nice to see but it will work.

> Then, you can create ARRAY like:
>
> select ARRAY ( select (ts_stat(fts,'*')).word from papers where
> id=2);
>
> Then, you will have two arrays and you're free to apply any
> similarity function (cosine, jaccard,....) to calculate what do
> you want. If you want to preserve weights, then use weight label
> instead of '*'.

What ts_rank does is more than enough right now.

> Another idea is to use array_agg, but I'm not ready to discuss it.
>
> Please, keep in mind, that document similarity is a hot topic in

Not hard to imagine.

> IR, and, yes, I and Teodor have something about this, but code
> isn't available for public. Unfortunately, we had no sponsor for
> full-text search for last year and I see no perspectives this
> year, so we postpone our text-search development.

Good luck. Do you have anything like http://www.chipin.com/ for
small donations?

--
Ivan Sergio Borgonovo
http://www.webthatworks.it