Thread: tsearch2: plainto_tsquery() with OR?

tsearch2: plainto_tsquery() with OR?

From
cluster
Date:
I have some questions related to tsearch2:

1) Is
    ...WHERE rank(myTsVector, myTsQuery) > 0 ...
just as fast as
    ...WHERE myTsVector @@ myTsQuery...
?


2)) I will use plainto_tsquery() to parse search keys entered by a
website user to a tsquery. However, if only some of the entered keywords
does not exist in the searched tsvectors (but others do), I would still
like the search result to be "true".
plainto_tsquery() glues each keyword together with "&". I search for a
plainto_tsquery() that glues the keywords with an "|" (the OR operator).
In that way, not ALL keywords are required to exist in the tsvector in
order for the row to be returned,

Re: tsearch2: plainto_tsquery() with OR?

From
cluster
Date:
Does anyone know where I can request an OR-version of plainto_tsquery()?

I don't understand why it doesn't exist already: In most cases, when
using user entered keywords to search for, there should be returned some
  rows even though not ALL keywords are matched.

Re: tsearch2: plainto_tsquery() with OR?

From
Oleg Bartunov
Date:
On Wed, 8 Aug 2007, cluster wrote:

> Does anyone know where I can request an OR-version of plainto_tsquery()?

plainto_tsquery expects plain text, use to_tsquery for boolean operators.

>
> I don't understand why it doesn't exist already: In most cases, when using
> user entered keywords to search for, there should be returned some  rows even
> though not ALL keywords are matched.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsearch2: plainto_tsquery() with OR?

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
> On Wed, 8 Aug 2007, cluster wrote:
>> Does anyone know where I can request an OR-version of plainto_tsquery()?

> plainto_tsquery expects plain text, use to_tsquery for boolean operators.

Are either of these definitions really right?  If I type "foo bar baz"
into Google, for instance, it seems to produce some sort of weighted
result, neither a strict AND nor a strict OR.  Google didn't get where
they are by misjudging what the simplest search behavior should be like.

            regards, tom lane

Re: tsearch2: plainto_tsquery() with OR?

From
Oleg Bartunov
Date:
On Thu, 9 Aug 2007, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
>> On Wed, 8 Aug 2007, cluster wrote:
>>> Does anyone know where I can request an OR-version of plainto_tsquery()?
>
>> plainto_tsquery expects plain text, use to_tsquery for boolean operators.
>
> Are either of these definitions really right?  If I type "foo bar baz"
> into Google, for instance, it seems to produce some sort of weighted
> result, neither a strict AND nor a strict OR.  Google didn't get where
> they are by misjudging what the simplest search behavior should be like.

we provide strict basic query language via to_tsquery(), which could be
a foundation for different ql. We need consensus here and we leave it for
future. Someone could write google like ql, but I didn't see any description
what exactly google does. Currently, people write their own search wrappers
which implement google-like ql.

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsearch2: plainto_tsquery() with OR?

From
Oleg Bartunov
Date:
On Thu, 9 Aug 2007, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
>> On Thu, 9 Aug 2007, Tom Lane wrote:
>>> Are either of these definitions really right?  If I type "foo bar baz"
>>> into Google, for instance, it seems to produce some sort of weighted
>>> result, neither a strict AND nor a strict OR.  Google didn't get where
>>> they are by misjudging what the simplest search behavior should be like.
>
>> we provide strict basic query language via to_tsquery(), which could be
>> a foundation for different ql. We need consensus here and we leave it for
>> future.
>
> Since we're about to push tsearch into core, I'm not very happy with a
> "leave it to the future" approach.  We need to get the API right *now*.

API is available right now - to_tsquery realized it. You asked about
google-like API, which I dont' know exact description,
" neither a strict AND nor a strict OR" is not a good foundation for
database text search API.
We intentionally realized strict programming-like query language, since
in database search we need sort of exhaustive search, not just first
1000 results in some intuitive order, depending on the day of week
(do you know google is best at sunday night :).


>
> I think that a function defined as "take these words and do an
> appropriate weighted search with them" doesn't necessarily have to
> specify what the weighting is.  But if the definition is "find the AND
> of these words", you can't fudge that later on.

What do you mean "appropriate weighted search" ? Does it mean just
use AND by default ?
     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsearch2: plainto_tsquery() with OR?

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
> " neither a strict AND nor a strict OR" is not a good foundation for
> database text search API.

Maybe not, but the Google boys have sure done well without telling
anyone what their algorithms are.

My feeling is that if you use an API that involves explicit AND and OR
operators (to_tsquery does this if I'm not mistaken) then you should
get a result that matches those semantics exactly.  But the other
behavior that people want is "here's some words, get me a weighted
result", and if the weighting improves from time to time that's OK.
We need to provide that API too.

Whether plainto_tsquery() should be defined that way, I'm not sure.
Maybe there's enough historical behavior behind it that we should
stick with defining it as "strict AND of these words".  But if so,
I want another function that has a fuzzier weighted definition,
because I think that'll be what most applications actually want.

The OP was asking for a version that has a strict OR behavior.
I'm not sure if that's really interesting or not ...

            regards, tom lane

Re: tsearch2: plainto_tsquery() with OR?

From
Oleg Bartunov
Date:
On Thu, 9 Aug 2007, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
>> " neither a strict AND nor a strict OR" is not a good foundation for
>> database text search API.
>
> Maybe not, but the Google boys have sure done well without telling
> anyone what their algorithms are.
>
> My feeling is that if you use an API that involves explicit AND and OR
> operators (to_tsquery does this if I'm not mistaken) then you should
> get a result that matches those semantics exactly.  But the other

right, to_tsquery does exact and predicted matching.

> behavior that people want is "here's some words, get me a weighted
> result", and if the weighting improves from time to time that's OK.
> We need to provide that API too.

I think I understand. It's called non-exact (approximate) matching,
when by default you search documents containing ALL words in query, weighted
in usual way, and then append documents, which contains either words in query
weighted by the number of words found. It's useful for rare words.

>
> Whether plainto_tsquery() should be defined that way, I'm not sure.
> Maybe there's enough historical behavior behind it that we should
> stick with defining it as "strict AND of these words".  But if so,

yeah, this was requested by people to have simple search interface.

> I want another function that has a fuzzier weighted definition,
> because I think that'll be what most applications actually want.

Tom, approximate search is a good challenge, it's a subject of research of
people from IR world. It's what I always wanted to do, but we have to
support our families and no company was interested in such kind of search,
unfortunately.

>
> The OP was asking for a version that has a strict OR behavior.
> I'm not sure if that's really interesting or not ...

I understood now, what he wanted. He don't want to parse search query,
he wanted simplicity of plainto_tsquery() and ability for boolean search
of to_tsquery(). As I already said, people have their own query language
wrappers on top of to_tsquery. Unfortunately, we have no consensus here.

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsearch2: plainto_tsquery() with OR?

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
> On Thu, 9 Aug 2007, Tom Lane wrote:
>> Are either of these definitions really right?  If I type "foo bar baz"
>> into Google, for instance, it seems to produce some sort of weighted
>> result, neither a strict AND nor a strict OR.  Google didn't get where
>> they are by misjudging what the simplest search behavior should be like.

> we provide strict basic query language via to_tsquery(), which could be
> a foundation for different ql. We need consensus here and we leave it for
> future.

Since we're about to push tsearch into core, I'm not very happy with a
"leave it to the future" approach.  We need to get the API right *now*.

I think that a function defined as "take these words and do an
appropriate weighted search with them" doesn't necessarily have to
specify what the weighting is.  But if the definition is "find the AND
of these words", you can't fudge that later on.

            regards, tom lane

Re: tsearch2: plainto_tsquery() with OR?

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
> On Thu, 9 Aug 2007, Tom Lane wrote:
>> ... behavior that people want is "here's some words, get me a weighted
>> result", and if the weighting improves from time to time that's OK.
>> We need to provide that API too.

> I think I understand. It's called non-exact (approximate) matching,

Right, exactly.  I don't claim that we have a perfect solution for
approximate matching; we don't.  But I think we should provide an
API function that is defined to do approximate matching, with the
understanding that the details of its behavior will change (for the
better hopefully) from release to release.  The first version might
not be very good, but we can improve it.

            regards, tom lane

Re: tsearch2: plainto_tsquery() with OR?

From
"Trevor Talbot"
Date:
On 8/8/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Oleg Bartunov <oleg@sai.msu.su> writes:
> > On Wed, 8 Aug 2007, cluster wrote:
> >> Does anyone know where I can request an OR-version of plainto_tsquery()?
>
> > plainto_tsquery expects plain text, use to_tsquery for boolean operators.
>
> Are either of these definitions really right?  If I type "foo bar baz"
> into Google, for instance, it seems to produce some sort of weighted
> result, neither a strict AND nor a strict OR.  Google didn't get where
> they are by misjudging what the simplest search behavior should be like.

As far as I'm aware Google is normally AND -- the catch is that it
doesn't always use keywords from the page itself.  Sometimes it'll
look for search terms that appear in pages that link to the returned
one; the cached version will notify you of this in the header.
Ranking seems to make words weighted by order and proximity, but of
course Google's full ranking behavior is another matter...

http://www.google.com/intl/en/help/basics.html#and
http://www.google.com/help/cheatsheet.html

For me personally, I expect to need a search interface that accepts
"AND", "OR", and "NOT" as boolean op words, and possibly parenthetical
grouping (IOW, everything to_tsquery supports in plain english form),
with freeform words defaulting to AND.  Is this the same thing most
people need?  I doubt it.

Re: tsearch2: plainto_tsquery() with OR?

From
Oleg Bartunov
Date:
On Thu, 9 Aug 2007, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
>> On Thu, 9 Aug 2007, Tom Lane wrote:
>>> ... behavior that people want is "here's some words, get me a weighted
>>> result", and if the weighting improves from time to time that's OK.
>>> We need to provide that API too.
>
>> I think I understand. It's called non-exact (approximate) matching,
>
> Right, exactly.  I don't claim that we have a perfect solution for
> approximate matching; we don't.  But I think we should provide an
> API function that is defined to do approximate matching, with the
> understanding that the details of its behavior will change (for the
> better hopefully) from release to release.  The first version might
> not be very good, but we can improve it.

Aha, never thought you'd interested in something non-exact :)
What'd be a name of such function ?

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsearch2: plainto_tsquery() with OR?

From
cluster
Date:
Thanks for your response! Let me try to elaborate what I meant with my
original post.

If R is the set of words in the tsvector for a given table row and S is
the set of keywords to search for (entered by e.g. a website user) I
would like to receive all rows for which the intersection between R and
S is nonempty. That is: The row should be return if just there is SOME
match. S does not necessarily need to be a subset of R.

Furthermore I would like a measure for how "nonempty" the intersection
is (we would call this measure "the rank").
Example:
For R = "three big houses" and S = "three small houses" the rank should
be higher than for R = "three big houses" and S = "four small houses" as
the first case has two words in common while the second case has only one.

A version of plainto_tsquery() with a simple OR operator instead of AND
would solve this problem somewhat elegant:
1) I can now use the conventional "tsvector @@ tsquery" syntax in my
WHERE clause as the "@@" operator will return true and thus include the
row in the result. Example:
   select to_tsvector('simple', 'three small houses')
          @@ 'four|big|houses'::tsquery;
would return "true".

2) The rank() of the @@ operator is automatically higher when there is a
good match.


An example where this OR-version of plainto_tsquery() could be useful is
for websites using tags. Each website entry is associated with some tags
and each user has defined some "tags of interest". The search should
then return all website entries where there is a match (not necessarily
complete) with the users tags of interest. Of course the best matching
entries should be displayed top most.


I find it important that this function is a part of tsearch2 itself as:
1) The user can input arbitrary data. Also potentially harmful data if
they are not escaped right.
2) Special characters should be stripped in just the same way as
to_tsvector() does it. E.g. stripping the dot in "Hi . there" but
keeping it in "web 2.0". Only tsearch2 can do that in a clean consistent
way - it would be fairly messy if some thirdparty or especially some
website-developer-homecooked stripping functionality is used for this.

Re: tsearch2: plainto_tsquery() with OR?

From
Ron Mayer
Date:
Tom Lane wrote:
> Oleg Bartunov <oleg@sai.msu.su> writes:
>> On Wed, 8 Aug 2007, cluster wrote:
>>> Does anyone know where I can request an OR-version of plainto_tsquery()?
>
>> plainto_tsquery expects plain text, use to_tsquery for boolean operators.
>
> Are either of these definitions really right?  If I type "foo bar baz"
> into Google, for instance, it seems to produce some sort of weighted
> result, neither a strict AND nor a strict OR.  Google didn't get where
> they are by misjudging what the simplest search behavior should be like.


For what it's worth, Google states [1]

   Automatic "and" queries

   By default, Google only returns pages that include
   all of your search terms. There is no need to
   include "and" between terms. Keep in mind that
   the order in which the terms are typed will affect
   the search results. To restrict a search further,
   just include more terms. For example, to plan a
   vacation to Hawaii, simply type vacation hawaii.

and also describes "OPERATOR EXAMPLE...vacation hawaii"
as "FINDS PAGES CONTAINING...the words vacation and Hawaii".

If I'm not mistaken, it sounds the same as what tsearch describes.


[1] http://www.google.com/intl/en/help/basics.html#and
[2] http://www.google.com/intl/en/help/cheatsheet.html


Re: tsearch2: plainto_tsquery() with OR?

From
"Mike Rylander"
Date:
On 8/9/07, cluster <skrald@amossen.dk> wrote:
> Thanks for your response! Let me try to elaborate what I meant with my
> original post.
>
> If R is the set of words in the tsvector for a given table row and S is
> the set of keywords to search for (entered by e.g. a website user) I
> would like to receive all rows for which the intersection between R and
> S is nonempty. That is: The row should be return if just there is SOME
> match. S does not necessarily need to be a subset of R.
>

You could just wrap up a simple query in an SQL function called
plainto_or_tsquery or the like.

CREATE OR REPLACE FUNCTION plainto_or_tsquery (TEXT) RETURNS tsquery AS $$
  SELECT to_tsquery( regexp_replace( $1, E'[\\s\'|:&()!]+','|','g') );
$$ LANGUAGE SQL STRICT IMMUTABLE;

Paste this into a PG database that has tsearch2 loaded (after creating
the above function, of course):

select
  rank_cd(to_tsvector('hi . there web 2.0'), plainto_or_tsquery('hello
. web 2.0')),
  to_tsvector('hi . there web 2.0') @@ plainto_or_tsquery('hello . web
2.0') as matches;

> Furthermore I would like a measure for how "nonempty" the intersection
> is (we would call this measure "the rank").
> Example:
> For R = "three big houses" and S = "three small houses" the rank should
> be higher than for R = "three big houses" and S = "four small houses" as
> the first case has two words in common while the second case has only one.

Both rank() and rank_cd() work fine for all-ORed queries, full match
or otherwise.  The more "matchy", the better the rank.

>
> A version of plainto_tsquery() with a simple OR operator instead of AND
> would solve this problem somewhat elegant:
> 1) I can now use the conventional "tsvector @@ tsquery" syntax in my
> WHERE clause as the "@@" operator will return true and thus include the
> row in the result. Example:
>    select to_tsvector('simple', 'three small houses')
>           @@ 'four|big|houses'::tsquery;
> would return "true".
>

Um... it does.

forge=# select to_tsvector('simple', 'three small houses') @@
'four|big|houses'::tsquery;
 ?column?
----------
 t
(1 row)

> 2) The rank() of the @@ operator is automatically higher when there is a
> good match.
>

Again, that's already the case.

forge=# select rank(to_tsvector('hi . there web 2.0'),
plainto_or_tsquery('hello . web 2.0')), rank(to_tsvector('hi . there
web 2.0'), plainto_or_tsquery('hi . web 2.0'));
   rank    |   rank
-----------+-----------
 0.0405285 | 0.0607927
(1 row)

The second is a better match; "hi" vs "hello" in the queries.

>
> An example where this OR-version of plainto_tsquery() could be useful is
> for websites using tags. Each website entry is associated with some tags
> and each user has defined some "tags of interest". The search should
> then return all website entries where there is a match (not necessarily
> complete) with the users tags of interest. Of course the best matching
> entries should be displayed top most.
>

See above.  Though, again, you'd need to put in a little work to make
sure everything is completely protected.  Probably less time than it's
take you to discuss this so far, though.  And you'd want to create a
2-param version that could accept the correct tsearch2 config.

>
> I find it important that this function is a part of tsearch2 itself as:
> 1) The user can input arbitrary data. Also potentially harmful data if
> they are not escaped right.

That's not tsearch2's problem in particular.  You should be using
parameterized queries in your app (or applying the correct quoting
functions) for all data, not just directly user supplied strings.  All
data is user supplied at some level.

> 2) Special characters should be stripped in just the same way as
> to_tsvector() does it. E.g. stripping the dot in "Hi . there" but
> keeping it in "web 2.0". Only tsearch2 can do that in a clean consistent
> way - it would be fairly messy if some thirdparty or especially some
> website-developer-homecooked stripping functionality is used for this.
>

The simple example above uses to_tsquery to do that.  Try it out, and
if  you improve it please feel free to share.

On a more general note, IMO this should not be getting in the way of
integrating tsearch2 into core.  The example above shows how trivial
it is to do the "simple" thing, but that's probably not going to be
the "right" thing.  Any time you find yourself forcing your user to is
"x OR y OR z" to get a result is a time you should probably be
augmenting your tsvectors (or tsquerys now, with query rewriting) with
thesauri.

BTW, Google /does/ just AND everything.  They just don't tell you
about everything they're ANDing.

All this is, of course, disregarding the actual utility of an all-OR
query (not much in practice, IME) and the speed of such queries on
non-trivial datasets (not good -- essentially a scan per ORed
component)).

Anyway, I hope that helps...

--miker