Re: tsearch2: plainto_tsquery() with OR? - Mailing list pgsql-general

From cluster
Subject Re: tsearch2: plainto_tsquery() with OR?
Date
Msg-id 46BADADE.1080901@amossen.dk
In response to Re: tsearch2: plainto_tsquery() with OR?  (Oleg Bartunov <oleg@sai.msu.su>)
Responses Re: tsearch2: plainto_tsquery() with OR?  ("Mike Rylander" <mrylander@gmail.com>)
List pgsql-general
Thanks for your response! Let me try to elaborate on what I meant in my
original post.

If R is the set of words in the tsvector for a given table row and S is
the set of keywords to search for (entered by e.g. a website user) I
would like to receive all rows for which the intersection between R and
S is nonempty. That is: the row should be returned as long as there is
SOME match. S does not need to be a subset of R.

Furthermore I would like a measure for how "nonempty" the intersection
is (we would call this measure "the rank").
Example:
For R = "three big houses" and S = "three small houses" the rank should
be higher than for R = "three big houses" and S = "four small houses" as
the first case has two words in common while the second case has only one.
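
The rule above can be modelled with plain set operations. The following
Python sketch only illustrates the desired matching and ranking semantics;
it is not how tsearch2 computes rank(). The names overlap_rank and matches,
and the naive whitespace tokenization, are assumptions made for the example.

```python
def overlap_rank(row_text: str, keywords: str) -> int:
    """Return the number of distinct words shared by the row and the keywords."""
    r = set(row_text.lower().split())   # R: words in the row (naive whitespace split)
    s = set(keywords.lower().split())   # S: keywords entered by the user
    return len(r & s)


def matches(row_text: str, keywords: str) -> bool:
    """A row matches as soon as the intersection of R and S is nonempty."""
    return overlap_rank(row_text, keywords) > 0


# The two cases from the example: two shared words versus one.
print(overlap_rank("three big houses", "three small houses"))
print(overlap_rank("three big houses", "four small houses"))
```

Both rows match in this model, but the first gets the higher rank, which
is the ordering asked for.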

A version of plainto_tsquery() with a simple OR operator instead of AND
would solve this problem somewhat elegantly:
1) I could use the conventional "tsvector @@ tsquery" syntax in my
WHERE clause, as the "@@" operator would return true on any match and
thus include the row in the result. Example:
   select to_tsvector('simple', 'three small houses')
          @@ 'four|big|houses'::tsquery;
would return "true".

2) The rank() of the tsvector against such a tsquery is automatically
higher the better the match is.


An example where this OR-version of plainto_tsquery() could be useful is
for websites using tags. Each website entry is associated with some tags
and each user has defined some "tags of interest". The search should
then return all website entries where there is a match (not necessarily
complete) with the user's tags of interest. Of course the best matching
entries should be displayed first.
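
As a rough illustration of the tag scenario, the sketch below keeps the
entries that share at least one tag with the user and orders them by the
number of shared tags. The entry names, the tags and the helper
best_matches are hypothetical; in practice the matching and ranking would
be left to tsearch2.

```python
from __future__ import annotations


def best_matches(entries: dict[str, set[str]], interests: set[str]) -> list[str]:
    """Return names of entries sharing at least one tag, best match first."""
    scored = [(len(tags & interests), name) for name, tags in entries.items()]
    # Keep only entries with a nonempty intersection, then sort by
    # descending overlap; ties are broken alphabetically by name.
    hits = [(score, name) for score, name in scored if score > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in hits]


# Hypothetical data: three website entries and one user's tags of interest.
entries = {
    "entry-a": {"postgresql", "search", "tsearch2"},
    "entry-b": {"cooking", "search"},
    "entry-c": {"travel"},
}
print(best_matches(entries, {"search", "tsearch2", "python"}))
```

Here "entry-c" is dropped because it shares no tag with the user, even
though the user's tags are not a subset of any single entry's tags.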


I find it important that this function is a part of tsearch2 itself because:
1) The user can input arbitrary data, including potentially harmful data
if it is not escaped correctly.
2) Special characters should be stripped in exactly the same way as
to_tsvector() does it. E.g. stripping the dot in "Hi . there" but
keeping it in "web 2.0". Only tsearch2 can do that in a clean, consistent
way - it would be fairly messy if some third-party or, especially, some
home-cooked stripping code written by a website developer were used for this.
