Thread: Compound words giving undesirable results with tsearch2

Compound words giving undesirable results with tsearch2

From

Lars Haugseth

Date:

30 May 2006, 10:40:04

I've set up a database using tsearch2, configured with support for compound
words according to the excellent guide found here:

 http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_compound_words

This works fine. There is, however, one drawback, and I'd like to know
whether it can be remedied. Let's say I want to search for records containing
the word 'fritekst', which is a compound Norwegian word meaning
'free text'.

testdb=# select to_tsquery('default_norwegian', 'fritekst');
          to_tsquery
------------------------------
 'fritekst' | 'fri' & 'tekst'
(1 row)

Now, this will indeed match those records, but it will also match any
records containing both of the words 'fri' and 'tekst', without regard
to whether they are next to each other or in completely different parts
of the text being indexed. In many situations, this will lead to a lot
of 'false' matches, seen from a user perspective.

Ideas on how to handle this problem will be much appreciated.
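
To make the issue concrete, here is a minimal Python sketch of the boolean
matching described above. It is a toy model, not tsearch2 code: the `matches`
helper and the sample records are assumptions made purely for illustration,
and it assumes the query is evaluated against the set of lexemes in a record,
with word positions ignored.

```python
# Toy model of the query 'fritekst' | 'fri' & 'tekst'.
# Assumption: a record is reduced to a set of lexemes, so positions are lost.

def matches(lexemes):
    """Return True if the lexeme set satisfies 'fritekst' | ('fri' & 'tekst')."""
    return "fritekst" in lexemes or ("fri" in lexemes and "tekst" in lexemes)

# Hypothetical records, already split into lexemes.
compound_record = {"søk", "i", "fritekst"}
scattered_record = {"fri", "adgang", "for", "alle", "lang", "tekst"}
unrelated_record = {"fri", "adgang"}

print(matches(compound_record))   # True: contains the compound itself
print(matches(scattered_record))  # True: 'fri' and 'tekst' both occur, however far apart
print(matches(unrelated_record))  # False: 'tekst' is missing
```

In this model the second record is the kind of 'false' match in question: it
satisfies the query even though the two words are unrelated in the text.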

--
Lars Haugseth

"If anyone disagrees with anything I say, I am quite prepared not only to
 retract it, but also to deny under oath that I ever said it." -Tom Lehrer

Re: Compound words giving undesirable results with tsearch2

From

Oleg Bartunov

Date:

30 May 2006, 11:11:35

On Tue, 30 May 2006, Lars Haugseth wrote:

> I've setup a database using tsearch2, configured with support for compound
> words according to the excellent guide found here:
>
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_compound_words
>
> This works fine. There is however one drawback that I'd like to know
> whether can be remedied. Let's say I want to search for records containing
> the word 'fritekst', which is a compound Norwegian word meaning
> 'free text'.
>
> testdb=# select to_tsquery('default_norwegian', 'fritekst');
>          to_tsquery
> ------------------------------
> 'fritekst' | 'fri' & 'tekst'
> (1 row)
>
> Now, this will indeed match those records, but it will also match any
> records containing both of the words 'fri' and 'tekst', without regard
> to whether they are next to each other or in completely different parts
> of the text being indexed. In many situations, this will lead to a lot
> of 'false' matches, seen from a user perspective.
>
> Ideas on how to handle this problem will be much appreciated.

This is where ordering by relevance should help.


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Compound words giving undesirable results with tsearch2

From

Teodor Sigaev

Date:

30 May 2006, 15:22:43

> testdb=# select to_tsquery('default_norwegian', 'fritekst');
>           to_tsquery
> ------------------------------
>  'fritekst' | 'fri' & 'tekst'
> (1 row)
>
> Now, this will indeed match those records, but it will also match any
> records containing both of the words 'fri' and 'tekst', without regard
> to whether they are next to each other or in completely different parts
> of the text being indexed. In many situations, this will lead to a lot
> of 'false' matches, seen from a user perspective.

This is a deliberate feature. Here is an excerpt from an email from one of our Norwegian customers:

<quotation>
Let us take the compound 'fotballbane'. (Soccer field)
Split : 'fotball' 'fot' 'ball' 'bane'

Example record : "Vedlikehold av baner for fotballklubber"
(Literal translation : "Maintenance of fields for soccer clubs")

The search for 'fotballbane' ('fotballbane' & 'fotball' & 'fot' &
'ball') will not match, even though the record is precisely about this
sort of thing. 'fotballbane' | ('fotball' & 'bane') | ('fot' & 'ball' &
'bane') will match.
</quotation>

So all the possible ways of splitting a compound word are joined with OR, and
the words within one variant are joined with AND.
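
The rule above can be sketched in a few lines of Python. This is an
illustrative model only, not how tsearch2 is implemented: `build_query`,
`matches`, and the lexeme set assumed for the example record are all
hypothetical.

```python
# Toy model: each way of splitting a compound is a "variant" (a list of parts).
# Variants are OR-ed together; the parts inside one variant are AND-ed.

def build_query(variants):
    """Render the variants in tsquery-like notation."""
    groups = []
    for parts in variants:
        text = " & ".join(f"'{p}'" for p in parts)
        groups.append(f"({text})" if len(parts) > 1 else text)
    return " | ".join(groups)

def matches(variants, lexemes):
    """True if at least one variant has all of its parts in the lexeme set."""
    return any(all(p in lexemes for p in parts) for parts in variants)

variants = [["fotballbane"], ["fotball", "bane"], ["fot", "ball", "bane"]]

# Assumed lexemes for "Vedlikehold av baner for fotballklubber"
# (what a real dictionary would produce may differ).
record = {"vedlikehold", "bane", "fotballklubb", "fotball", "klubb", "fot", "ball"}

print(build_query(variants))
# 'fotballbane' | ('fotball' & 'bane') | ('fot' & 'ball' & 'bane')

print(matches(variants, record))  # True: the ('fotball' & 'bane') variant is satisfied

# Requiring every part at once fails, because 'fotballbane' itself is absent.
all_parts = [["fotballbane", "fotball", "fot", "ball"]]
print(matches(all_parts, record))  # False
```

The same model also shows the cost of this design: any record that happens to
contain all the parts of one variant matches, wherever those parts occur.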

If that isn't desirable, you can disable word splitting in ispell (just comment
out the z flag) or use a different tsearch configuration for searching.


--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Compound words giving undesirable results with tsearch2

From

Lars Haugseth

Date:

31 May 2006, 03:39:51

* oleg@sai.msu.su (Oleg Bartunov) wrote:
|
| On Tue, 30 May 2006, Lars Haugseth wrote:
|
| > I've setup a database using tsearch2, configured with support for compound
| > words according to the excellent guide found here:
| >
| > http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_compound_words
| >
| > This works fine. There is however one drawback that I'd like to know
| > whether can be remedied. Let's say I want to search for records containing
| > the word 'fritekst', which is a compound Norwegian word meaning
| > 'free text'.
| >
| > testdb=# select to_tsquery('default_norwegian', 'fritekst');
| >          to_tsquery
| > ------------------------------
| > 'fritekst' | 'fri' & 'tekst'
| > (1 row)
| >
| > Now, this will indeed match those records, but it will also match any
| > records containing both of the words 'fri' and 'tekst', without regard
| > to whether they are next to each other or in completely different parts
| > of the text being indexed. In many situations, this will lead to a lot
| > of 'false' matches, seen from a user perspective.
| >
| > Ideas on how to handle this problem will be much appreciated.
|
| this is where order by relevance should helps.

Thank you for pointing me to this, I hadn't thought about that.

However, my first try with the rank_cd() function does not quite
produce the results I had expected:

 SELECT set_curcfg('default_norwegian');

 SELECT id, rank_cd(n, mytscol, to_tsquery('fritekst')) AS rank
   FROM mytable
  WHERE mytscol @@ to_tsquery('fritekst')
  ORDER BY rank DESC;

No matter what value I use for n here, a record where the compound word
'fritekst' appears gets a rank of 0, whereas records where the words
'fri' and 'tekst' appear separately all get a rank > 0: the closer
together they are, the higher the rank.

If I try to set the value of n to 0, I still get a rank of 0 for a
record containing 'fritekst', and 1 for all records containing 'fri'
and 'tekst'.

When using the rank() function instead of rank_cd() in the query above,
records with the word 'fritekst' seem to score better, but I still get
higher ranks for some records containing the separate words and not the
compound word.

--
Lars Haugseth