Thread: Compound words giving undesirable results with tsearch2
I've set up a database using tsearch2, configured with support for compound
words according to the excellent guide found here:

http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_compound_words

This works fine. There is, however, one drawback, and I'd like to know
whether it can be remedied. Let's say I want to search for records containing
the word 'fritekst', which is a compound Norwegian word meaning
'free text'.

    testdb=# select to_tsquery('default_norwegian', 'fritekst');
              to_tsquery
    ------------------------------
     'fritekst' | 'fri' & 'tekst'
    (1 row)

Now, this will indeed match those records, but it will also match any
records containing both of the words 'fri' and 'tekst', regardless of
whether they are next to each other or in completely different parts
of the text being indexed. In many situations, this will lead to a lot
of 'false' matches, as seen from a user's perspective.

Ideas on how to handle this problem will be much appreciated.

-- 
Lars Haugseth

"If anyone disagrees with anything I say, I am quite prepared not only to
retract it, but also to deny under oath that I ever said it." -Tom Lehrer
On Tue, 30 May 2006, Lars Haugseth wrote:

> I've setup a database using tsearch2, configured with support for compound
> words according to the excellent guide found here:
>
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_compound_words
>
> This works fine. There is however one drawback that I'd like to know
> whether can be remedied. Let's say I want to search for records containing
> the word 'fritekst', which is a compound Norwegian word meaning
> 'free text'.
>
> testdb=# select to_tsquery('default_norwegian', 'fritekst');
>           to_tsquery
> ------------------------------
>  'fritekst' | 'fri' & 'tekst'
> (1 row)
>
> Now, this will indeed match those records, but it will also match any
> records containing both of the words 'fri' and 'tekst', without regard
> to whether they are next to each other or in completely different parts
> of the text being indexed. In many situations, this will lead to a lot
> of 'false' matches, seen from a user perspective.
>
> Ideas on how to handle this problem will be much appreciated.

This is where ordering by relevance should help.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
> testdb=# select to_tsquery('default_norwegian', 'fritekst');
>           to_tsquery
> ------------------------------
>  'fritekst' | 'fri' & 'tekst'
> (1 row)
>
> Now, this will indeed match those records, but it will also match any
> records containing both of the words 'fri' and 'tekst', without regard
> to whether they are next to each other or in completely different parts
> of the text being indexed. In many situations, this will lead to a lot
> of 'false' matches, seen from a user perspective.

It's a special feature (an excerpt from a mail from our Norwegian customer):

<quotation>
Let us take the compound 'fotballbane'. (Soccer field)

Split: 'fotball' 'fot' 'ball' 'bane'

Example record: "Vedlikehold av baner for fotballklubber"
(Literal translation: "Maintenance of fields for soccer clubs")

The search for 'fotballbane' ('fotballbane' & 'fotball' & 'fot' & 'ball')
will not match, even though the record is precisely about this sort of thing.

'fotballbane' | ('fotball' & 'bane') | ('fot' & 'ball' & 'bane')
will match.
</quotation>

So, all variants of splitting a compound word are joined with OR, and the
words within one variant are joined with AND. If that isn't desirable, you
can forbid word splitting in ispell (just comment out the z flag) or use a
different tsearch configuration for searching.

-- 
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
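The rule described above (split variants joined with OR, the parts of one variant joined with AND) can be sketched in a few lines of Python. This is only an illustration of the boolean logic, not tsearch2's implementation: the helper names `build_compound_query` and `matches` are made up, the split variants are supplied by hand rather than produced by ispell, and the lexeme set for the example record is an assumption about what the dictionary would emit.

```python
def build_compound_query(word, splits):
    """Return a tsquery-like string: the word itself OR each split variant.

    `splits` is a list of variants; each variant is a list of parts.
    Multi-part variants are parenthesised and their parts joined with '&'.
    """
    variants = [f"'{word}'"]
    for parts in splits:
        joined = " & ".join(f"'{p}'" for p in parts)
        variants.append(f"({joined})" if len(parts) > 1 else joined)
    return " | ".join(variants)


def matches(word, splits, lexemes):
    """True if the lexeme set contains the word or all parts of any variant."""
    lexemes = set(lexemes)
    if word in lexemes:
        return True
    # Empty variants are skipped so that all([]) cannot produce a false match.
    return any(parts and all(p in lexemes for p in parts) for parts in splits)


# Split variants taken from the 'fotballbane' example in the message above.
fotballbane_splits = [["fotball", "bane"], ["fot", "ball", "bane"]]
print(build_compound_query("fotballbane", fotballbane_splits))

# Assumed lexemes for "Vedlikehold av baner for fotballklubber".
record = {"vedlikehold", "bane", "fotballklubb", "fotball", "klubb", "fot", "ball"}
print(matches("fotballbane", fotballbane_splits, record))

# The original complaint: 'fri' and 'tekst' anywhere in the text also match.
print(matches("fritekst", [["fri", "tekst"]], {"fri", "annet", "tekst"}))
```

Because the check only looks at set membership, it says nothing about where the parts occur in the text, which is exactly why the separated 'fri' and 'tekst' case matches.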
* oleg@sai.msu.su (Oleg Bartunov) wrote:
|
| On Tue, 30 May 2006, Lars Haugseth wrote:
|
| > I've setup a database using tsearch2, configured with support for compound
| > words according to the excellent guide found here:
| >
| > http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_compound_words
| >
| > This works fine. There is however one drawback that I'd like to know
| > whether can be remedied. Let's say I want to search for records containing
| > the word 'fritekst', which is a compound Norwegian word meaning
| > 'free text'.
| >
| > testdb=# select to_tsquery('default_norwegian', 'fritekst');
| >           to_tsquery
| > ------------------------------
| >  'fritekst' | 'fri' & 'tekst'
| > (1 row)
| >
| > Now, this will indeed match those records, but it will also match any
| > records containing both of the words 'fri' and 'tekst', without regard
| > to whether they are next to each other or in completely different parts
| > of the text being indexed. In many situations, this will lead to a lot
| > of 'false' matches, seen from a user perspective.
| >
| > Ideas on how to handle this problem will be much appreciated.
|
| this is where order by relevance should helps.

Thank you for pointing me to this; I hadn't thought about that.

However, my first try with the rank_cd() function does not quite produce
the results I had expected:

    SELECT set_curcfg('default_norwegian');

    SELECT id, rank_cd(n, mytscol, to_tsquery('fritekst')) AS rank
      FROM mytable
     WHERE mytscol @@ to_tsquery('fritekst')
     ORDER BY rank DESC;

No matter what value I use for n here, a record where the compound word
'fritekst' appears gets a rank of 0, whereas records where the words 'fri'
and 'tekst' appear separately all get a rank > 0: the closer together the
words are, the higher the rank. If I set the value of n to 0, I still get
a rank of 0 for a record containing 'fritekst', and 1 for all records
containing 'fri' and 'tekst'.
When using the rank() function instead of rank_cd() in the query above, records with the word 'fritekst' seem to score better, but I still get higher ranks for some records containing the separate words and not the compound word. -- Lars Haugseth "If anyone disagrees with anything I say, I am quite prepared not only to retract it, but also to deny under oath that I ever said it." -Tom Lehrer