Re: [tsvector] to_tsvector called multiple times - Mailing list pgsql-general

From Sven R. Kunze
Subject Re: [tsvector] to_tsvector called multiple times
Msg-id 5564413F.605@tbz-pariv.de
In response to Re: [tsvector] to_tsvector called multiple times  (Albe Laurenz <laurenz.albe@wien.gv.at>)
Responses Re: [tsvector] to_tsvector called multiple times
List pgsql-general
Thanks Albe for that detailed answer.

On 26.05.2015 11:01, Albe Laurenz wrote:
> Sven R. Kunze wrote:
>> the following stemming results made me curious:
>>
>> select to_tsvector('german', 'systeme'); > 'system':1
>> select to_tsvector('german', 'systemes'); > 'system':1
>> select to_tsvector('german', 'systems'); > 'system':1
>> select to_tsvector('german', 'systemen'); > 'system':1
>> select to_tsvector('german', 'system'); >  'syst':1
>>
>>
>> First of all, this seems to be a bug in the German stemmer. Where can I
>> fix it?
> As far as I understand, the stemmer is not perfect, it is just a "best
> effort" at German stemming.  It does not have a dictionary of valid German
> words, but uses an algorithm based on only the occurring letters.
>
> This web page describes the algorithm:
> http://snowball.tartarus.org/algorithms/german/stemmer.html
> I guess that the Snowball folks (and PostgreSQL) would be interested
> if you could come up with a better algorithm.

Thanks for that hint. I will go to
https://github.com/snowballstem/snowball/issues and try to explain my
problem there.

However, are you sure that I am using Snowball? Maybe I am reading the
documentation wrong:
http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html
but it seems to depend on which packages (ispell, hunspell, myspell,
snowball + corresponding languages) are installed on my system.

Is there an easy way to determine which of these packages PostgreSQL
uses AND what for?
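
A minimal sketch of how this might be checked from within psql, assuming
the built-in 'german' configuration is the one in use (the catalog query
and the test word are only my illustration, not taken from the
documentation page above). ts_debug reports which dictionaries are
consulted for a token, and the pg_ts_config_map catalog lists the
dictionaries assigned to each token type:

```sql
-- Which dictionaries does the 'german' configuration consult for this
-- token, and which one actually produced the lexeme?
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('german', 'system');

-- Dictionaries mapped to each token type of the 'german' configuration,
-- together with the template (snowball, ispell, simple, ...) behind each.
SELECT c.cfgname,
       m.maptokentype,
       m.mapseqno,
       d.dictname,
       t.tmplname
FROM pg_ts_config c
JOIN pg_ts_config_map m ON m.mapcfg = c.oid
JOIN pg_ts_dict d ON d.oid = m.mapdict
JOIN pg_ts_template t ON t.oid = d.dicttemplate
WHERE c.cfgname = 'german'
ORDER BY m.maptokentype, m.mapseqno;
```

If the dictionary reported for the word is based on the snowball
template, then Snowball is presumably doing the stemming. In psql,
\dF+ german should show the same mapping in a more compact form.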

> In this specific case, the stemmer goes wrong because "System" is a
> foreign word whose ending is atypical for German.  The algorithm cannot
> distinguish between "System" and, say, "lautem" or "bestem".
>
>> Second, and more importantly, as I understand it, the stemmed version of
>> a word should be considered normalized. That is, all other versions of
>> that stem should be mapped to it as well. The interesting problem here
>> is that PostgreSQL maps the stem itself ('system') to a completely
>> different stem ('syst').
>>
>> Should a stem not remain stable even when to_tsvector is called on it
>> multiple times?
> That's a possible position, but consider that a stem is not necessarily
> a valid German word.  If you treat it as a German word (by stemming it),
> the results might not be what you desire.
>
> For example:
>
> test=> select to_tsvector('german', 'linsen');
>   to_tsvector
> -------------
>   'lins':1
> (1 row)
>
> test=> select to_tsvector('german', 'lins');
>   to_tsvector
> -------------
>   'lin':1
> (1 row)

Sure. That might be the problem. It occurs to me that stems (if detected
as such) should be left alone.
If a stem is a real German word, it should be stemmed to itself anyway.
If not, leaving it unstemmed might help to avoid errors.
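
A rough sketch of how such unstable stems could be spotted, assuming the
Snowball dictionary is named german_stem (that name and the word list
are assumptions for illustration). It stems each word once, stems the
result again, and flags the cases where the two differ:

```sql
-- Stem each word, then stem the stem, and compare the two results.
-- ts_lexize returns an array of lexemes; [1] picks the first one.
-- For stop words the array is empty, so stem and is_stable come out NULL.
WITH words(word) AS (
    VALUES ('systeme'), ('systemen'), ('system'), ('linsen'), ('lins')
),
stemmed AS (
    SELECT word,
           (ts_lexize('german_stem', word))[1] AS stem
    FROM words
)
SELECT word,
       stem,
       (ts_lexize('german_stem', stem))[1] AS stem_of_stem,
       stem = (ts_lexize('german_stem', stem))[1] AS is_stable
FROM stemmed;
```

Going by the results quoted above, I would expect 'systeme' and 'linsen'
to show up with is_stable = false, since their stems are stemmed again
to something shorter.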

> I guess that your real problem here is that a search for "system"
> will not find "systeme", which is indeed unfortunate.
> But until somebody can come up with a better stemming algorithm, cases
> like that can always occur.
>
> Yours,
> Laurenz Albe
This might pose a problem in the future, of course. Thanks for pointing
this out as well.

Regards,

--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srkunze@tbz-pariv.de
web: www.tbz-pariv.de

Managing Director: Dr. Reiner Wohlgemuth
Registered office: Chemnitz
Register court: Chemnitz HRB 8543


