Sven R. Kunze wrote:
> the following stemming results made me curious:
>
> select to_tsvector('german', 'systeme'); > 'system':1
> select to_tsvector('german', 'systemes'); > 'system':1
> select to_tsvector('german', 'systems'); > 'system':1
> select to_tsvector('german', 'systemen'); > 'system':1
> select to_tsvector('german', 'system'); > 'syst':1
>
>
> First of all, this seems to be a bug in the German stemmer. Where can I
> fix it?
As far as I understand, the stemmer is not perfect; it is just a "best
effort" at German stemming. It does not have a dictionary of valid German
words, but applies rules that look only at the letters of the word.
This web page describes the algorithm:
http://snowball.tartarus.org/algorithms/german/stemmer.html
I guess that the Snowball folks (and PostgreSQL) would be interested
if you could come up with a better algorithm.
In this specific case, the stemmer goes wrong because "System" is a
foreign word whose ending is atypical for German. The final "-em" looks
just like the inflectional ending in, say, "lautem" or "bestem", and the
algorithm cannot tell the two cases apart.
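To illustrate, here is a rough Python sketch of the first suffix-stripping
step as I read the description on that web page. It is a simplification and
an assumption on my part, not the code PostgreSQL runs: the later steps, the
handling of "ß" and the special case for "niss" are left out, and the names
r1_start and step1 are made up for this example. For the words in this thread
it gives the same stems as shown above.

```python
VOWELS = set("aeiouyäöü")
S_ENDINGS = set("bdfghklmnrt")  # letters allowed directly before a removable "s"


def r1_start(word: str) -> int:
    """Index where region R1 begins: just after the first non-vowel that
    follows a vowel, but never earlier than index 3."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)  # no such position: R1 is empty


def step1(word: str) -> str:
    """Remove at most one inflectional ending, trying the longest first.
    The ending is only removed if it lies entirely inside R1."""
    start = r1_start(word)
    for suffix in ("ern", "em", "er", "en", "es", "e", "s"):
        if word.endswith(suffix):
            cut = len(word) - len(suffix)
            if cut < start:
                return word  # longest matching ending is outside R1
            if suffix == "s" and word[cut - 1] not in S_ENDINGS:
                return word  # "s" is not preceded by a valid letter
            return word[:cut]
    return word


for w in ("systeme", "system", "lautem", "bestem"):
    print(w, "->", step1(w))
# systeme -> system
# system -> syst
# lautem -> laut
# bestem -> best
```

The rule that strips "em" from "lautem" is the same one that strips it from
"system"; nothing in the letters tells the stemmer that one is an inflected
adjective and the other a noun.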
> Second, and more importantly, as I understand it, the stemmed version of
> a word should be considered normalized. That is, all other versions of
> that stem should be mapped to it as well. The interesting problem here
> is that PostgreSQL maps the stem itself ('system') to a completely
> different stem ('syst').
>
> Should a stem not remain stable even when to_tsvector is called on it
> multiple times?
That's a possible position, but consider that a stem is not necessarily
a valid German word. If you feed it back into the stemmer as though it
were a German word, the result might not be what you desire.
For example:
test=> select to_tsvector('german', 'linsen');
to_tsvector
-------------
'lins':1
(1 row)
test=> select to_tsvector('german', 'lins');
to_tsvector
-------------
'lin':1
(1 row)
I guess that your real problem here is that a search for "system"
will not find "systeme", which is indeed unfortunate.
But until somebody can come up with a better stemming algorithm, cases
like that can always occur.
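To make the mismatch concrete, here is a small Python sketch that models a
single-word match as plain equality of the stems. It uses only the results
quoted earlier in this thread; the table and the helper name matches are made
up for illustration, and this is not how PostgreSQL implements the match.

```python
# Stems as reported by to_tsvector('german', ...) earlier in this thread.
OBSERVED_STEMS = {
    "systeme": "system",
    "systemes": "system",
    "systems": "system",
    "systemen": "system",
    "system": "syst",
}


def matches(document_word: str, query_word: str) -> bool:
    """Hypothetical helper: a one-word query hits a one-word document
    when both sides are normalized to the same stem."""
    return OBSERVED_STEMS[document_word] == OBSERVED_STEMS[query_word]


print(matches("systeme", "system"))   # False
print(matches("systeme", "systems"))  # True
```

All the inflected forms find each other, but the base form "system" ends up
on its own because it is reduced to "syst".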
Yours,
Laurenz Albe