Re: [tsvector] to_tsvector called multiple times - Mailing list pgsql-general

From Albe Laurenz
Subject Re: [tsvector] to_tsvector called multiple times
Msg-id A737B7A37273E048B164557ADEF4A58B36615EEF@ntex2010i.host.magwien.gv.at
In response to [tsvector] to_tsvector called multiple times  ("Sven R. Kunze" <srkunze@tbz-pariv.de>)
Responses Re: [tsvector] to_tsvector called multiple times  ("Sven R. Kunze" <srkunze@tbz-pariv.de>)
List pgsql-general
Sven R. Kunze wrote:
> the following stemming results made me curious:
> 
> select to_tsvector('german', 'systeme'); > 'system':1
> select to_tsvector('german', 'systemes'); > 'system':1
> select to_tsvector('german', 'systems'); > 'system':1
> select to_tsvector('german', 'systemen'); > 'system':1
> select to_tsvector('german', 'system'); >  'syst':1
> 
> 
> First of all, this seems to be a bug in the German stemmer. Where can I
> fix it?

As far as I understand, the stemmer is not perfect; it is just a "best
effort" at German stemming.  It does not have a dictionary of valid German
words, but applies suffix-stripping rules based only on the letters that
occur in the word.

This web page describes the algorithm:
http://snowball.tartarus.org/algorithms/german/stemmer.html
I guess that the Snowball folks (and PostgreSQL) would be interested
if you could come up with a better algorithm.

In this specific case, the stemmer goes wrong because "System" is a
foreign word whose ending is atypical for German.  The algorithm cannot
distinguish between "System" and, say, "lautem" or "bestem".
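
To make the rule concrete, here is a minimal Python sketch of the R1
region and step 1 of the algorithm described on the page above.  It is
only an illustration, not the code PostgreSQL runs: it assumes lowercase
input and leaves out the prelude (handling of "ß" and of "u"/"y" between
vowels) as well as steps 2 and 3.  The names r1_start and step1 are made
up for this example.

```python
# Simplified sketch of R1 and step 1 of the Snowball German stemmer.
# Assumptions: lowercase input; prelude and steps 2-3 are omitted.

VOWELS = set("aeiouyäöü")
S_ENDINGS = set("bdfghklmnrt")  # letters that may precede a removable final "s"


def r1_start(word):
    """Index where R1 begins: right after the first non-vowel that follows
    a vowel, moved right if needed so at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)  # no such position: R1 is empty


def step1(word):
    """Find the longest step-1 suffix; remove it only if it lies in R1."""
    r1 = r1_start(word)
    # Groups (a) and (b): longest candidates are checked first.
    for suffix in ("ern", "em", "er", "en", "es", "e"):
        if word.endswith(suffix):
            if len(word) - len(suffix) >= r1:
                word = word[: -len(suffix)]
                # After a group (b) suffix, a trailing "niss" loses its "s".
                if suffix in ("en", "es", "e") and word.endswith("niss"):
                    word = word[:-1]
            return word
    # Group (c): a final "s" directly after a valid s-ending.
    if (
        len(word) >= 2
        and word.endswith("s")
        and len(word) - 1 >= r1
        and word[-2] in S_ENDINGS
    ):
        return word[:-1]
    return word


for w in ("systeme", "system", "lautem", "bestem", "linsen", "lins"):
    print(w, "->", step1(w))
```

For "system", R1 is "tem", so the ending "em" is removed exactly as it is
for "lautem" and "bestem".  Calling step1 on its own result for "systeme"
first gives "system" and then "syst", which is the behaviour seen with
to_tsvector for the words in this thread.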

> Second, and more importantly, as I understand it, the stemmed version of
> a word should be considered normalized. That is, all other versions of
> that stem should be mapped to it as well. The interesting problem here
> is that PostgreSQL maps the stem itself ('system') to a completely
> different stem ('syst').
> 
> Should a stem not remain stable even when to_tsvector is called on it
> multiple times?

That's a possible position, but consider that a stem is not necessarily
a valid German word.  If you treat it as a German word (by stemming it),
the results might not be what you desire.

For example:

test=> select to_tsvector('german', 'linsen');
 to_tsvector
-------------
 'lins':1
(1 row)

test=> select to_tsvector('german', 'lins');
 to_tsvector
-------------
 'lin':1
(1 row)

I guess that your real problem here is that a search for "system"
will not find "systeme", which is indeed unfortunate.
But until somebody can come up with a better stemming algorithm, cases
like that can always occur.

Yours,
Laurenz Albe
