Re: [tsvector] to_tsvector called multiple times - Mailing list pgsql-general
From | Sven R. Kunze |
---|---|
Subject | Re: [tsvector] to_tsvector called multiple times |
Date | |
Msg-id | 5564413F.605@tbz-pariv.de |
In response to | Re: [tsvector] to_tsvector called multiple times (Albe Laurenz <laurenz.albe@wien.gv.at>) |
Responses | Re: [tsvector] to_tsvector called multiple times |
List | pgsql-general |
Thanks Albe for that detailed answer.

On 26.05.2015 11:01, Albe Laurenz wrote:
> Sven R. Kunze wrote:
>> the following stemming results made me curious:
>>
>> select to_tsvector('german', 'systeme'); > 'system':1
>> select to_tsvector('german', 'systemes'); > 'system':1
>> select to_tsvector('german', 'systems'); > 'system':1
>> select to_tsvector('german', 'systemen'); > 'system':1
>> select to_tsvector('german', 'system'); > 'syst':1
>>
>>
>> First of all, this seems to be a bug in the German stemmer. Where can I
>> fix it?
> As far as I understand, the stemmer is not perfect, it is just a "best
> effort" at German stemming. It does not have a dictionary of valid German
> words, but uses an algorithm based on only the occurring letters.
>
> This web page describes the algorithm:
> http://snowball.tartarus.org/algorithms/german/stemmer.html
> I guess that the Snowball folks (and PostgreSQL) would be interested
> if you could come up with a better algorithm.

Thanks for that hint. I will go to
https://github.com/snowballstem/snowball/issues and try to explain my
problem there.

However, are you sure I am using Snowball? Maybe I am reading the
documentation wrong:
http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html
but it seems to depend on which packages (ispell, hunspell, myspell,
snowball + corresponding languages) my system has installed.

Is there an easy way to determine which of these packages PostgreSQL uses
AND what for?

> In this specific case, the stemmer goes wrong because "System" is a
> foreign word whose ending is atypical for German. The algorithm cannot
> distinguish between "System" and, say, "lautem" or "bestem".
>
>> Second, and more importantly, as I understand it, the stemmed version of
>> a word should be considered normalized. That is, all other versions of
>> that stem should be mapped to it as well. The interesting problem here
>> is that PostgreSQL maps the stem itself ('system') to a completely
>> different stem ('syst').
>>
>> Should a stem not remain stable even when to_tsvector is called on it
>> multiple times?
> That's a possible position, but consider that a stem is not necessarily
> a valid German word. If you treat it as a German word (by stemming it),
> the results might not be what you desire.
>
> For example:
>
> test=> select to_tsvector('german', 'linsen');
>  to_tsvector
> -------------
>  'lins':1
> (1 row)
>
> test=> select to_tsvector('german', 'lins');
>  to_tsvector
> -------------
>  'lin':1
> (1 row)

Sure. That might be the problem. It occurs to me that stems (if detected
as such) should be left alone. If a stem is a real German word, it should
be stemmed to itself anyway. If not, it might help not to stem it again,
in order to avoid errors.

> I guess that your real problem here is that a search for "system"
> will not find "systeme", which is indeed unfortunate.
> But until somebody can come up with a better stemming algorithm, cases
> like that can always occur.
>
> Yours,
> Laurenz Albe

This might pose a problem in the future of course. Thanks for pointing
this out as well.

Regards,

-- 
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srkunze@tbz-pariv.de
web: www.tbz-pariv.de
Managing Director: Dr. Reiner Wohlgemuth
Registered office: Chemnitz
Registry court: Chemnitz HRB 8543
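
On the question of which dictionary PostgreSQL actually uses for a word, and what for: the following is only a minimal sketch, assuming a stock installation in which the `german` configuration and the `german_stem` dictionary are the built-in ones (both names are assumptions about the setup, not something confirmed in this thread). It uses `ts_debug` and `ts_lexize` together with the `pg_ts_dict` and `pg_ts_template` catalogs. No expected output is shown, because it depends on the installed configuration.

```sql
-- Show, per token, which dictionaries the 'german' configuration consults
-- and which one actually produced the lexeme.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('german', 'systeme system');

-- List every text search dictionary together with the template it is
-- built on (snowball, ispell, simple, ...).
SELECT d.dictname, t.tmplname
FROM pg_ts_dict AS d
JOIN pg_ts_template AS t ON t.oid = d.dicttemplate
ORDER BY d.dictname;

-- Call one dictionary directly, bypassing the parser and the configuration,
-- to compare stemming a word with stemming its stem.
SELECT ts_lexize('german_stem', 'systeme') AS first_pass,
       ts_lexize('german_stem', 'system')  AS second_pass;
```

In psql, `\dF+ german` should give a similar overview of the token-type to dictionary mapping for a single configuration.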