Thread: [tsvector] to_tsvector called multiple times
Hi everybody, the following stemming results made me curious: select to_tsvector('german', 'systeme'); > 'system':1 select to_tsvector('german', 'systemes'); > 'system':1 select to_tsvector('german', 'systems'); > 'system':1 select to_tsvector('german', 'systemen'); > 'system':1 select to_tsvector('german', 'system'); > 'syst':1 First of all, this seems to be a bug in the German stemmer. Where can I fix it? Second, and more importantly, as I understand it, the stemmed version of a word should be considered normalized. That is, all other versions of that stem should be mapped to it as well. The interesting problem here is that PostgreSQL maps the stem itself ('system') to a completely different stem ('syst'). Should a stem not remain stable even when to_tsvector is called on it multiple times? -- Sven R. Kunze TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920 e-mail: srkunze@tbz-pariv.de web: www.tbz-pariv.de Geschäftsführer: Dr. Reiner Wohlgemuth Sitz der Gesellschaft: Chemnitz Registergericht: Chemnitz HRB 8543
Sven R. Kunze wrote: > the following stemming results made me curious: > > select to_tsvector('german', 'systeme'); > 'system':1 > select to_tsvector('german', 'systemes'); > 'system':1 > select to_tsvector('german', 'systems'); > 'system':1 > select to_tsvector('german', 'systemen'); > 'system':1 > select to_tsvector('german', 'system'); > 'syst':1 > > > First of all, this seems to be a bug in the German stemmer. Where can I > fix it? As far as I understand, the stemmer is not perfect, it is just a "best effort" at German stemming. It does not have a dictionary of valid German words, but uses an algorithm based on only the occurring letters. This web page describes the algorithm: http://snowball.tartarus.org/algorithms/german/stemmer.html I guess that the Snowball folks (and PostgreSQL) would be interested if you could come up with a better algorithm. In this specific case, the stemmer goes wrong because "System" is a foreign word whose ending is atypical for German. The algorithm cannot distinguish between "System" and, say, "lautem" or "bestem". > Second, and more importantly, as I understand it, the stemmed version of > a word should be considered normalized. That is, all other versions of > that stem should be mapped to it as well. The interesting problem here > is that PostgreSQL maps the stem itself ('system') to a completely > different stem ('syst'). > > Should a stem not remain stable even when to_tsvector is called on it > multiple times? That's a possible position, but consider that a stem is not necessarily a valid German word. If you treat it as a German word (by stemming it), the results might not be what you desire. For example: test=> select to_tsvector('german', 'linsen'); to_tsvector ------------- 'lins':1 (1 row) test=> select to_tsvector('german', 'lins'); to_tsvector ------------- 'lin':1 (1 row) I guess that your real problem here is that a search for "system" will not find "systeme", which is indeed unfortunate. But until somebody can come up with a better stemming algorithm, cases like that can always occur. Yours, Laurenz Albe
You can ask http://snowball.tartarus.org/ for stemmer. Meanwhile,
you can have small personal dictionary (before stemmer) with such exceptions, for example, use synonym template system system
On Tue, May 26, 2015 at 11:18 AM, Sven R. Kunze <srkunze@tbz-pariv.de> wrote:
Hi everybody,
the following stemming results made me curious:
select to_tsvector('german', 'systeme'); > 'system':1
select to_tsvector('german', 'systemes'); > 'system':1
select to_tsvector('german', 'systems'); > 'system':1
select to_tsvector('german', 'systemen'); > 'system':1
select to_tsvector('german', 'system'); > 'syst':1
First of all, this seems to be a bug in the German stemmer. Where can I fix it?
Second, and more importantly, as I understand it, the stemmed version of a word should be considered normalized. That is, all other versions of that stem should be mapped to it as well. The interesting problem here is that PostgreSQL maps the stem itself ('system') to a completely different stem ('syst').
Should a stem not remain stable even when to_tsvector is called on it multiple times?
--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srkunze@tbz-pariv.de
web: www.tbz-pariv.de
Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Thanks, Oleg. Unfortunately, that does not work quite well as German is comprised of many compound nouns.
In fact, I discovered that anomaly by searching through a domain-specific word table. For example: Waferhandlingsystem. There are many '*system' but the PostgreSQL does not allow me to have a suffix; only a prefix and only for to_tsquery (http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html#TEXTSEARCH-SYNONYM-DICTIONARY).
Is there another possibility?
On 26.05.2015 11:05, Oleg Bartunov wrote:
In fact, I discovered that anomaly by searching through a domain-specific word table. For example: Waferhandlingsystem. There are many '*system' but the PostgreSQL does not allow me to have a suffix; only a prefix and only for to_tsquery (http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html#TEXTSEARCH-SYNONYM-DICTIONARY).
Is there another possibility?
On 26.05.2015 11:05, Oleg Bartunov wrote:
OlegYou can ask http://snowball.tartarus.org/ for stemmer. Meanwhile,you can have small personal dictionary (before stemmer) with such exceptions, for example, use synonym template
system systemOn Tue, May 26, 2015 at 11:18 AM, Sven R. Kunze <srkunze@tbz-pariv.de> wrote:Hi everybody,
the following stemming results made me curious:
select to_tsvector('german', 'systeme'); > 'system':1
select to_tsvector('german', 'systemes'); > 'system':1
select to_tsvector('german', 'systems'); > 'system':1
select to_tsvector('german', 'systemen'); > 'system':1
select to_tsvector('german', 'system'); > 'syst':1
First of all, this seems to be a bug in the German stemmer. Where can I fix it?
Second, and more importantly, as I understand it, the stemmed version of a word should be considered normalized. That is, all other versions of that stem should be mapped to it as well. The interesting problem here is that PostgreSQL maps the stem itself ('system') to a completely different stem ('syst').
Should a stem not remain stable even when to_tsvector is called on it multiple times?
--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srkunze@tbz-pariv.de
web: www.tbz-pariv.de
Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
-- Sven R. Kunze TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920 e-mail: srkunze@tbz-pariv.de web: www.tbz-pariv.de Geschäftsführer: Dr. Reiner Wohlgemuth Sitz der Gesellschaft: Chemnitz Registergericht: Chemnitz HRB 8543
Thanks Albe for that detailed answer. On 26.05.2015 11:01, Albe Laurenz wrote: > Sven R. Kunze wrote: >> the following stemming results made me curious: >> >> select to_tsvector('german', 'systeme'); > 'system':1 >> select to_tsvector('german', 'systemes'); > 'system':1 >> select to_tsvector('german', 'systems'); > 'system':1 >> select to_tsvector('german', 'systemen'); > 'system':1 >> select to_tsvector('german', 'system'); > 'syst':1 >> >> >> First of all, this seems to be a bug in the German stemmer. Where can I >> fix it? > As far as I understand, the stemmer is not perfect, it is just a "best > effort" at German stemming. It does not have a dictionary of valid German > words, but uses an algorithm based on only the occurring letters. > > This web page describes the algorithm: > http://snowball.tartarus.org/algorithms/german/stemmer.html > I guess that the Snowball folks (and PostgreSQL) would be interested > if you could come up with a better algorithm. Thanks for that hint. I will go to https://github.com/snowballstem/snowball/issues and try to explain my problem there. However, are you sure, I am using snowball? Maybe, I am reading the documenation wrong: http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html but it seems as it depends on which packages (ispell, hunspell, myspell, snowball + corresponding languages) my system has installed. Is there an easy way to determine which of these packages PostgreSQL uses AND what for? > In this specific case, the stemmer goes wrong because "System" is a > foreign word whose ending is atypical for German. The algorithm cannot > distinguish between "System" and, say, "lautem" or "bestem". > >> Second, and more importantly, as I understand it, the stemmed version of >> a word should be considered normalized. That is, all other versions of >> that stem should be mapped to it as well. The interesting problem here >> is that PostgreSQL maps the stem itself ('system') to a completely >> different stem ('syst'). >> >> Should a stem not remain stable even when to_tsvector is called on it >> multiple times? > That's a possible position, but consider that a stem is not necessarily > a valid German word. If you treat it as a German word (by stemming it), > the results might not be what you desire. > > For example: > > test=> select to_tsvector('german', 'linsen'); > to_tsvector > ------------- > 'lins':1 > (1 row) > > test=> select to_tsvector('german', 'lins'); > to_tsvector > ------------- > 'lin':1 > (1 row) Sure. That might be the problem. It occurs to me that stems (if detected as such) should be left alone. In case a stem is real German word, it should be stemmed to itself anyway If not, it might help not to stem in order to avoid errors. > I guess that your real problem here is that a search for "system" > will not find "systeme", which is indeed unfortunate. > But until somebody can come up with a better stemming algorithm, cases > like that can always occur. > > Yours, > Laurenz Albe This might pose a problem in the future of course. Thanks for pointing this out as well. Regards, -- Sven R. Kunze TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920 e-mail: srkunze@tbz-pariv.de web: www.tbz-pariv.de Geschäftsführer: Dr. Reiner Wohlgemuth Sitz der Gesellschaft: Chemnitz Registergericht: Chemnitz HRB 8543
Sven R. Kunze wrote: > However, are you sure, I am using snowball? Maybe, I am reading the > documenation wrong: test=> SELECT * FROM ts_debug('german', 'system'); alias | description | token | dictionaries | dictionary | lexemes -----------+-----------------+--------+---------------+-------------+--------- asciiword | Word, all ASCII | system | {german_stem} | german_stem | {syst} (1 row) test=> \dFd german_stem List of text search dictionaries Schema | Name | Description ------------+-------------+-------------------------------------- pg_catalog | german_stem | snowball stemmer for german language (1 row) > http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html > but it seems as it depends on which packages (ispell, hunspell, myspell, > snowball + corresponding languages) my system has installed. > > Is there an easy way to determine which of these packages PostgreSQL > uses AND what for? If you use a standard PostgreSQL distribution, you will have no ispell dictionary (as the documentation you quote says). You can always list all dictionaries with "\dFd" in psql. > Sure. That might be the problem. It occurs to me that stems (if detected > as such) should be left alone. > In case a stem is real German word, it should be stemmed to itself anyway > If not, it might help not to stem in order to avoid errors. Yes, but that would mean that you have a way to determine from a string whether it is a word or a stem or both, and the software does not do that. Yours, Laurenz Albe
Thanks. It seems as if I have use snowball. So, I go ahead and post my issue at github. Maybe, I have difficulties to understand the relationship/dependencies between all these 'maybe' available dictionary/parser/stemmer packages. What happens if I install all packages for a single language? (hunspell, myspell, ispell, snowball) Are they complementary? Do they replace each other? >>> \dFd List of text search dictionaries Schema | Name | Description ------------+-----------------+----------------------------------------------------------- pg_catalog | danish_stem | snowball stemmer for danish language pg_catalog | dutch_stem | snowball stemmer for dutch language pg_catalog | english_stem | snowball stemmer for english language pg_catalog | finnish_stem | snowball stemmer for finnish language pg_catalog | french_stem | snowball stemmer for french language pg_catalog | german_stem | snowball stemmer for german language pg_catalog | hungarian_stem | snowball stemmer for hungarian language pg_catalog | italian_stem | snowball stemmer for italian language pg_catalog | norwegian_stem | snowball stemmer for norwegian language pg_catalog | portuguese_stem | snowball stemmer for portuguese language pg_catalog | romanian_stem | snowball stemmer for romanian language pg_catalog | russian_stem | snowball stemmer for russian language pg_catalog | simple | simple dictionary: just lower case and check for stopword pg_catalog | spanish_stem | snowball stemmer for spanish language pg_catalog | swedish_stem | snowball stemmer for swedish language pg_catalog | turkish_stem | snowball stemmer for turkish language (16 rows) On 26.05.2015 12:09, Albe Laurenz wrote: > Sven R. Kunze wrote: >> However, are you sure, I am using snowball? Maybe, I am reading the >> documenation wrong: > test=> SELECT * FROM ts_debug('german', 'system'); > alias | description | token | dictionaries | dictionary | lexemes > -----------+-----------------+--------+---------------+-------------+--------- > asciiword | Word, all ASCII | system | {german_stem} | german_stem | {syst} > (1 row) > > test=> \dFd german_stem > List of text search dictionaries > Schema | Name | Description > ------------+-------------+-------------------------------------- > pg_catalog | german_stem | snowball stemmer for german language > (1 row) > >> http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html >> but it seems as it depends on which packages (ispell, hunspell, myspell, >> snowball + corresponding languages) my system has installed. >> >> Is there an easy way to determine which of these packages PostgreSQL >> uses AND what for? > If you use a standard PostgreSQL distribution, you will have no ispell > dictionary (as the documentation you quote says). > You can always list all dictionaries with "\dFd" in psql. > >> Sure. That might be the problem. It occurs to me that stems (if detected >> as such) should be left alone. >> In case a stem is real German word, it should be stemmed to itself anyway >> If not, it might help not to stem in order to avoid errors. > Yes, but that would mean that you have a way to determine from a string > whether it is a word or a stem or both, and the software does not do that. > > Yours, > Laurenz Albe > Regards, -- Sven R. Kunze TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920 e-mail: srkunze@tbz-pariv.de web: www.tbz-pariv.de Geschäftsführer: Dr. Reiner Wohlgemuth Sitz der Gesellschaft: Chemnitz Registergericht: Chemnitz HRB 8543
For future reference: https://github.com/snowballstem/snowball/issues/19 On 26.05.2015 12:29, Sven R. Kunze wrote: > Thanks. It seems as if I have use snowball. So, I go ahead and post my > issue at github. > > > Maybe, I have difficulties to understand the relationship/dependencies > between all these 'maybe' available dictionary/parser/stemmer packages. > > What happens if I install all packages for a single language? > (hunspell, myspell, ispell, snowball) > > Are they complementary? Do they replace each other? > > > >>> \dFd > List of text search dictionaries > Schema | Name | Description > ------------+-----------------+----------------------------------------------------------- > > pg_catalog | danish_stem | snowball stemmer for danish language > pg_catalog | dutch_stem | snowball stemmer for dutch language > pg_catalog | english_stem | snowball stemmer for english language > pg_catalog | finnish_stem | snowball stemmer for finnish language > pg_catalog | french_stem | snowball stemmer for french language > pg_catalog | german_stem | snowball stemmer for german language > pg_catalog | hungarian_stem | snowball stemmer for hungarian language > pg_catalog | italian_stem | snowball stemmer for italian language > pg_catalog | norwegian_stem | snowball stemmer for norwegian language > pg_catalog | portuguese_stem | snowball stemmer for portuguese language > pg_catalog | romanian_stem | snowball stemmer for romanian language > pg_catalog | russian_stem | snowball stemmer for russian language > pg_catalog | simple | simple dictionary: just lower case and > check for stopword > pg_catalog | spanish_stem | snowball stemmer for spanish language > pg_catalog | swedish_stem | snowball stemmer for swedish language > pg_catalog | turkish_stem | snowball stemmer for turkish language > (16 rows) > > > On 26.05.2015 12:09, Albe Laurenz wrote: >> Sven R. Kunze wrote: >>> However, are you sure, I am using snowball? Maybe, I am reading the >>> documenation wrong: >> test=> SELECT * FROM ts_debug('german', 'system'); >> alias | description | token | dictionaries | dictionary >> | lexemes >> -----------+-----------------+--------+---------------+-------------+--------- >> >> asciiword | Word, all ASCII | system | {german_stem} | german_stem >> | {syst} >> (1 row) >> >> test=> \dFd german_stem >> List of text search dictionaries >> Schema | Name | Description >> ------------+-------------+-------------------------------------- >> pg_catalog | german_stem | snowball stemmer for german language >> (1 row) >> >>> http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html >>> but it seems as it depends on which packages (ispell, hunspell, >>> myspell, >>> snowball + corresponding languages) my system has installed. >>> >>> Is there an easy way to determine which of these packages PostgreSQL >>> uses AND what for? >> If you use a standard PostgreSQL distribution, you will have no ispell >> dictionary (as the documentation you quote says). >> You can always list all dictionaries with "\dFd" in psql. >> >>> Sure. That might be the problem. It occurs to me that stems (if >>> detected >>> as such) should be left alone. >>> In case a stem is real German word, it should be stemmed to itself >>> anyway >>> If not, it might help not to stem in order to avoid errors. >> Yes, but that would mean that you have a way to determine from a string >> whether it is a word or a stem or both, and the software does not do >> that. >> >> Yours, >> Laurenz Albe >> > > Regards, > -- Sven R. Kunze TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920 e-mail: srkunze@tbz-pariv.de web: www.tbz-pariv.de Geschäftsführer: Dr. Reiner Wohlgemuth Sitz der Gesellschaft: Chemnitz Registergericht: Chemnitz HRB 8543
Sven R. Kunze wrote: > Maybe, I have difficulties to understand the relationship/dependencies > between all these 'maybe' available dictionary/parser/stemmer packages. > > What happens if I install all packages for a single language? (hunspell, > myspell, ispell, snowball) > > Are they complementary? Do they replace each other? They are all dictionaries. The "text search configuration" determines which dictionaries get applied to which kinds of words in which order. So if you introduce a new dictionary, you either have to modify an existing configuration or efine a new one to use it. Yours, Laurenz Albe
I think I understand now. Thus, the issue at hand could (maybe) be solved by passing words first to one of those more elaborate dictionaries (myspell, hunspell or ispell) and if still necessary then to snowball. Did I get this right? On 26.05.2015 13:38, Albe Laurenz wrote: > Sven R. Kunze wrote: >> Maybe, I have difficulties to understand the relationship/dependencies >> between all these 'maybe' available dictionary/parser/stemmer packages. >> >> What happens if I install all packages for a single language? (hunspell, >> myspell, ispell, snowball) >> >> Are they complementary? Do they replace each other? > They are all dictionaries. > > The "text search configuration" determines which dictionaries get > applied to which kinds of words in which order. > > So if you introduce a new dictionary, you either have to modify > an existing configuration or efine a new one to use it. > > Yours, > Laurenz Albe Regards, -- Sven R. Kunze TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920 e-mail: srkunze@tbz-pariv.de web: www.tbz-pariv.de Geschäftsführer: Dr. Reiner Wohlgemuth Sitz der Gesellschaft: Chemnitz Registergericht: Chemnitz HRB 8543
Sven R. Kunze wrote: > I think I understand now. > > Thus, the issue at hand could (maybe) be solved by passing words first > to one of those more elaborate dictionaries (myspell, hunspell or > ispell) and if still necessary then to snowball. > > Did I get this right? I have never experimented with ispell dictionaries, so I don't know if they replace a snowball dictionary or are used in addition to it. Yours, Laurenz Albe