Re: [to_tsvector] German Compound Words - Mailing list pgsql-general
From | Sven R. Kunze |
---|---|
Subject | Re: [to_tsvector] German Compound Words |
Date | |
Msg-id | 556C08DE.9000102@tbz-pariv.de |
In response to | [to_tsvector] German Compound Words ("Sven R. Kunze" <srkunze@tbz-pariv.de>) |
Responses | Re: [to_tsvector] German Compound Words |
List | pgsql-general |
I actually wanted to minimize the installation effort. Thus, I used the hunspell-de-de package of Debian/Ubuntu.
Give me a second for ispell.
Below, see the hunspell variant for Produktionsintervall/Produktionintervall:
=# select * from ts_debug('public.german_compound', 'Produktionsintervall');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------------------+-------------------------------+-------------+------------------------
asciiword | Word, all ASCII | Produktionsintervall | {german_hunspell,german_stem} | german_stem | {produktionsintervall}
(1 row)
=# select * from ts_debug('public.german_compound', 'Produktionintervall');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+---------------------+-------------------------------+-------------+-----------------------
asciiword | Word, all ASCII | Produktionintervall | {german_hunspell,german_stem} | german_stem | {produktionintervall}
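To make the fall-through between dictionaries easier to see, the output can be narrowed to the columns that matter. This is only a sketch; it assumes the public.german_compound configuration shown in this thread and relies on the standard ts_debug output columns. When the dictionary column shows german_stem, german_hunspell did not recognize the token and it was passed on to the stemmer.

```sql
-- Show only which dictionary recognized each token and what it produced.
-- Whitespace tokens (alias 'blank') are filtered out for readability.
SELECT token, dictionary, lexemes
FROM ts_debug('public.german_compound',
              'Produktionsintervall Produktionintervall')
WHERE alias <> 'blank';
```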
PS: I am posting your answer to the list as well.
On 28.05.2015 19:42, Oleg Bartunov wrote:
For readability it's better to use
select * from ts_debug
I remember there is a problem with correct support of hunspell files. Did you try ispell files?
Also, I found this message http://www.postgresql.org/message-id/dm1ece$2gb5$1@news.hub.org
Try this word - Produktionintervall
On Thu, May 28, 2015 at 6:34 PM, Sven R. Kunze <srkunze@tbz-pariv.de> wrote:
Sure. Here you are:
=# select ts_debug('public.german_compound', 'wasserkraft');
ts_debug
-----------------------------------------------------------------------------------------------------
(asciiword,"Word, all ASCII",wasserkraft,"{german_hunspell,german_stem}",german_stem,{wasserkraft})
=# select ts_debug('public.german_compound', 'schifffahrt');
ts_debug
---------------------------------------------------------------------------------------------------------
(asciiword,"Word, all ASCII",schifffahrt,"{german_hunspell,german_stem}",german_hunspell,{schifffahrt})
=# select ts_debug('public.german_compound', 'blindflansch');
ts_debug
-------------------------------------------------------------------------------------------------------
(asciiword,"Word, all ASCII",blindflansch,"{german_hunspell,german_stem}",german_stem,{blindflansch})
That is my testing configuration:
=# \dF+ german_compound
Text search configuration "public.german_compound"
Parser: "pg_catalog.default"
Token | Dictionaries
-----------------+-----------------------------
asciihword | german_hunspell,german_stem
asciiword | german_hunspell,german_stem
email | simple
file | simple
float | simple
host | simple
hword | german_hunspell,german_stem
hword_asciipart | german_hunspell,german_stem
hword_numpart | simple
hword_part | german_hunspell,german_stem
int | simple
numhword | simple
numword | simple
sfloat | simple
uint | simple
url | simple
url_path | simple
version | simple
word | german_hunspell,german_stem
On 28.05.2015 17:24, Oleg Bartunov wrote:
ts_debug() ?
=# select * from ts_debug('english', 'messages');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+----------------+--------------+----------
asciiword | Word, all ASCII | messages | {english_stem} | english_stem | {messag}
On Thu, May 28, 2015 at 2:05 PM, Sven R. Kunze <srkunze@tbz-pariv.de> wrote:
Hi everybody,
what do I need to do to enable compound word handling in PostgreSQL's tsvector implementation?
I run an Ubuntu 14.04 machine with PostgreSQL 9.3, have installed the package hunspell-de-de, and have already created a new dictionary as described here: http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY
CREATE TEXT SEARCH DICTIONARY german_hunspell (
TEMPLATE = ispell,
DictFile = de_de,
AffFile = de_de,
StopWords = german
);
Furthermore, I created a new test text search configuration (copied from german) and updated all token types that used the german_stem dictionary so that they use german_hunspell first and then german_stem.
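The configuration described above could be created roughly as follows. This is a sketch and not necessarily the exact commands that were used; the name german_compound and the list of token types are taken from the \dF+ output quoted elsewhere in this thread.

```sql
-- Start from the built-in German configuration.
CREATE TEXT SEARCH CONFIGURATION public.german_compound (
    COPY = pg_catalog.german
);

-- For every word-like token type, try german_hunspell first and
-- fall back to german_stem when hunspell does not recognize the token.
ALTER TEXT SEARCH CONFIGURATION public.german_compound
    ALTER MAPPING FOR asciihword, asciiword, hword,
                      hword_asciipart, hword_part, word
    WITH german_hunspell, german_stem;
```

The order of the dictionaries matters: the first one that recognizes a token produces the lexemes, and later ones are not consulted for that token.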
However, to_tsvector still does not split compound words as I would expect, for example:
wasserkraft -> wasserkraft, kraft
schifffahrt -> schifffahrt, fahrt
blindflansch -> blindflansch, flansch
etc.
What have I done wrong here?
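One way to narrow this down is to test the dictionary on its own with ts_lexize, bypassing the text search configuration. A minimal sketch, assuming the german_hunspell dictionary defined above: a NULL result means the dictionary does not know the word at all, so the token would fall through to german_stem unsplit.

```sql
-- Ask the hunspell-based dictionary directly, without any configuration.
-- NULL       : the word is unknown to this dictionary
-- empty array: the word is known but is a stop word
-- otherwise  : the lexemes the dictionary produces, including compound parts
SELECT ts_lexize('german_hunspell', 'wasserkraft');
SELECT ts_lexize('german_hunspell', 'schifffahrt');
SELECT ts_lexize('german_hunspell', 'blindflansch');
```

If these calls return the compound parts but to_tsvector does not, the configuration mapping is the likelier culprit; if they return NULL or only the whole word, the dictionary and affix files (or PostgreSQL's support for them) are the place to look.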
--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srkunze@tbz-pariv.de
web: www.tbz-pariv.de
Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general