Thread: fulltext search and hunspell

fulltext search and hunspell

From
Jens Sauer
Date:
Hey,

I want to use hunspell as a dictionary for the full text search by

* using PostgreSQL 8.4.7
* installing hunspell-de-de, hunspell-de-med
* creating a dictionary:

CREATE TEXT SEARCH DICTIONARY german_hunspell (
    TEMPLATE = ispell,
    DictFile = de_de,
    AffFile = de_de,
    StopWords = german
);

* changing the config

ALTER TEXT SEARCH CONFIGURATION german
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH german_hunspell, german_stem;

* now testing the dictionary with ts_lexize:

SELECT ts_lexize('german_hunspell', 'Schokaladenfarik');
 ts_lexize
-----------

(1 row)

Shouldn't it be something like this:
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
   {sjokoladefabrikk,sjokolade,fabrikk}
(from the 8.4 documentation of PostgreSQL)


The dict and affix files in the tsearch_data directory were
automatically generated by pg_updatedicts.

Is this a problem with the compound-word splitting functionality? Should
I use ispell instead of hunspell?

Thanks

Re: fulltext search and hunspell

From
Oleg Bartunov
Date:
Jens,

could you check the affix file for the line
compoundwords  controlled z

Also, can you provide a link to the dictionary files, so we can check whether
they are supported, since we have only rudimentary support for hunspell.
By the way, it would be nice to have the output from ts_debug() to make sure
the dictionaries are actually used.

Oleg

On Mon, 7 Feb 2011, Jens Sauer wrote:

> [...]

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
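
The ts_debug() check requested above can be sketched as follows. This is a minimal sketch, assuming the german configuration has been altered as in the first message; the explicit column list simply leaves out the description column.

```sql
-- Show which dictionaries are consulted for each token and which one
-- actually recognized it (the "dictionary" column).
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('german', 'Schokoladenfabrik');
```

If the dictionary column shows german_stem rather than german_hunspell, the hunspell dictionary did not recognize the word and the token fell through to the stemmer.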

Re: fulltext search and hunspell

From
Jens Sauer
Date:
Hey,

thanks for your answer.

First I checked the links in the tsearch_data directory:
de_de.affix and de_de.dict are symlinks to the corresponding files in
/var/cache/postgresql/dicts/
Then I recreated them using pg_updatedicts.

This is an extract of the de_de.affix file:

# this is the affix file of the de_DE Hunspell dictionary
# derived from the igerman98 dictionary
#
# Version: 20091006 (build 20100127)
#
# Copyright (C) 1998-2009 Bjoern Jacke <bjoern@j3e.de>
#
# License: GPLv2, GPLv3 or OASIS distribution license agreement
# There should be a copy of both of this licenses included
# with every distribution of this dictionary. Modified
# versions using the GPL may only include the GPL

SET ISO8859-1
TRY esijanrtolcdugmphbyfvkwqxzäüößáéêàâñESIJANRTOLCDUGMPHBYFVKWQXZÄÜÖÉ-.

PFX U Y 1
PFX U   0     un       .

PFX V Y 1
PFX V   0     ver      .

SFX F Y 35
[...]

I cannot find "compoundwords controlled z" there, so I manually added it.

[...]
# versions using the GPL may only include the GPL

compoundwords  controlled z

SET ISO8859-1
TRY esijanrtolcdugmphbyfvkwqxzäüößáéêàâñESIJANRTOLCDUGMPHBYFVKWQXZÄÜÖÉ-.
[...]

Then I restarted PostgreSQL.

Now I get an error:
SELECT * FROM ts_debug('Schokoladenfabrik');
FEHLER:  falsches Affixdateiformat für Flag
CONTEXT:  Zeile 18 in Konfigurationsdatei
»/usr/share/postgresql/8.4/tsearch_data/de_de.affix«: »PFX U Y 1
«
SQL-Funktion »ts_debug« Anweisung 1
SQL-Funktion »ts_debug« Anweisung 1

Which means:
ERROR:  wrong affix file format for flag
CONTEXT:  line 18 of configuration file ...

If I add
COMPOUNDFLAG Z
ONLYINCOMPOUND L

instead of "compoundwords  controlled z"

I don't get an error:

SELECT * FROM ts_debug('Schokoladenfabrik');
   alias   |   description   |       token       |         dictionaries          | dictionary  |      lexemes
-----------+-----------------+-------------------+-------------------------------+-------------+-------------------
 asciiword | Word, all ASCII | Schokoladenfabrik | {german_hunspell,german_stem} | german_stem | {schokoladenfabr}
(1 row)

But it seems that the hunspell dictionary is not working for compound words.

Maybe pg_updatedicts has a bug and generates affix files in the wrong format?

Jens
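
On the restart step: a session reads the dictionary files when it first uses the dictionary and then keeps them cached, so sessions that already loaded the old affix file will not see later edits. A minimal sketch of forcing a reload without restarting the server, following the trick described in the PostgreSQL documentation for ALTER TEXT SEARCH DICTIONARY. The option name dummy is arbitrary and does not need to exist.

```sql
-- "Update" the dictionary definition without changing anything.
-- Removing a nonexistent option is not an error, and the ALTER makes
-- sessions re-read the dictionary and affix files on next use.
ALTER TEXT SEARCH DICTIONARY german_hunspell ( dummy );

-- Re-run the check against the freshly loaded files.
SELECT ts_lexize('german_hunspell', 'Schokoladenfabrik');
```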

2011/2/7 Oleg Bartunov <oleg@sai.msu.su>:
> [...]

Re: fulltext search and hunspell

From
Oleg Bartunov
Date:
Jens,

have you tried the German compound dictionary from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

Oleg
On Tue, 8 Feb 2011, Jens Sauer wrote:

> [...]


Re: fulltext search and hunspell

From
Jens Sauer
Date:
Thanks for this tip.
The German compound dictionary from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ works fine.
I think the problem was the rudimentary support for hunspell dictionaries.

Thanks for your help and your great software!
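
For readers following along, the working setup might look roughly like the sketch below. The thread does not show the actual commands, so this is an assumption throughout: the file base name german_compound and the dictionary name german_ispell are hypothetical, and the downloaded ispell-format files are assumed to have been converted to UTF-8 and placed in the tsearch_data directory with .dict and .affix extensions.

```sql
-- Assumed: files installed as tsearch_data/german_compound.dict and
-- tsearch_data/german_compound.affix (hypothetical names), in UTF-8.
CREATE TEXT SEARCH DICTIONARY german_ispell (
    TEMPLATE = ispell,
    DictFile = german_compound,
    AffFile = german_compound,
    StopWords = german
);

-- Consult the ispell dictionary first and fall back to the stemmer.
ALTER TEXT SEARCH CONFIGURATION german
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH german_ispell, german_stem;

-- With compound support in the affix file, a compound word should
-- come back split into its parts.
SELECT ts_lexize('german_ispell', 'Schokoladenfabrik');
```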

On 08.02.2011 11:34, Oleg Bartunov wrote:
> [...]