German ispell dictionary: error parsing affix file - Mailing list pgsql-general

From Christof König
Subject German ispell dictionary: error parsing affix file
Date
Msg-id f1d0fa760908200915v8eaab5dwb45077eee6117f6a@mail.gmail.com
List pgsql-general
Hi,

I'm trying to get a German ispell dictionary that supports
compound words to work with PostgreSQL 8.3.7. I tried
the following three dictionaries:

- http://ftp.services.openoffice.org/pub/OpenOffice.org/contrib/dictionaries/de_DE_frami.zip
(for OpenOffice 2),
- http://extensions.services.openoffice.org/project/dict-de_DE_frami
(for OpenOffice 3) and
- http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz.

Each file was converted to UTF-8 via iconv. I created the
dictionary with the following command:

CREATE TEXT SEARCH DICTIONARY german_ispell (
    Template = ispell,
    DictFile = de_de_frami,
    AffFile = de_de_frami,
    StopWords = german
);

Then I test it via:

SELECT ts_lexize('german_ispell', 'haustür');

which should result in 'haus' and 'tür'. The first two
dictionaries return nothing at all. Compound words don't seem
to work with those two.

The third one works if I remove all lines containing umlauts
from de_de_frami.affix; it then returns 'haus' and 'tür'. If I
leave those lines in the affix file, I get a syntax error during
parsing:

ERROR:  syntax error
CONTEXT:  line 224 of configuration file
"/usr/local/share/postgresql/tsearch_data/de_de_frami.affix": "   ABE
  > -ABE,äBIN
"

The problem seems to be that PostgreSQL runs on OpenBSD here, which
does not support any locale but C. The affix file contains umlauts
and is encoded in UTF-8, as required by PostgreSQL. But the
parsing probably fails because of the function parse_affentry in
spell.c and the function t_isalpha that it calls.

In t_isalpha there is:

if (clen == 1 || lc_ctype_is_c())
    return isalpha(TOUCHAR(ptr));

which fails for the umlauts in the affix file. Is there any reason to
check for an lc_ctype of C here? The affix file is in UTF-8 and each line
is converted to the encoding used by the database. Why is there
a check for the C locale?

Or am I completely wrong, and this is not the reason the parsing of
the affix file fails?

Thanks for your help.

Christof
