tsearch2, ispell, utf-8 and german special characters - Mailing list pgsql-general

From Markus Wollny
Subject tsearch2, ispell, utf-8 and german special characters
Msg-id 2266D0630E43BB4290742247C891057505BF2D18@dozer.computec.de
Responses Re: tsearch2, ispell, utf-8 and german special characters  (Peter Eisentraut <peter_e@gmx.net>)
List pgsql-general
Hi!
 
Sorry to bother you, but I can't work out how to configure tsearch2 correctly for my setup. I have a 7.4.3 database cluster that was initdb'ed with de_DE@euro as its locale; the database itself uses Unicode encoding.
 
I built and installed contrib/tsearch2 after applying the dump/reload patch http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_7.4.patch.gz as advised by the docs. So far everything looks good: I generated a Snowball stemmer dictionary and an ispell dictionary as described in the docs, and created a new configuration 'default_german' as described there.
 
This works to some extent:
SELECT to_tsvector('default_german',
                           'tsearch2 erlernen ist wie zur Schule zu gehen');
-> 'gehen':10 'schulen':8 'erlernen':3 'tsearch2':2
 
though I don't quite understand why "Schule" is converted to "schulen" and not the other way round, but so be it. My problem, as so often, lies with the non-ASCII characters, namely the German umlauts and the ß.
 
SELECT to_tsvector('default_german',
                           'ich muß tsearch2 begreifen ');
 
returns NULL. So does any phrase that contains ÄÖÜäüß or any other character beyond ASCII.
 
Another thing is the ISpell functionality: the docs are quite vague when it comes to explaining which file(s) to use to create german.med. By ISpell convention, umlauts seem to be represented as A" a" O" o" U" u", and so when I run
 
SELECT lexize('de_ispell', 'Äther');
I receive NULL
 
whereas
SELECT lexize('de_ispell', 'A"ther');
gives me {"a\"ther"} as the result.
 
I downloaded igerman98-20030222.tar.bz2 from http://j3e.de/ispell/igerman98/dict/ which seems to be the recommended ISpell dictionary distribution for the German language, as noted on http://fmg-www.cs.ucla.edu/fmg-members/geoff/ispell-dictionaries.html#German-dicts
 
Of course this distribution contains no german.0 or german.1 files, which would be the obvious counterparts to the english.0 and english.1 files mentioned in the tsearch2 docs; there is, however, a file all.words built on installation, which seems to be the basis for building the hash file later on. The first few lines of this file are:
 
A"bte/N
A"btissin/F
a"chten/DIXY
A"chtens
A"chtung/P
a"chzen/DIXY
a"chzt/EGPX
A"cker/N
In order to get the .med file I ran:
sort -u -t/ +0f -1 +0 -T /usr/tmp -o german.med all.words
 
There is an option to generate another word list via make isowordlist, but this didn't resolve the umlaut issue either, neither in the standard encoding provided in the package nor after conversion to UTF-8 (I tried both with and without a BOM).
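 
For illustration, the ISpell umlaut notation mentioned above (A" a" O" o" U" u") could also be rewritten into real UTF-8 characters by hand, roughly like the sketch below. This is only an assumption about what such a conversion might look like, not what make isowordlist does internally, and nothing here suggests it cures the NULL results. The function name ispell_to_utf8 and the output file name are made up for the example, and ß is left out because its ISpell spelling isn't shown here.

```shell
# Hypothetical helper (not part of igerman98 or tsearch2):
# rewrite ISpell-style umlauts (A" a" O" o" U" u") as literal characters.
# Assumes this script itself is saved as UTF-8, so the replacements
# are emitted as UTF-8 byte sequences.
ispell_to_utf8() {
    sed -e 's/A"/Ä/g' -e 's/a"/ä/g' \
        -e 's/O"/Ö/g' -e 's/o"/ö/g' \
        -e 's/U"/Ü/g' -e 's/u"/ü/g'
}

# Possible use before sorting the list into german.med
# (all.words.utf8 is an assumed file name):
# ispell_to_utf8 < all.words > all.words.utf8
```

Affix flags such as /N or /DIXY pass through untouched, since only a vowel followed by a double quote is replaced.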
 
Has anybody actually managed to get a working configuration with tsearch2 and German language support in a Unicode database? What am I doing wrong? I can't find any more hints in the docs. There is a thread on the OpenFTS mailing list about somewhat similar issues ( http://sourceforge.net/mailarchive/forum.php?thread_id=3979419&forum_id=7671 ), but nothing in it actually helps to resolve this.
 
Kind regards
 
   Markus
