Thread: tsearch2, ispell, utf-8 and german special characters
Hi!
Sorry to bother you, but I just don't know how to get tsearch2 configured correctly for my setup. I've got a 7.4.3 database cluster initdb'ed with de_DE@euro as locale; the database uses Unicode encoding.
I made and installed contrib/tsearch2 after installing the dump/reload patch http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_7.4.patch.gz as advised by the docs. So far everything looks good: I have generated a snowball stemmer dictionary and an ispell dictionary as described in the docs, and created a new configuration 'default_german' as described.
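Roughly, that means entries like the following in tsearch2's configuration tables (just a sketch - the dictionary, affix and stop-file paths here are placeholders; the real settings appear further down in this thread):

-- register an ISpell dictionary based on the ispell_template entry
INSERT INTO pg_ts_dict (dict_name, dict_init, dict_initoption, dict_lexize, dict_comment)
SELECT 'de_ispell', dict_init,
       'DictFile="/usr/lib/ispell/german.med",AffFile="/usr/lib/ispell/german.aff",StopFile="/usr/lib/ispell/german.stop"',
       dict_lexize, 'German ISpell dictionary'
  FROM pg_ts_dict
 WHERE dict_name = 'ispell_template';

-- add the new configuration
INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
VALUES ('default_german', 'default', 'de_DE@euro');

-- copy the token-type mappings from an existing configuration ...
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
SELECT 'default_german', tok_alias, dict_name
  FROM pg_ts_cfgmap
 WHERE ts_name = 'simple';

-- ... and point the latin-word token types at the ispell dictionary
UPDATE pg_ts_cfgmap
   SET dict_name = '{de_ispell}'
 WHERE ts_name = 'default_german'
   AND tok_alias IN ('lword', 'lhword', 'lpart_hword');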
This is working, more or less:
SELECT to_tsvector('default_german',
'tsearch2 erlernen ist wie zur Schule zu gehen');
-> 'gehen':10 'schulen':8 'erlernen':3 'tsearch2':2
though I don't quite understand why "Schule" is converted to "schulen" and not the other way round, but so be it. My problem lies, as so often, with the non-ASCII characters, namely the german umlauts and the ß.
SELECT to_tsvector('default_german',
'ich muß tsearch2 begreifen ');
returns NULL. So does any phrase that contains ÄÖÜäüß or anything else beyond ASCII.
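To narrow down whether it is the parser or the dictionary that is swallowing the umlaut words, the debugging helpers that ship with tsearch2 can be used (a sketch, assuming the parse() and lexize() functions installed by the module):

-- does the parser emit tokens for the umlaut word at all?
SELECT * FROM parse('default', 'ich muß tsearch2 begreifen');

-- does the dictionary recognize the word once it gets there?
-- NULL means "unknown word"; an empty array would mean "stop word"
SELECT lexize('de_ispell', 'muß');

If parse() already drops 'muß', the problem sits in the parser/locale layer rather than in the ispell files.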
Another thing is the ISpell functionality; the docs are quite vague on this part when it comes to explaining which file(s) to use to create german.med. In ISpell conventions, umlauts seem to be represented as A" a" O" o" U" u" and thus when doing
SELECT lexize('de_ispell', 'Äther');
I receive NULL
whereas
SELECT lexize('de_ispell', 'A"ther');
gives me {"a\"ther"}
as result.
I downloaded igerman98-20030222.tar.bz2 from http://j3e.de/ispell/igerman98/dict/ which seems to be the recommended ISpell dictionary distribution for the german language as noted on http://fmg-www.cs.ucla.edu/fmg-members/geoff/ispell-dictionaries.html#German-dicts
Of course there are no german.0 or german.1 files in this distribution which would be the obvious counterparts to english.0 and english.1 mentioned in the tsearch2-docs; there is however a file all.words built on installation, which seems to be the basis for building the hash-file later on. The first few lines of this file are
A"bte/N
A"btissin/F
a"chten/DIXY
A"chtens
A"chtung/P
a"chzen/DIXY
a"chzt/EGPX
A"cker/N
In order to get the .med file I did

sort -u -t/ +0f -1 +0 -T /usr/tmp -o german.med all.words
There is an option to generate another wordlist via make isowordlist, but this didn't resolve the umlaut issue either - neither in the standard encoding provided in the package nor after conversion to UTF-8 (I tried both with and without a BOM).
Now, has anybody actually managed to get a working configuration with tsearch2 and german language support in a Unicode database? What am I doing wrong? I just can't find any more hints in the docs, and there's a topic on the OpenFTS mailing list with somewhat similar issues ( http://sourceforge.net/mailarchive/forum.php?thread_id=3979419&forum_id=7671 ), but nothing that actually helps to resolve it.
Kind regards
Markus
Markus Wollny wrote:
> Sorry to bother you, but I just don't know how to get tsearch2
> configured correctly for my setup. I've got a 7.4.3 database cluster
> initdb'ed with de_DE@euro as locale; the database uses Unicode
> encoding.

This doesn't work (correctly). Either you use de_DE@euro and LATIN9, or you use de_DE.utf8 and UNICODE.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
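A quick way to see which combination a given cluster is actually running (a sketch; getdatabaseencoding() is available in 7.4, and the cluster locale can otherwise be read with pg_controldata):

SELECT getdatabaseencoding();   -- should report UNICODE for a Unicode database
SHOW lc_ctype;                  -- cluster locale, e.g. de_DE.utf8, if this parameter is exposed in your version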
Thanks for your answer. It's probably not sufficient to adjust the current locale settings of the system, so I'll have to dump, re-initdb and reload - am I correct, or is there some procedure involving less downtime than that?

> -----Original Message-----
> From: Peter Eisentraut [mailto:peter_e@gmx.net]
> Sent: Tuesday, 20 July 2004 19:27
> To: Markus Wollny; pgsql-general@postgresql.org; openfts-general@lists.sourceforge.net
> Subject: Re: [GENERAL] tsearch2, ispell, utf-8 and german special characters
>
> This doesn't work (correctly). Either you use de_DE@euro and
> LATIN9, or you use de_DE.utf8 and UNICODE.
>
> --
> Peter Eisentraut
> http://developer.postgresql.org/~petere/
On Wednesday, 21 July 2004 09:36, Markus Wollny wrote:
> Thanks for your answer. It's probably not sufficient to adjust the current
> locale settings of the system, so I'll have to dump, re-initdb and reload -
> am I correct, or is there some procedure involving less downtime than that?

Sorry, no.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Hi!

Okay, I changed locale via initdb and I've got it working to some extent now.

Now I've got some problem with the ISpell dictionary and the stopwords list. Both have been compiled with the de_DE.utf8 locale.

When I

SELECT to_tsvector('default_german',
'Jeden Tag wirst Du ein bisschen älter, aber Du lernst');

I get

'tag':2 'aber':8 'eint':5 'lernen':10 'älter':7 'bisschen':6

I've got three questions regarding this result:
1. Both 'ein' and 'aber' are included in the stopwords file, but they show up in the result, whereas 'jeden', 'wirst' and 'du' are removed correctly - why is the stopword list ignored for the former two?
2. Why does 'ein' appear as 'eint'?
3. Is this result actually no cause for alarm, so that I can deploy tsearch2 to my production databases nevertheless?

I'm using http://j3e.de/ispell/igerman98/dict/igerman98-20030222.tar.bz2 (the latest version of Heinz Knutzen's dictionary) and I've edited its Makefile to use de_DE.utf8 in the locale settings; all.words was indeed the file used to generate the hash, so I guess that I can now be more or less sure that I've actually followed the instructions in the docs precisely. I dropped any references to the german snowball stemmer dictionary which I had configured as fallback, so currently there's only this one dictionary configured for ts_name default_german and tok_alias lhword, lpart_hword and lword (the remaining tok_alias entries are set to use the simple dictionary).

Kind regards

Markus
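One way to tell a missing stop word from an unrecognized word here is to ask the dictionary directly (a sketch; in tsearch2, lexize() returns an empty array for stop words, an array of lexemes for recognized words, and NULL for words the dictionary does not know at all):

SELECT lexize('de_ispell', 'aber');  -- {} means it is being filtered as a stop word
SELECT lexize('de_ispell', 'ein');   -- {eint} would mean the dictionary normalizes it instead of stopping it

If 'ein' comes back as {eint} rather than {}, the dictionary is producing a lexeme for it rather than filtering it, which matches the to_tsvector output above.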
Marcus,

it'd be easier for others if you show your tsearch2 configuration. Btw, what version of pgsql and tsearch2 (any patches applied)? Since I don't know german I can provide only a little help, but I'd like to have some words from you when you get all things working right, so other people would appreciate your experience.

I wouldn't use tsearch2 in production until you understand your problem and get tsearch2 working correctly.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Hi!

I managed to resolve the issue with the unrecognized stop word 'aber': the stopword file was UTF-8-encoded WITH a Byte Order Mark (BOM), which is not recognized (and thus not skipped), so the first word of the stopword file, 'aber', was never matched. After removing the BOM, 'aber' was correctly filtered out as a stop word.

The issue with the unrecognized stop word 'ein', which is converted by to_tsvector to 'eint', remains however. Now here's as much detail as I can provide:

We're using PostgreSQL 7.4.3, initdb'ed to a de_DE.utf8 locale; the database is in UNICODE encoding. I used the tsearch2 module provided in the /contrib directory of the pg7.3.4 sources; I applied the patch from http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_7.4.patch.gz. OS is SuSE 7.3, LC_ALL and all other locale variables are set to de_DE.utf8. Ispell is Version 3.1.20 10/10/95, patch 1.

Here's my tsearch2 config:

=========================================
select * from pg_ts_cfg:

ts_name;prs_name;locale
default;default;C
default_russian;default;ru_RU.KOI8-R
simple;default;
default_german;default;de_DE.utf8

select * from pg_ts_cfgmap where ts_name='default_german':

ts_name;tok_alias;dict_name
default_german;url;{simple}
default_german;host;{simple}
default_german;sfloat;{simple}
default_german;uri;{simple}
default_german;int;{simple}
default_german;float;{simple}
default_german;email;{simple}
default_german;word;{simple}
default_german;hword;{simple}
default_german;nlword;{simple}
default_german;nlpart_hword;{simple}
default_german;part_hword;{simple}
default_german;nlhword;{simple}
default_german;file;{simple}
default_german;uint;{simple}
default_german;version;{simple}
default_german;lhword;{de_ispell}
default_german;lpart_hword;{de_ispell}
default_german;lword;{de_ispell}

select * from pg_ts_dict:

dict_name;dict_init;dict_initoption;dict_lexize;dict_comment
simple;dex_init(text);;dex_lexize(internal,internal,integer);Simple example of dictionary.
en_stem;snb_en_init(text);/var/lib/pgsql/data/base/contrib/english.stop;snb_lexize(internal,internal,integer);English Stemmer. Snowball.
ru_stem;snb_ru_init(text);/var/lib/pgsql/data/base/contrib/russian.stop;snb_lexize(internal,internal,integer);Russian Stemmer. Snowball.
ispell_template;spell_init(text);;spell_lexize(internal,internal,integer);ISpell interface. Must have .dict and .aff files
synonym;syn_init(text);;syn_lexize(internal,internal,integer);Example of synonym dictionary
de_ispell;spell_init(text);DictFile="/usr/lib/ispell/german.med",AffFile="/usr/lib/ispell/german.aff",StopFile="/var/lib/pgsql/data/base/contrib/german.stop";spell_lexize(internal,internal,integer);

select * from pg_ts_parser:

prs_name;prs_start;prs_nexttoken;prs_end;prs_headline;prs_lextype;prs_comment
default;prsd_start(internal,integer);prsd_getlexeme(internal,internal,internal);prsd_end(internal);prsd_headline(internal,internal,internal);prsd_lextype(internal);Parser from OpenFTS v0.34

=========================================

ISpell dictionary:

To generate the german ISpell dictionary, I did

wget http://j3e.de/ispell/igerman98/dict/igerman98-20030222.tar.bz2
bunzip2 igerman98-20030222.tar.bz2
tar -xvf igerman98-20030222.tar
cd igerman98-20030222
joe Makefile   [there I set LANG = de_DE.utf8, LC_ALL = de_DE.utf8, LC_COLLATE = de_DE.utf8]
make
make install
sort -u -t/ +0f -1 +0 -T /usr/tmp -o german.med all.words
cp german.med /usr/lib/ispell/

=========================================

The stopwords file is just a plain text file in UTF-8 encoding with one word per line, like this:

aber
alle
allem
allen
aller
[...]
wollen
wollte
zu
zum
zur
zwar
zwischen

All in all that's 262 words, one on each line. Though the ß characters (sharp s) in the file look broken when doing cat german.stop, everything looks fine in vim and I can enter the character correctly on the command line - I suspect there's something wrong with my SSH terminal (PuTTY) or some misconfiguration between bash and PuTTY.

=========================================

I hope I have provided all the information needed to help clarify whether or not to deploy tsearch2, or what to do in order to receive consistent results. I'd be happy to contribute to the docs for implementing tsearch2 for a german unicode database, once all issues are resolved.

Thank you very much for your help!

Kind regards

Markus
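For completeness, once the stop file issues are sorted out, the mapping and the dictionary can be re-checked directly against the tables shown above (a sketch; the sample sentence is the one from earlier in the thread):

SELECT tok_alias, dict_name
  FROM pg_ts_cfgmap
 WHERE ts_name = 'default_german'
 ORDER BY tok_alias;

SELECT to_tsvector('default_german',
       'Jeden Tag wirst Du ein bisschen älter, aber Du lernst');

If the stop list is being applied, 'ein' and 'aber' should no longer appear in the second result.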