Re: TSearch2 / German compound words / UTF-8 - Mailing list pgsql-general

From Teodor Sigaev
Subject Re: TSearch2 / German compound words / UTF-8
Date
Msg-id 43844675.6060803@sigaev.ru
In response to TSearch2 / German compound words / UTF-8  (Hannes Dorbath <light@theendofthetunnel.de>)
List pgsql-general
> Tsearch/ispell is not able to break this word into parts because of the
> "s" in "Produktion/s/intervall". Misspelling the word as
> "Produktionintervall" fixes it:

Those affixes should be marked as 'affix in middle of compound word'.
The flag is '~'; for an example, look in the norsk dictionary:

flag ~\\:
     [^S]           >        S              #~ advarsel > advarsels-

BTW, we developed and debugged compound word support on the norsk (Norwegian)
dictionary, so look there for examples. We don't know Norwegian ourselves;
Norwegians helped us  :)
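
A minimal sketch of what an analogous entry in german.aff might look like, so
that ispell can split "Produktionsintervall" into "Produktion" + "s" +
"intervall" (the context class and comment are assumptions; check which flags
your dictionary already defines before adding one):

flag ~\\:
     [^S]           >        S              #~ Produktion > Produktions-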



>
>
> The second thing is with UTF-8:
>
> I know there is no support, or no full support yet, but I need to get it as
> good as possible /now/. Is there anything in CVS that I might be able to
> backport to my version, or any other tips? My setup works as far as the dict
> and the stop word files are concerned, but I fear the stemming and mapping of
> umlauts and other special chars does not work as it should. I tried recoding
> the german.aff to UTF-8 as well, but that sometimes breaks it with a regex
> error:

What is in CVS now is a deep alpha version, and only the text parser is
UTF-compliant so far; we are continuing development...


>
> fts=# SELECT ts_debug('dass');
> ERROR:  Regex error in '[^sß]$': brackets [] not balanced
> CONTEXT:  SQL function "ts_debug" statement 1
>
> This seems to happen while it tries to map ss to ß, but anyway, I fear I
> didn't do anything good with that.
>
> As suggested in the "Tsearch2 and Unicode/UTF-8" article I have a second
> snowball dict. The first lines of the stem.h I used start with:
>
>> extern struct SN_env * german_ISO_8859_1_create_env(void);
Can you use ISO-8859-1?
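
If you do stay with ISO-8859-1, one possible way to turn the recoded
dictionary files back is iconv (the file names here are just an example):

     iconv -f UTF-8 -t ISO-8859-1 german.aff.utf8 > german.aff
     iconv -f UTF-8 -t ISO-8859-1 german.dict.utf8 > german.dict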

> So I guess this will not work exactly well with UTF-8 ;p Is there any
> other stem.h I could use? Google hasn't returned much for me :/

http://snowball.tartarus.org/

Snowball can generate a UTF-8 stemmer; see
http://snowball.tartarus.org/runtime/use.html:
     F1 [-o[utput] F2]
        [-s[yntax]]
        [-w[idechars]]  [-u[tf8]] <-------- that's it!
        [-j[ava]]  [-n[ame] C]
        [-ep[refix] S1]  [-vp[refix] S2]
        [-i[nclude] D]
        [-r[untime] P]
At least for Russian there are two stemmers, one for KOI8 and one for UTF-8 (
http://snowball.tartarus.org/algorithms/russian/stem.sbl
http://snowball.tartarus.org/algorithms/russian/stem-Unicode.sbl
); diff shows that they differ only in the stringdef section. So you can make a
UTF-8 stemmer for German.
BUT, I'm afraid that Snowball uses widechars, while Postgres uses multibyte
encoding for UTF-8 internally.
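
For illustration only, an invocation along the lines of the usage above might
look like this (the compiler binary name, file names, and prefix are
assumptions; adjust them to your checkout):

     ./snowball german/stem.sbl -o german_utf8_stem -utf8 \
         -name german_UTF_8 -eprefix german_UTF_8_

The generated stem.c/stem.h pair could then take the place of the
german_ISO_8859_1 one, keeping the widechar-vs-multibyte caveat above in mind.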



--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/
