Re: TSearch2 / German compound words / UTF-8 - Mailing list pgsql-general

From Oleg Bartunov
Subject Re: TSearch2 / German compound words / UTF-8
Msg-id Pine.GSO.4.63.0511231333110.29329@ra.sai.msu.su
In response to TSearch2 / German compound words / UTF-8  (Hannes Dorbath <light@theendofthetunnel.de>)
List pgsql-general
On Wed, 23 Nov 2005, Hannes Dorbath wrote:

> Hi,
>
> I'm on PG 8.0.4, initDB and locale set to de_DE.UTF-8, FreeBSD.
>
> My TSearch config is based on "Tsearch2 and Unicode/UTF-8" by Markus Wollny
> (http://tinyurl.com/a6po4).
>
> The following files are used:
>
> http://hannes.imos.net/german.med          [UTF-8]
> http://hannes.imos.net/german.aff          [ANSI]
> http://hannes.imos.net/german.stop         [UTF-8]
> http://hannes.imos.net/german.stop.ispell  [UTF-8]
>
> german.med is from "ispell-german-compound.tar.gz", available on the TSearch2
> site, recoded to UTF-8.
>
> The first problem is with German compound words and has nothing to do
> with UTF-8:
>
> In German, an "s" is often used to "link" two words into a compound word.
> This is true for many German compound words. TSearch/ispell is not able to
> break those words up; only exact matches work.
>
> An example with "Produktionsintervall" (production interval):
>
> fts=# SELECT ts_debug('Produktionsintervall');
>                                             ts_debug
> --------------------------------------------------------------------------------------------------
> (default_german,lword,"Latin word",Produktionsintervall,"{de_ispell,de}",'produktionsintervall')
>
> TSearch/ispell is not able to break this word into parts, because of the "s"
> in "Produktion/s/intervall". Misspelling the word as "Produktionintervall"
> fixes it:
>
> fts=# SELECT ts_debug('Produktionintervall');
>                                                      ts_debug
> ---------------------------------------------------------------------------------------------------------------------
> (default_german,lword,"Latin word",Produktionintervall,"{de_ispell,de}","'ion' 'produkt' 'intervall' 'produktion'")
>
> How can I fix this / get TSearch to remove/stem the last "s" on a word before
> (re-)searching the dict? Can I modify my dict or hack something else? This is
> a bit of a show stopper :/


I think the right way is to fix the affix file, i.e. add an appropriate rule,
but this is beyond our skill :) You should probably send your
complaints/suggestions to the author of the affix file ("erstellt von
transam", i.e. created by transam; email: transam45@gmx.net; see german.aff).
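
To make clear what such a rule would have to achieve, here is a minimal
Python sketch of compound splitting that tolerates a linking "s". It is only
an illustration of the idea, not how tsearch2 or ispell work internally; the
word set KNOWN_WORDS and the function split_compound are hypothetical names
made up for this example.

```python
# Hypothetical mini-dictionary; a real setup would use the ispell word list.
KNOWN_WORDS = {"produktion", "intervall", "arbeit", "zeit"}

# Linking elements tried between two parts: "" is plain concatenation,
# "s" is the German linking "s" discussed above.
LINKING_ELEMENTS = ("", "s")


def split_compound(word, known=KNOWN_WORDS):
    """Return the dictionary words that make up `word`, or None if no split is found."""
    word = word.lower()
    if word in known:
        return [word]
    for i in range(1, len(word)):
        head = word[:i]
        if head not in known:
            continue
        rest = word[i:]
        for link in LINKING_ELEMENTS:
            if not rest.startswith(link):
                continue
            # Drop the linking element and try to split the remainder.
            tail = split_compound(rest[len(link):], known)
            if tail:
                return [head] + tail
    return None


if __name__ == "__main__":
    print(split_compound("Produktionsintervall"))
    print(split_compound("Produktionintervall"))
```

With this toy dictionary both spellings split into "produktion" and
"intervall", which is the behaviour the affix rule would need to provide.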

>
>
> The second thing is with UTF-8:
>
> I know there is no, or no full support yet, but I need to get it as good as
> it's possible /now/. Is there anything in CVS that I might be able to
> backport to my version or other tips? My setup works, as for the dict and the
> stop word files, but I fear the stemming and mapping of umlauts and other
> special chars does not work as it should. I tried recoding the german.aff to
> UTF-8 as well, but that sometimes breaks it with a regex error:
>
> fts=# SELECT ts_debug('dass');
> ERROR:  Regex error in '[^s??]$': brackets [] not balanced
> CONTEXT:  SQL function "ts_debug" statement 1
>
> This seems to happen while it tries to map ss to ß, but anyway, I fear I
> didn't do anything good with that.

A similar problem was discussed here:
http://sourceforge.net/mailarchive/forum.php?thread_id=6271285&forum_id=7671
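
As a rough illustration of why recoding the affix file can change how its
patterns are read, the Python sketch below encodes one bracket expression in
both encodings. It assumes the garbled character in the error message is "ß"
and that the pattern is consumed byte by byte; both are assumptions, and the
variable names are invented for this example.

```python
# Assumption: the unreadable character in the reported pattern is "ß".
PATTERN = "[^sß]$"

LATIN1_PATTERN = PATTERN.encode("iso-8859-1")
UTF8_PATTERN = PATTERN.encode("utf-8")


def describe(label, raw):
    """Print the byte length and the individual byte values of an encoded pattern."""
    print(label, len(raw), [hex(b) for b in raw])


if __name__ == "__main__":
    # In ISO-8859-1, "ß" is a single byte (0xdf).
    describe("iso-8859-1:", LATIN1_PATTERN)
    # In UTF-8, "ß" becomes two bytes (0xc3 0x9f), so code that works on
    # single bytes sees two separate "characters" inside the brackets.
    describe("utf-8:     ", UTF8_PATTERN)
```

Code that expects one byte per character would therefore see a different
bracket expression after the recode than the affix author intended.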


>
> As suggested in the "Tsearch2 and Unicode/UTF-8" article I have a second
> snowball dict. The first lines of the stem.h I used start with:
>
>> extern struct SN_env * german_ISO_8859_1_create_env(void);
>
> So I guess this will not work exactly well with UTF-8 ;p Is there any other
> stem.h I could use? Google hasn't returned much for me :/
>

As we have mentioned several times, tsearch2 doesn't support UTF-8 and
works only by accident :) We have a working parser with full UTF-8
support, but we still need to rewrite the interfaces to the dictionaries, so
there is nothing useful at the moment. All changes are available in CVS HEAD
(8.2dev).

A backpatch for 8.1 will be available from our site as soon as we complete
UTF-8 support in CVS HEAD. We have no deadlines yet, but we have discussed
support for this project with the OpenACS community (a grant from the
University of Mannheim), so it's possible that we could complete it really
soon (we have no answer yet).


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
