TSearch2 / German compound words / UTF-8 - Mailing list pgsql-general
From | Hannes Dorbath |
---|---|
Subject | TSearch2 / German compound words / UTF-8 |
Date | |
Msg-id | dm1ece$2gb5$1@news.hub.org |
Responses |
TSearch2 / UTF-8 and stat() function
Re: TSearch2 / German compound words / UTF-8 Re: TSearch2 / German compound words / UTF-8 |
List | pgsql-general |
Hi,

I'm on PG 8.0.4 on FreeBSD, with initdb and locale set to de_DE.UTF-8. My TSearch config is based on "Tsearch2 and Unicode/UTF-8" by Markus Wollny (http://tinyurl.com/a6po4).

The following files are used:

http://hannes.imos.net/german.med [UTF-8]
http://hannes.imos.net/german.aff [ANSI]
http://hannes.imos.net/german.stop [UTF-8]
http://hannes.imos.net/german.stop.ispell [UTF-8]

german.med is from "ispell-german-compound.tar.gz", available on the TSearch2 site, recoded to UTF-8.

The first problem is with German compound words and has nothing to do with UTF-8:

In German, an "s" is often used to "link" two words into a compound word. This is true for many German compound words. TSearch/ispell is not able to break those words up; only exact matches work.

An example with "Produktionsintervall" (production interval):

fts=# SELECT ts_debug('Produktionsintervall');
ts_debug
--------------------------------------------------------------------------------------------------
(default_german,lword,"Latin word",Produktionsintervall,"{de_ispell,de}",'produktionsintervall')

TSearch/ispell is not able to break this word into parts because of the "s" in "Produktion/s/intervall". Misspelling the word as "Produktionintervall" fixes it:

fts=# SELECT ts_debug('Produktionintervall');
ts_debug
---------------------------------------------------------------------------------------------------------------------
(default_german,lword,"Latin word",Produktionintervall,"{de_ispell,de}","'ion' 'produkt' 'intervall' 'produktion'")

How can I fix this, i.e. get TSearch to remove/stem the last "s" on a word before (re-)searching the dict? Can I modify my dict or hack something else? This is a bit of a show stopper :/

The second thing is with UTF-8:

I know there is no support, or no full support, yet, but I need to get it working as well as possible /now/. Is there anything in CVS that I might be able to backport to my version, or are there any other tips?
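To make the compound-word request above concrete, here is a minimal Python sketch of the splitting behaviour being asked for. It is not TSearch2 or ispell code, and it does not show how to configure either one. The names LEXICON, LINKERS and split_compound are hypothetical, and the four-word lexicon merely stands in for a real dictionary such as german.med.

```python
# Toy lexicon (assumption): a real setup would load the ispell word list.
LEXICON = {"produktion", "intervall", "produkt", "ion"}

# Allowed joints between parts: plain concatenation, or a linking "s".
LINKERS = ("", "s")


def split_compound(word, lexicon=LEXICON):
    """Return the list of known parts of `word`, or None if it cannot be split."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest known head first, so "produktion" wins over "produkt".
    for i in range(len(word) - 1, 0, -1):
        head = word[:i]
        if head not in lexicon:
            continue
        rest = word[i:]
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            # Skip the linking element, then split the remainder recursively.
            tail = split_compound(rest[len(linker):], lexicon)
            if tail is not None:
                return [head] + tail
    return None


print(split_compound("Produktionsintervall"))
print(split_compound("Produktionintervall"))
```

With this toy lexicon both spellings come out as the same two parts, which is the result wanted from the dictionary. A real implementation would have to restrict which words may take a linking "s", since allowing it everywhere also swallows a genuine leading "s" of the following part.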
My setup works as far as the dict and the stop word files are concerned, but I fear the stemming and the mapping of umlauts and other special chars do not work as they should. I tried recoding german.aff to UTF-8 as well, but that sometimes breaks it with a regex error:

fts=# SELECT ts_debug('dass');
ERROR: Regex error in '[^sãŸ]$': brackets [] not balanced
CONTEXT: SQL function "ts_debug" statement 1

This seems to happen while it tries to map ss to ß, but anyway, I fear I didn't do anything good with that.

As suggested in the "Tsearch2 and Unicode/UTF-8" article, I have a second snowball dict. The first lines of the stem.h I used start with:

> extern struct SN_env * german_ISO_8859_1_create_env(void);

So I guess this will not work exactly well with UTF-8 ;p Is there any other stem.h I could use? Google hasn't returned much for me :/

Thanks for reading and for all your time. I'll consider the donate button after I get this working ;/

-- 
Regards,
Hannes Dorbath
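The mangled character class in the regex error above can be reproduced outside the database. The Python sketch below assumes that the intended affix condition was '[^sß]$' and that something lowercased the UTF-8 file byte by byte as if it were ISO-8859-1. That is a guess about the cause, not a confirmed diagnosis of TSearch2. The helper names latin1_lower and is_valid_utf8 are hypothetical.

```python
# Assumed original affix condition (the error message only shows a mangled form).
PATTERN = "[^sß]$"

# In UTF-8, "ß" becomes the two bytes 0xC3 0x9F.
utf8 = PATTERN.encode("utf-8")


def latin1_lower(data: bytes) -> bytes:
    """Lowercase each byte as if it were a single ISO-8859-1 character."""
    return data.decode("latin-1").lower().encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes form a well-formed UTF-8 string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xC3 ("Ã" in Latin-1) is lowercased to 0xE3 ("ã"); 0x9F is left alone.
mangled = latin1_lower(utf8)

# Viewed as Windows-1252, the damaged bytes look like the error message.
shown = mangled.decode("cp1252")

print(utf8, mangled, shown)
print(is_valid_utf8(utf8), is_valid_utf8(mangled))
```

Under these assumptions the damaged pattern renders exactly as in the error message, and it is no longer well-formed UTF-8: 0xE3 announces a three-byte sequence, so a multibyte-aware parser could plausibly consume the closing bracket as part of that character, which would fit the "brackets [] not balanced" complaint.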