Re: TSearch2 / German compound words / UTF-8 - Mailing list pgsql-general

From Alexander Presber
Subject Re: TSearch2 / German compound words / UTF-8
Date
Msg-id 7C945F17-1564-4232-BADE-F61D9D7395F2@weisshuhn.de
In response to Re: TSearch2 / German compound words / UTF-8  (Teodor Sigaev <teodor@sigaev.ru>)
Responses Re: TSearch2 / German compound words / UTF-8  (Teodor Sigaev <teodor@sigaev.ru>)
Re: TSearch2 / German compound words / UTF-8  (Teodor Sigaev <teodor@sigaev.ru>)
Re: TSearch2 / German compound words / UTF-8  (Teodor Sigaev <teodor@sigaev.ru>)
List pgsql-general
Hello,

Thanks for your efforts, but I still can't get it to work.
I have now tried the Norwegian example. My encoding is ISO-8859 (I never
used UTF-8 because I thought it would be slower; the thread name is
a bit misleading).

So I am using an ISO-8859-15 (LATIN9) database:

   ~/cvs/ssd% psql -l

      Name    |  Owner   | Encoding
   -----------+----------+----------
    postgres  | postgres | LATIN9
    tstest    | aljoscha | LATIN9

and a Norwegian, ISO-8859-encoded dictionary and aff file:

   ~% file tsearch/dict/ispell_no/norwegian.dict
   tsearch/dict/ispell_no/norwegian.dict: ISO-8859 C program text
   ~% file tsearch/dict/ispell_no/norwegian.aff
   tsearch/dict/ispell_no/norwegian.aff: ISO-8859 English text
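
The `file` output above only says "ISO-8859", so a rough cross-check that no UTF-8 has slipped in can be sketched in Python. This is just an illustration: the helper name `looks_like_utf8` and the sample word are made up, and a pure ASCII file would also pass the check, since ASCII is a subset of UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (pure ASCII also qualifies)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A word with non-ASCII letters, encoded once as Latin-9 and once as UTF-8.
# The Latin-9 bytes for "å" and "æ" are not valid UTF-8 sequences.
sample_latin9 = "blåbær".encode("iso8859_15")
sample_utf8 = "blåbær".encode("utf-8")

print(looks_like_utf8(sample_latin9))  # False
print(looks_like_utf8(sample_utf8))    # True
```

For a real file, the same helper could be fed the raw bytes, for example `looks_like_utf8(open(path, "rb").read())`, where `path` points at the dict or aff file.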

The aff file contains the lines:

   compoundwords controlled z
   ...
   #            to compounds only:
   flag ~\\:
      [^S]    > S

and the dictionary contains:

   overtrekk/BCW\z

   (meaning: word can be compound part, intermediary "s" is allowed)
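
To make that expectation concrete, here is a minimal Python sketch of the splitting one would expect from these flags. It is only a toy model and not how tsearch2's ispell dictionary is implemented: the word list `PARTS`, the `LINKERS` tuple and the function `split_compound` are hypothetical names introduced for illustration.

```python
# Toy model of compound splitting with an optional linking "s".
# PARTS stands in for dictionary entries flagged as compound parts;
# it is a hypothetical word list, not read from norwegian.dict.
PARTS = {"overtrekk", "grill"}
LINKERS = ("", "s")  # "" = parts joined directly, "s" = intermediary s


def split_compound(word, parts=PARTS, linkers=LINKERS):
    """Return one split of word into known parts, or None if there is none."""
    if word in parts:
        return [word]
    for i in range(1, len(word)):
        head = word[:i]
        if head not in parts:
            continue
        rest = word[i:]
        for link in linkers:
            if not rest.startswith(link):
                continue
            # Skip the linking element and try to split the remainder.
            tail = split_compound(rest[len(link):], parts, linkers)
            if tail is not None:
                return [head] + tail
    return None


print(split_compound("overtrekkgrill"))   # ['overtrekk', 'grill']
print(split_compound("overtrekksgrill"))  # ['overtrekk', 'grill']
```

In this toy model both spellings resolve to the same parts, which is the behaviour the flags are expected to produce.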

My configuration is:

   tstest=# SELECT * FROM tsearch2.pg_ts_cfg;
     ts_name  | prs_name |   locale
   -----------+----------+------------
    simple    | default  | de_DE@euro
    german    | default  | de_DE@euro
    norwegian | default  | de_DE@euro


Now the test:

   tstest=# SELECT tsearch2.lexize('ispell_no','overtrekksgrill');
    lexize
   --------

   (1 row)

But without the intermediary "s":

   tstest=# SELECT tsearch2.lexize('ispell_no','overtrekkgrill');
                  lexize
   ------------------------------------
    {over,trekk,grill,overtrekk,grill}
   (1 row)


So the compound with the intermediary "s" is simply not recognized, and
no UTF-8 is involved.

Sincerely yours,

Alexander Presber

P.S. Henning: sorry for bothering you with the CC; just ignore it
if you like.


On 27.01.2006 at 18:17, Teodor Sigaev wrote:

> contrib_regression=# insert into pg_ts_dict values (
>          'norwegian_ispell',
>          (select dict_init from pg_ts_dict where dict_name='ispell_template'),
>          'DictFile="/usr/local/share/ispell/norsk.dict" ,'
>          'AffFile ="/usr/local/share/ispell/norsk.aff"',
>          (select dict_lexize from pg_ts_dict where dict_name='ispell_template'),
>          'Norwegian ISpell dictionary'
>    );
> INSERT 16681 1
> contrib_regression=# select lexize('norwegian_ispell','politimester');
>                   lexize
> ------------------------------------------
>  {politimester,politi,mester,politi,mest}
> (1 row)
>
> contrib_regression=# select lexize('norwegian_ispell','sjokoladefabrikk');
>                 lexize
> --------------------------------------
>  {sjokoladefabrikk,sjokolade,fabrikk}
> (1 row)
>
> contrib_regression=# select lexize('norwegian_ispell','overtrekksgrilldresser');
>          lexize
> -------------------------
>  {overtrekk,grill,dress}
> (1 row)
>
> % psql -l
>            List of databases
>         Name        | Owner  | Encoding
> --------------------+--------+----------
>  contrib_regression | teodor | KOI8
>  postgres           | pgsql  | KOI8
>  template0          | pgsql  | KOI8
>  template1          | pgsql  | KOI8
> (4 rows)
>
>
> I'm afraid that is a UTF-8 problem. We just committed multibyte
> support for tsearch2 to CVS HEAD, so you can try it.
>
> Please note that the dict, aff and stopword files should be in the
> server encoding. Snowball sources for German (and other languages)
> in UTF-8 can be found at
> http://snowball.tartarus.org/dist/libstemmer_c.tgz
>
> To all: maybe we should put all of Snowball's stemmers (for all
> available languages and encodings) into the tsearch2 directory?
>
> --
> Teodor Sigaev                      E-mail: teodor@sigaev.ru
>                                       WWW: http://www.sigaev.ru/

