Re: tsearch2: enable non ascii stop words with C locale - Mailing list pgsql-hackers

From Teodor Sigaev
Subject Re: tsearch2: enable non ascii stop words with C locale
Date
Msg-id 45D07FCF.7020407@sigaev.ru
In response to tsearch2: enable non ascii stop words with C locale  (Tatsuo Ishii <ishii@postgresql.org>)
Responses Re: tsearch2: enable non ascii stop words with C locale  (Tatsuo Ishii <ishii@sraoss.co.jp>)
List pgsql-hackers
> Currently tsearch2 does not accept non ascii stop words if locale is
> C. Included patches should fix the problem. Patches against PostgreSQL
> 8.2.3.

I'm not sure the patch's description is correct.

First, the p_islatin() function is used only in the word/lexeme parser, not in
the stop-word code. Second, p_islatin() is used to catch lexemes such as URLs
or HTML entities, so it is important that it recognizes only real Latin
characters. And it works correctly: it calls p_isalpha() (already patched for
your case), then p_isascii(), which should be correct for any encoding under
the C locale.
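
As a rough illustration of that check order, here is a minimal standalone
sketch. It is not the tsearch2 source: the sketch_* names and the (bytes, len)
interface are hypothetical, and treating every multibyte character as
alphabetic under the C locale is an assumption about what the patched
p_isalpha does.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical stand-in for p_isascii: true only for a one-byte
 * character whose value is below 0x80. */
static int sketch_isascii(const unsigned char *bytes, int len)
{
    assert(bytes != 0 && len >= 1);
    return len == 1 && bytes[0] < 0x80;
}

/* Hypothetical stand-in for the patched p_isalpha.
 * ASSUMPTION: under the C locale any multibyte character counts as
 * alphabetic, because isalpha() cannot classify it. */
static int sketch_isalpha(const unsigned char *bytes, int len)
{
    assert(bytes != 0 && len >= 1);
    if (len > 1)
        return 1;
    return isalpha(bytes[0]) != 0;
}

/* Hypothetical stand-in for p_islatin: alphabetic AND ASCII. */
static int sketch_islatin(const unsigned char *bytes, int len)
{
    return sketch_isalpha(bytes, len) && sketch_isascii(bytes, len);
}
```

Under these assumptions, sketch_islatin accepts "a" but rejects the two-byte
sequence 0xD0 0xB8 (a Cyrillic letter in UTF8), even though sketch_isalpha
accepts both.
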
Third (and last):

contrib_regression=# show server_encoding;
 server_encoding
-----------------
 UTF8

contrib_regression=# show lc_ctype;
 lc_ctype
----------
 C

contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
 lexize
--------
 {}

In UTF8, each Russian character takes two bytes.
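
The two-byte claim can be checked from the lead byte of the encoded character.
The following is a small sketch with hypothetical helper names; it looks only
at the lead byte and does not validate the continuation bytes or reject
overlong forms.

```c
#include <assert.h>

/* Byte length of a UTF-8 sequence, judged from its lead byte alone.
 * Returns 0 for a byte that cannot start a sequence (a continuation
 * byte or an out-of-range value). */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;               /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0)
        return 4;               /* 11110xxx */
    return 0;
}

/* Length in bytes of the first character of a non-empty string. */
static int first_char_len(const char *s)
{
    assert(s != 0 && *s != '\0');
    return utf8_seq_len((unsigned char) s[0]);
}
```

For example, the Cyrillic letter U+0438 is encoded as the bytes 0xD0 0xB8, so
first_char_len("\xD0\xB8") reports 2, while first_char_len("a") reports 1.
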



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 

