
Re: tsearch2: enable non ascii stop words with C locale - Mailing list pgsql-hackers

From	Teodor Sigaev
Subject	Re: tsearch2: enable non ascii stop words with C locale
Date	February 12, 2007 10:55:20
Msg-id	45D07FCF.7020407@sigaev.ru
In response to	tsearch2: enable non ascii stop words with C locale (Tatsuo Ishii <ishii@postgresql.org>)
Responses	Re: tsearch2: enable non ascii stop words with C locale
List	pgsql-hackers


> Currently tsearch2 does not accept non ascii stop words if locale is
> C. Included patches should fix the problem. Patches against PostgreSQL
> 8.2.3.

I'm not sure the patch's description is correct.

First, the p_islatin() function is used only in the word/lexeme parser, not in
the stop-word code. Second, p_islatin() is used to catch lexemes such as URLs or
HTML entities, so it is important that it matches only real Latin characters.
And it works correctly: it calls p_isalpha() (already patched for your case),
then it calls p_isascii(), which should be correct for any encoding with the C
locale.
Third (and last):
contrib_regression=# show server_encoding;
 server_encoding
-----------------
 UTF8

contrib_regression=# show lc_ctype;
 lc_ctype
----------
 C

contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
 lexize
--------
 {}
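
To make the second point concrete, here is a minimal sketch of the kind of check
I mean. It is not the tsearch2 source: sketch_isascii() and sketch_islatin() are
hypothetical names, and they work on a single byte rather than on the parser
state. The idea is only that a character counts as "latin" when it is alphabetic
and also falls in the 7-bit ASCII range.

```c
#include <ctype.h>

/* Hypothetical helper: true when the byte is in the 7-bit ASCII range. */
static int
sketch_isascii(unsigned char c)
{
    return c < 0x80;
}

/*
 * Hypothetical helper mirroring the order described above:
 * first the alphabetic test, then the ASCII test.
 * Returns 1 for ASCII letters, 0 for everything else.
 */
static int
sketch_islatin(unsigned char c)
{
    return (isalpha(c) != 0 && sketch_isascii(c)) ? 1 : 0;
}
```

With this sketch, 'a' and 'Z' are accepted, while a digit or any byte with the
high bit set (such as 0xD0) is rejected whatever the alphabetic test says.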

Russian characters take two bytes each in UTF8.
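
The two-byte point can be checked with a small sketch. The helper names below
are hypothetical and are not part of tsearch2. The example letter is the
Cyrillic small letter "и" (U+0438), which UTF8 encodes as the bytes 0xD0 0xB8.
Both bytes have the high bit set, so neither one is an ASCII byte.

```c
#include <stddef.h>

/*
 * Hypothetical helper: length in bytes of a UTF-8 sequence, judged from
 * its lead byte. Returns 0 for a continuation byte or an invalid lead byte.
 */
static int
sketch_utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;               /* plain ASCII */
    if ((lead & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx: two-byte sequence */
    if ((lead & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx: three-byte sequence */
    if ((lead & 0xF8) == 0xF0)
        return 4;               /* 11110xxx: four-byte sequence */
    return 0;
}

/* Hypothetical helper: count the bytes of a string that are outside ASCII. */
static size_t
sketch_count_non_ascii(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if ((unsigned char) *s >= 0x80)
            n++;
    return n;
}
```

For the string "\xD0\xB8" the lead byte reports a two-byte sequence and both
bytes are counted as non-ASCII, while an English word such as "and" has none.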



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
