> Currently tsearch2 does not accept non ascii stop words if locale is
> C. Included patches should fix the problem. Patches against PostgreSQL
> 8.2.3.
I'm not sure the patch's description is correct.
First, the p_islatin() function is used only in the word/lexeme parser, not in
the stop-word code. Second, p_islatin() is used to catch lexemes such as URLs or
HTML entities, so it's important that it match only real Latin characters. And
it works correctly: it calls p_isalpha (already patched for your case), then
p_isascii, which should be correct for any encoding with the C locale.
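
To illustrate the second point, here is a minimal sketch of that check for the
single-byte case only. The sketch_* names are hypothetical stand-ins, not the
actual tsearch2 functions, and the real parser also has a multibyte path that
is not modelled here.

```c
#include <ctype.h>

/* Hypothetical stand-in for p_isascii: true only for 7-bit bytes. */
static int sketch_isascii(unsigned char c)
{
    return c < 0x80;
}

/* Hypothetical stand-in for p_isalpha (single-byte case only). */
static int sketch_isalpha(unsigned char c)
{
    return isalpha(c) != 0;
}

/*
 * A character counts as "latin" only if it is alphabetic AND ASCII.
 * A byte with the high bit set is rejected by the ASCII test no matter
 * what the alphabetic test says.
 */
static int sketch_islatin(unsigned char c)
{
    return sketch_isalpha(c) && sketch_isascii(c);
}
```

With this shape, a byte such as 0xD0 (the first byte of many Cyrillic letters
in UTF8) is never treated as latin, whatever the locale says about it.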
Third (and last):
contrib_regression=# show server_encoding;
 server_encoding
-----------------
 UTF8

contrib_regression=# show lc_ctype;
 lc_ctype
----------
 C

contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
 lexize
--------
 {}

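Russian characters in UTF8 take two bytes each, so the stop word above is
non-ASCII, and lexize still returns {} for it, i.e. it is recognized as a stop
word.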
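
The two-byte claim is easy to check by hand. Below is a small sketch; the helper
names are hypothetical, and the sample letter (U+0438, encoded as 0xD0 0xB8) is
just an illustration, not necessarily the word used in the query above.

```c
#include <stddef.h>
#include <string.h>

/* Count UTF8 characters by skipping continuation bytes (10xxxxxx). */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Return 1 if every byte of s has the high bit set, i.e. none is ASCII. */
static int all_bytes_non_ascii(const char *s)
{
    for (; *s; s++)
        if ((unsigned char) *s < 0x80)
            return 0;
    return 1;
}
```

For the string "\xD0\xB8", strlen() gives 2 while utf8_char_count() gives 1,
and all_bytes_non_ascii() is true: one character, two bytes, neither of them
ASCII.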
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/