> > Currently tsearch2 does not accept non-ASCII stop words if the locale
> > is C. The included patches should fix the problem. Patches are against
> > PostgreSQL 8.2.3.
>
> I'm not sure about the correctness of the patch's description.
>
> First, the p_islatin() function is used only in the word/lexeme parser,
> not in the stop-word code.
I know. My guess is that the parser does not read the stop-word file,
at least with the default configuration.
> Second, the p_islatin() function is used for catching lexemes such as
> URLs or HTML entities, so it's important that it identifies real Latin
> characters. And it works correctly: it calls p_isalpha (already patched
> for your case), then it calls p_isascii, which should be correct for
> any encoding under the C locale.
The original p_islatin is defined as follows:
static int
p_islatin(TParser *prs)
{
    return (p_isalpha(prs) && p_isascii(prs)) ? 1 : 0;
}
So if a character is not ASCII, it returns 0 even if p_isalpha returns
1. Is this what you expect?
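To make the failure mode concrete, here is a minimal standalone C sketch
(my own illustration, not tsearch2 source) of why a bytewise ASCII test
can never accept a multibyte UTF-8 character; the my_isascii helper and
the byte values are my own examples:

#include <stdio.h>

/* ASCII covers code points 0..127 only; every byte of a multibyte
   UTF-8 character has the high bit set, i.e. is >= 0x80. */
static int
my_isascii(unsigned char c)
{
    return c < 0x80;
}

int
main(void)
{
    /* Cyrillic "д" (U+0434) is the two UTF-8 bytes 0xD0 0xB4 */
    unsigned char cyrillic_de[] = {0xD0, 0xB4};
    size_t      i;

    for (i = 0; i < sizeof(cyrillic_de); i++)
        printf("byte 0x%02X: my_isascii = %d\n",
               cyrillic_de[i], my_isascii(cyrillic_de[i]));

    /* Both bytes print 0, so an "alpha && ascii" test like
       p_islatin's yields 0 for the whole character. */
    return 0;
}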
> Third (and last):
> contrib_regression=# show server_encoding;
> server_encoding
> -----------------
> UTF8
> contrib_regression=# show lc_ctype;
> lc_ctype
> ----------
> C
> contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
> lexize
> --------
> {}
>
> Russian characters in UTF8 take two bytes.
In our case, we added JAPANESE_STOP_WORD to english.stop and then ran:
select to_tsvector(JAPANESE_STOP_WORD)
which returned the words even though they are in JAPANESE_STOP_WORD.
With the patches the problem was solved.
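For reference, here is a rough sketch of the kind of change being
discussed. This is my assumption about the approach, not the actual
patch; the sketch_isalpha name and the encoding_is_multibyte flag are
hypothetical:

#include <ctype.h>

/* Hypothetical sketch, not the submitted patch: accept high-bit
   bytes as word characters when the server encoding is multibyte,
   since bytewise isalpha() under the C locale rejects them. */
static int
sketch_isalpha(unsigned char c, int encoding_is_multibyte)
{
    if (encoding_is_multibyte && c >= 0x80)
        return 1;               /* byte of a multibyte character */
    return isalpha(c) ? 1 : 0;  /* plain ASCII classification */
}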
--
Tatsuo Ishii
SRA OSS, Inc. Japan