Thread: tsearch2: enable non ascii stop words with C locale

From
Tatsuo Ishii
Date:
Hi,

Currently tsearch2 does not accept non ascii stop words if locale is
C. Included patches should fix the problem. Patches against PostgreSQL
8.2.3.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
*** wordparser/parser.c~    2007-01-16 00:16:11.000000000 +0900
--- wordparser/parser.c    2007-02-10 18:04:59.000000000 +0900
***************
*** 246,251 ****
--- 246,266 ----
  static int
  p_islatin(TParser * prs)
  {
+     if (prs->usewide)
+     {
+         if (lc_ctype_is_c())
+         {
+             unsigned int c = *(unsigned int*)(prs->wstr + prs->state->poschar);
+ 
+             /*
+              * any non-ascii symbol with multibyte encoding
+              * with C-locale is a latin character
+              */
+             if ( c > 0x7f )
+                 return 1;
+         }
+     }
      return (p_isalpha(prs) && p_isascii(prs)) ? 1 : 0;
  }

Re: tsearch2: enable non ascii stop words with C locale

From
Teodor Sigaev
Date:
> Currently tsearch2 does not accept non ascii stop words if locale is
> C. Included patches should fix the problem. Patches against PostgreSQL
> 8.2.3.

I'm not sure about correctness of patch's description.

First, p_islatin() function is used only in words/lexemes parser, not stop-word 
code. Second, p_islatin() function is used for catching lexemes like URL or HTML 
entities, so, it's important to define real latin characters. And it works 
right: it calls p_isalpha (already patched for your case),  then it calls 
p_isascii which should be correct for any encodings with C-locale.
Third (and last):
contrib_regression=# show server_encoding;
  server_encoding
-----------------
  UTF8
contrib_regression=# show lc_ctype;
  lc_ctype
----------
  C
contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
  lexize
--------
  {}

Russian characters with UTF8 take two bytes.
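
As a rough illustration of the byte counts involved, the following minimal
sketch derives the length of a UTF-8 sequence from its lead byte. It is not
taken from tsearch2; the function name utf8_seq_len is hypothetical. A Cyrillic
letter such as U+0430 starts with 0xD0 (two bytes), while a Japanese kana such
as U+3042 starts with 0xE3 (three bytes).

```c
/*
 * Sketch only: utf8_seq_len is a hypothetical helper, not part of
 * tsearch2 or PostgreSQL. It returns the number of bytes in a UTF-8
 * sequence given its first byte, or -1 if the byte cannot start one.
 */
static int
utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;               /* plain ASCII */
    if ((lead & 0xE0) == 0xC0)
        return 2;               /* e.g. Cyrillic, lead byte 0xD0 */
    if ((lead & 0xF0) == 0xE0)
        return 3;               /* e.g. Japanese kana, lead byte 0xE3 */
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return -1;                  /* continuation byte or invalid lead */
}
```

Calling utf8_seq_len(0xD0) and utf8_seq_len(0xE3) from a small main() shows
the difference between the Russian test above and Japanese text.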



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: tsearch2: enable non ascii stop words with C locale

From
Tatsuo Ishii
Date:
> > Currently tsearch2 does not accept non ascii stop words if locale is
> > C. Included patches should fix the problem. Patches against PostgreSQL
> > 8.2.3.
> 
> I'm not sure about correctness of patch's description.
> 
> First, p_islatin() function is used only in words/lexemes parser, not stop-word 
> code.

I know. My guess is the parser does not read the stop word file at
least with default configuration.

> Second, p_islatin() function is used for catching lexemes like URL or HTML 
> entities, so, it's important to define real latin characters. And it works 
> right: it calls p_isalpha (already patched for your case),  then it calls 
> p_isascii which should be correct for any encodings with C-locale.

original p_islatin is defined as follows:

static int
p_islatin(TParser * prs)
{
    return (p_isalpha(prs) && p_isascii(prs)) ? 1 : 0;
}

So if a character is not ASCII, it returns 0 even if p_isalpha returns
1. Is this what you expect?
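
The difference can be modelled with a minimal standalone sketch. This is not
the tsearch2 source: the sketch_* names are hypothetical stand-ins that take a
wide character value directly, and the rule that every value above 0x7f counts
as alphabetic under the C locale is an assumption based on the "already
patched" remark quoted above.

```c
#include <ctype.h>

/* Hypothetical stand-in for p_isascii: true for values 0..0x7f. */
static int
sketch_isascii(unsigned int c)
{
    return c <= 0x7f;
}

/*
 * Hypothetical stand-in for p_isalpha with C locale and a multibyte
 * encoding. Assumption: any non-ASCII value is treated as alphabetic.
 */
static int
sketch_isalpha(unsigned int c)
{
    if (c > 0x7f)
        return 1;
    return isalpha((int) c) ? 1 : 0;
}

/* Same shape as the original p_islatin: alphabetic AND ASCII. */
static int
sketch_islatin_original(unsigned int c)
{
    return (sketch_isalpha(c) && sketch_isascii(c)) ? 1 : 0;
}

/* Same shape as the proposed patch: non-ASCII short-circuits to 1. */
static int
sketch_islatin_patched(unsigned int c)
{
    if (c > 0x7f)
        return 1;
    return sketch_islatin_original(c);
}
```

For U+3042 (a Japanese kana) the original form returns 0 and the patched form
returns 1; for 'a' both return 1, and for '1' both return 0.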

> Third (and last):
> contrib_regression=# show server_encoding;
>   server_encoding
> -----------------
>   UTF8
> contrib_regression=# show lc_ctype;
>   lc_ctype
> ----------
>   C
> contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
>   lexize
> --------
>   {}
> 
> Russian characters with UTF8 take two bytes.

In our case, we added JAPANESE_STOP_WORD to english.stop and then ran:

select to_tsvector(JAPANESE_STOP_WORD)

which returns the words even though they are in JAPANESE_STOP_WORD.

And with the patches the problem was solved.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: tsearch2: enable non ascii stop words with C locale

From
Teodor Sigaev
Date:
> I know. My guess is the parser does not read the stop word file at
> least with default configuration.

The parser should not read the stopword file: that is the dictionaries' job.

>
> So if a character is not ASCII, it returns 0 even if p_isalpha returns
> 1. Is this what you expect?
No, p_islatin should return true only for latin characters, not for national ones.

>
> In our case, we added JAPANESE_STOP_WORD into english.stop then:
> select to_tsvector(JAPANESE_STOP_WORD)
> which returns the words even though they are in JAPANESE_STOP_WORD.
> And with the patches the problem was solved.

Please show your configuration for lexemes/dictionaries. I suspect that you
have the en_stem dictionary enabled for the lword lexeme type. A better way is
to use the 'simple' dictionary (it supports stop words the same way as en_stem
does) and set it for the nlword, word, part_hword, nlpart_hword, hword, and
nlhword lexeme types. Note: leave en_stem unchanged for any latin word.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: tsearch2: enable non ascii stop words with C locale

From
Tatsuo Ishii
Date:
> > I know. My guess is the parser does not read the stop word file at
> > least with default configuration.
> 
> The parser should not read the stopword file: that is the dictionaries' job.

I'll come up with more detailed info, explaining why stopword file is
not read.

> > So if a character is not ASCII, it returns 0 even if p_isalpha returns
> > 1. Is this what you expect?
> No, p_islatin should return true only for latin characters, not for national ones.

Please give a precise definition of "latin" in the C locale. Are you saying
that a single-byte encoding with range 0-7f is "latin"? If so, it seems to be
exactly the same as ASCII.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: tsearch2: enable non ascii stop words with C locale

From
Teodor Sigaev
Date:
> Please give a precise definition of "latin" in the C locale. Are you saying
> that a single-byte encoding with range 0-7f is "latin"? If so, it seems to be
> exactly the same as ASCII.

p_islatin returns true for ASCII alpha characters.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/