
Re: Latin vs non-Latin words in text search parsing - Mailing list pgsql-hackers

From	Alvaro Herrera
Subject	Re: Latin vs non-Latin words in text search parsing
Date	October 21, 2007 19:00:06
Msg-id	20071021215953.GA12111@alvh.no-ip.org
In response to	Latin vs non-Latin words in text search parsing (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Latin vs non-Latin words in text search parsing Re: Latin vs non-Latin words in text search parsing
List	pgsql-hackers


Tom Lane wrote:

> ISTM that perhaps a more generally useful definition would be
> 
> lword        Only ASCII letters
> nlword        Entirely letters per iswalpha(), but not lword
> word        Entirely alphanumeric per iswalnum(), but not nlword
>         (hence, includes at least one digit)
> 
> However, I am no linguist and maybe I'm missing something.
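
To make the quoted definition concrete, here is a minimal sketch in Python. It is only an illustration, not the parser's code: it assumes Python's str.isalpha() and str.isalnum() are acceptable stand-ins for the locale-dependent C functions iswalpha() and iswalnum(), and the function name classify_quoted and the "other" label are made up for this example.

```python
import unicodedata


def classify_quoted(token: str) -> str:
    """Classify a token under the definition quoted above.

    Assumption: str.isalpha()/str.isalnum() approximate iswalpha()/iswalnum().
    "other" is a hypothetical catch-all, not a real token type.
    """
    # Compose accented letters so that "a" + combining accent counts as one letter.
    token = unicodedata.normalize("NFC", token)
    if not token:
        return "other"
    if all(ch.isascii() and ch.isalpha() for ch in token):
        return "lword"   # only ASCII letters
    if all(ch.isalpha() for ch in token):
        return "nlword"  # all letters, at least one of them non-ASCII
    if all(ch.isalnum() for ch in token):
        return "word"    # alphanumeric, so it contains at least one digit
    return "other"


if __name__ == "__main__":
    for tok in ["caracteres", "carácter", "à", "abc123"]:
        print(tok, classify_quoted(tok))
```

Under this reading, any word containing a single accented letter falls into nlword, even if every other letter is ASCII.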

I tend to agree with the need to redefine the categories.  I am not sure
I agree with this particular definition though.  I would think that a
"Latin word" should include ASCII letters and accented letters, and a
non-Latin word would be one that included only non-ASCII chars.

alvherre=# select * from ts_debug('spanish', 'añadido añadió añadidura');
 Alias |  Description  |   Token   |  Dictionaries  |      Lexized token
-------+---------------+-----------+----------------+--------------------------
 word  | Word          | añadido   | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadió    | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadidura | {spanish_stem} | spanish_stem: {añadidur}
(5 rows)

I would think those would all fit in the "Latin word" category.  This
example is more interesting because it shows a word categorized
differently just because the plural loses the accent:

alvherre=# select * from ts_debug('spanish', 'caracteres carácter');
 Alias |  Description  |   Token    |  Dictionaries  |      Lexized token
-------+---------------+------------+----------------+--------------------------
 lword | Latin word    | caracteres | {spanish_stem} | spanish_stem: {caracter}
 blank | Space symbols |            | {}             |
 word  | Word          | carácter   | {spanish_stem} | spanish_stem: {caract}
(3 rows)

I am not sure if there are any Western European languages where words can
only be formed with non-ASCII chars.  At least in Spanish, accents tend
to be rare.  However, I would think this is also wrong:

alvherre=# select * from ts_debug('french', 'à');
 Alias  |  Description   | Token | Dictionaries  |  Lexized token
--------+----------------+-------+---------------+-----------------
 nlword | Non-latin word | à     | {french_stem} | french_stem:{}
(1 row)

I don't think this is much of a problem, this particular word being
(most likely) a stopword.

So, how about

lword        Entirely letters per iswalpha, with at least one ASCII letter
nlword        Entirely letters per iswalpha
word        Entirely alphanumeric per iswalnum, but not nlword
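
As a rough illustration of the three categories just proposed, the following Python sketch classifies a token by checking them in order. It is a sketch under assumptions, not parser code: str.isalpha() and str.isalnum() stand in for the locale-dependent iswalpha() and iswalnum(), nlword is read as "all letters, none of them ASCII" (what remains once lword has been ruled out), and the name classify_proposed and the "other" label are hypothetical.

```python
import unicodedata


def classify_proposed(token: str) -> str:
    """Classify a token under the lword/nlword/word proposal above.

    Assumption: str.isalpha()/str.isalnum() approximate iswalpha()/iswalnum().
    "other" is a hypothetical catch-all, not a real token type.
    """
    # Compose accented letters so that "a" + combining accent counts as one letter.
    token = unicodedata.normalize("NFC", token)
    if not token:
        return "other"
    if all(ch.isalpha() for ch in token):
        # All letters: one ASCII letter is enough to make it an lword.
        if any(ch.isascii() for ch in token):
            return "lword"
        return "nlword"
    if all(ch.isalnum() for ch in token):
        return "word"  # alphanumeric but not all letters
    return "other"


if __name__ == "__main__":
    for tok in ["añadido", "añadió", "añadidura", "caracteres", "carácter", "à"]:
        print(tok, classify_proposed(tok))
```

With this ordering, all the Spanish examples above land in lword, while a token made up only of non-ASCII letters, such as the French 'à', stays in nlword.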

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
