Re: Latin vs non-Latin words in text search parsing - Mailing list pgsql-hackers

From	Tatsuo Ishii
Subject	Re: Latin vs non-Latin words in text search parsing
Date	October 23, 2007 19:47:28
Msg-id	20071024.074558.51700790.t-ishii@sraoss.co.jp
In response to	Latin vs non-Latin words in text search parsing (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Latin vs non-Latin words in text search parsing
List	pgsql-hackers

Just for clarification.

Are you going to make these changes in the 8.3 beta test period?
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> If I am reading the state machine in wparser_def.c correctly, the
> three classifications of words that the default parser knows are
> 
> lword   Composed entirely of ASCII letters
> nlword  Composed entirely of non-ASCII letters
>         (where "letter" is defined by iswalpha())
> word    Entirely alphanumeric (per iswalnum()), but not above
>         cases
> 
> This classification is probably sane enough for dealing with mixed
> Russian/English text --- IIUC, Russian words will come entirely from
> the Cyrillic alphabet which has no overlap with ASCII letters.  But
> I'm thinking it'll be quite inconvenient for other European languages
> whose alphabets include the base ASCII letters plus other stuff such
> as accented letters.  They will have a lot of words that fall into
> the catchall "word" category, which will mean they have to index
> mixed alpha-and-number words in order to catch all native words.
> 
> ISTM that perhaps a more generally useful definition would be
> 
> lword   Only ASCII letters
> nlword  Entirely letters per iswalpha(), but not lword
> word    Entirely alphanumeric per iswalnum(), but not nlword
>         (hence, includes at least one digit)
> 
> However, I am no linguist and maybe I'm missing something.
> 
> Comments?
> 
>             regards, tom lane

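For readers comparing the two schemes quoted above, here is a minimal
Python sketch of both classifications. It is not the state machine in
wparser_def.c: the function names classify_current and classify_proposed
are hypothetical, and Python's str.isalpha() and str.isalnum() stand in
for iswalpha() and iswalnum(), whose results in C depend on the locale.

```python
def _is_ascii_letter(ch):
    """True for a-z and A-Z only."""
    return ch.isascii() and ch.isalpha()


def classify_current(token):
    """Classification as described for the existing default parser."""
    if not token:
        return None
    if all(_is_ascii_letter(c) for c in token):
        return "lword"      # composed entirely of ASCII letters
    if all(c.isalpha() and not c.isascii() for c in token):
        return "nlword"     # composed entirely of non-ASCII letters
    if all(c.isalnum() for c in token):
        return "word"       # alphanumeric, but neither case above
    return None             # not a word token in this sketch


def classify_proposed(token):
    """Classification under the proposed definition."""
    if not token:
        return None
    if all(_is_ascii_letter(c) for c in token):
        return "lword"      # only ASCII letters
    if all(c.isalpha() for c in token):
        return "nlword"     # all letters, at least one non-ASCII
    if all(c.isalnum() for c in token):
        return "word"       # alphanumeric with at least one non-letter
    return None


if __name__ == "__main__":
    # "caf\u00e9" is "cafe" with an acute accent on the final letter;
    # the second sample is a Russian word written in Cyrillic.
    samples = ["hello", "\u043f\u0440\u0438\u0432\u0435\u0442",
               "caf\u00e9", "abc123"]
    for tok in samples:
        print(tok, classify_current(tok), classify_proposed(tok))
```

With these stand-ins, an accented word such as "café" falls into the
catchall "word" class under the current rules and into "nlword" under
the proposed ones, while pure ASCII, pure Cyrillic, and mixed
letter-and-digit tokens keep their class.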