Re: Latin vs non-Latin words in text search parsing - Mailing list pgsql-hackers

From Gregory Stark
Subject Re: Latin vs non-Latin words in text search parsing
Date
Msg-id 871wbn1luo.fsf@oxford.xeocode.com
In response to Re: Latin vs non-Latin words in text search parsing  ("Heikki Linnakangas" <heikki@enterprisedb.com>)
Responses Re: Latin vs non-Latin words in text search parsing
List pgsql-hackers
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> Alvaro Herrera wrote:
>> Tom Lane wrote:
>>
>>> ISTM that perhaps a more generally useful definition would be
>>>
>>> lword        Only ASCII letters
>>> nlword        Entirely letters per iswalpha(), but not lword
>>> word        Entirely alphanumeric per iswalnum(), but not nlword
>>>         (hence, includes at least one digit)
>> ...
>> I am not sure if there are any Western European languages where words can
>> be formed only with non-ASCII chars.
>
> There are at least two in Swedish: "ö" (island) and "å" (river). They're
> both a bit special because they're just one letter each.

For what it's worth, I did the same search last night and found three French
words, including "çà", which admittedly is likely to be a noise word. Other
dictionaries, such as Italian and Irish, also have one-letter words like this.
The only other language with multi-letter words of this kind is actually
Faroese, with "íð" and "óð".
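
To make the three categories Tom describes concrete, here is a minimal sketch of how tokens would be sorted under that proposal. It is not the parser's code: the function name classify is hypothetical, and Python's str.isalpha() and str.isalnum() stand in for iswalpha() and iswalnum(). The Python methods follow Unicode character properties rather than the current locale, so results could differ from the C functions in some locales.

```python
def classify(token: str) -> str:
    """Classify a token per the proposed lword / nlword / word split.

    Assumption: str.isalpha() and str.isalnum() approximate the C
    functions iswalpha() and iswalnum(); they are Unicode-based and
    do not consult the locale.
    """
    if token.isascii() and token.isalpha():
        return "lword"    # only ASCII letters
    if token.isalpha():
        return "nlword"   # all letters, at least one non-ASCII
    if token.isalnum():
        return "word"     # alphanumeric, so at least one digit
    return "other"        # punctuation, empty string, and so on


if __name__ == "__main__":
    for tok in ["island", "ö", "çà", "íð", "naïve", "abc123", "foo-bar"]:
        print(tok, classify(tok))
```

Under these assumptions the words mentioned above ("ö", "çà", "íð") all land in nlword, as does a mixed word such as "naïve".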

> I like the "aword" name more than "lword", BTW. If we change the meaning
> of the classes, surely we can change the name as well, right?

I'm not very familiar with the use case here. Is there a good reason to want
to abbreviate these names? I think I would expect "ascii", "word", and "token"
for the three categories Tom describes.

> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.

I also wonder about languages like Arabic and Hindi, which do have words, but
I'm not sure they use whitespace as simply as Latin-script languages do.
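
As a small illustration of the point about Japanese quoted above, the sketch below splits text on whitespace only. It uses Python's str.split() as a stand-in for whitespace-driven segmentation, not the text search parser itself; the helper name whitespace_tokens and the sample sentences are my own assumptions.

```python
def whitespace_tokens(text: str) -> list:
    """Split text on runs of whitespace; a stand-in, not the real parser."""
    return text.split()


if __name__ == "__main__":
    english = "the quick brown fox"
    japanese = "東京都に住んでいます"  # sample sentence, contains no spaces

    print(len(whitespace_tokens(english)), whitespace_tokens(english))
    print(len(whitespace_tokens(japanese)), whitespace_tokens(japanese))
```

The Japanese sentence comes back as a single token because it contains no whitespace at all, so any segmentation into words has to come from somewhere other than spacing.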

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

