Re: Latin vs non-Latin words in text search parsing - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | Re: Latin vs non-Latin words in text search parsing |
Date | |
Msg-id | 20071021215953.GA12111@alvh.no-ip.org |
In response to | Latin vs non-Latin words in text search parsing (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Latin vs non-Latin words in text search parsing
Re: Latin vs non-Latin words in text search parsing |
List | pgsql-hackers |
Tom Lane wrote:

> ISTM that perhaps a more generally useful definition would be
>
> lword   Only ASCII letters
> nlword  Entirely letters per iswalpha(), but not lword
> word    Entirely alphanumeric per iswalnum(), but not nlword
>         (hence, includes at least one digit)
>
> However, I am no linguist and maybe I'm missing something.

I tend to agree with the need to redefine the categories. I am not sure I agree with this particular definition though. I would think that a "latin word" should include ASCII letters and accented letters, and a non-latin word would be one that included only non-ASCII chars.

alvherre=# select * from ts_debug('spanish', 'añadido añadió añadidura');
 Alias |  Description  |   Token   |  Dictionaries  |      Lexized token
-------+---------------+-----------+----------------+--------------------------
 word  | Word          | añadido   | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadió    | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadidura | {spanish_stem} | spanish_stem: {añadidur}
(5 rows)

I would think those would all fit in the "latin word" category.

This example is more interesting because it shows a word categorized differently just because the plural loses the accent:

alvherre=# select * from ts_debug('spanish', 'caracteres carácter');
 Alias |  Description  |   Token    |  Dictionaries  |      Lexized token
-------+---------------+------------+----------------+--------------------------
 lword | Latin word    | caracteres | {spanish_stem} | spanish_stem: {caracter}
 blank | Space symbols |            | {}             |
 word  | Word          | carácter   | {spanish_stem} | spanish_stem: {caract}
(3 rows)

I am not sure if there are any Western European languages where words can only be formed with non-ASCII chars. At least in Spanish, accents tend to be rare.
However, I would think this is also wrong:

alvherre=# select * from ts_debug('french', 'à');
 Alias  |  Description   | Token | Dictionaries  |  Lexized token
--------+----------------+-------+---------------+-----------------
 nlword | Non-latin word | à     | {french_stem} | french_stem: {}
(1 row)

I don't think this is much of a problem, this particular word being (most likely) a stopword.

So, how about

lword   Entirely letters per iswalpha, with at least one ASCII
nlword  Entirely letters per iswalpha
word    Entirely alphanumeric per iswalnum, but not nlword

-- 
Alvaro Herrera                 http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
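
To make the difference between the two sets of definitions in this message concrete, here is a minimal Python sketch that classifies a single token both ways. It is only an illustration, not the text search parser's code. The function names are hypothetical, and Python's Unicode-aware `str.isalpha()` and `str.isalnum()` stand in for the locale-dependent `iswalpha()` and `iswalnum()`, which is an assumption that will not hold in every locale.

```python
from typing import Optional


def is_ascii_letter(ch: str) -> bool:
    """True for a-z and A-Z only (hypothetical helper)."""
    return ch.isascii() and ch.isalpha()


def classify_quoted(token: str) -> Optional[str]:
    """Definition quoted at the top of the message:
    lword  = only ASCII letters
    nlword = entirely letters, but not lword
    word   = entirely alphanumeric, but not nlword
    """
    if token and all(is_ascii_letter(c) for c in token):
        return "lword"
    if token.isalpha():      # stand-in for iswalpha() on every character
        return "nlword"
    if token.isalnum():      # stand-in for iswalnum() on every character
        return "word"
    return None              # not a word-like token under this sketch


def classify_proposed(token: str) -> Optional[str]:
    """Definition proposed at the end of the message:
    lword  = entirely letters, with at least one ASCII letter
    nlword = entirely letters (and no ASCII letter)
    word   = entirely alphanumeric, but not nlword
    """
    if token.isalpha():
        if any(is_ascii_letter(c) for c in token):
            return "lword"
        return "nlword"
    if token.isalnum():
        return "word"
    return None


if __name__ == "__main__":
    # Tokens taken from the ts_debug examples in the message.
    for tok in ["caracteres", "car\u00e1cter", "a\u00f1adido", "\u00e0"]:
        print(tok, classify_quoted(tok), classify_proposed(tok))
```

Under the proposed rule an accented Spanish word such as carácter lands in the same category as its unaccented plural, while a token made up solely of non-ASCII letters, such as à, remains an nlword.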