Re: snowball ASCII stemmer configuration - Mailing list pgsql-hackers

From Tom Lane
Subject Re: snowball ASCII stemmer configuration
Msg-id 1301915.1592318237@sss.pgh.pa.us
In response to Re: snowball ASCII stemmer configuration  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: snowball ASCII stemmer configuration
Re: snowball ASCII stemmer configuration
List pgsql-hackers
I wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> Moreover, AFAIK, the following other languages do not use Latin-based 
>> alphabets:

>> arabic      arabic      \
>> greek       greek       \
>> nepali      nepali      \
>> tamil       tamil       \

> Hmm.  I think all of those entries are ones that got added by me while
> absorbing post-2007 Snowball updates, and I confess that I did not think
> about this point.  Maybe these should be changed.

After further reflection, I think these are indeed mistakes and we should
change them all.  The argument for the Russian/English case, AIUI, is
"if we come across an all-ASCII word, it is most certainly not Russian,
and the most likely Latin-based language is English".  Given the world
as it is, I think the same argument works for all non-Latin-alphabet
languages.  Obviously specific applications might have a different idea
of the best fallback language, but that's why we let users make their
own text search configurations.  For general-purpose use, falling back
to English seems reasonable.  And we can be dead certain that applying
a Greek stemmer to an ASCII word will do nothing useful, so the
configuration choice shown above is unhelpful.
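
To illustrate the "users make their own text search configurations" point,
here is a rough sketch of a per-application override, not part of any
proposed change.  It assumes a release that ships the "greek"
configuration; the name "my_greek" is made up for the example, and
english_stem could be swapped for whatever fallback dictionary suits the
application.

```sql
-- Hypothetical user-defined configuration: start from the stock Greek one.
CREATE TEXT SEARCH CONFIGURATION my_greek ( COPY = greek );

-- Route the all-ASCII token types to the English stemmer rather than
-- the Greek one, which cannot do anything useful with ASCII input.
ALTER TEXT SEARCH CONFIGURATION my_greek
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH english_stem;

-- Inspect how a mixed-script string is now processed.
SELECT to_tsvector('my_greek', 'running databases');
```

Non-ASCII words keep whatever mapping the copied configuration already
had; only the three ASCII token types are redirected.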

            regards, tom lane


