Re: snowball ASCII stemmer configuration - Mailing list pgsql-hackers

From Tom Lane
Subject Re: snowball ASCII stemmer configuration
Date
Msg-id 1333705.1592322037@sss.pgh.pa.us
Whole thread Raw
In response to Re: snowball ASCII stemmer configuration  (Mark Dilger <mark.dilger@enterprisedb.com>)
List pgsql-hackers
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> I am a bit surprised to see that you are right about this, because non-latin languages often have
transliteration/romanizationschemes for writing the language in the Latin alphabet, developed before computers had wide
spreadadoption of non-ASCII character sets, and still in use today for text messaging.  I expected to find stemming
rulesfor transliterated words, but can't find any indication of that, neither in the postgres sources, nor in the
snowballsources I pulled from their repo.  Is there some architectural separation of stemming from transliteration such
thatwe'd never need to worry about it?  If snowball ever published stemmers for transliterated text, we might have to
revisitthis issue, but for now your proposed change sounds fine to me. 

Agreed, if the Snowball stemmers worked on romanized texts then the
situation would be different.  But they don't, AFAICS.  Don't know
if that is architectural, or a policy decision, or just lack of
round tuits.

The thing that I actually find a bit shaky in this area is our
architectural decision to route words to different dictionaries
depending on whether they are all-ASCII or not.  AIUI that was
done purely on the basis of the Russian/English case; it would
fail badly if say you wanted to separate Russian from French.
However, I have no great desire to revisit that design right now.

            regards, tom lane



pgsql-hackers by date:

Previous
From: Mark Dilger
Date:
Subject: Re: snowball ASCII stemmer configuration
Next
From: Fujii Masao
Date:
Subject: Re: Review for GetWALAvailability()