On Fri, 23 Aug 2002, Christopher Kings-Lynne wrote:
> Hi guys,
>
> Hate to keep coming up with these bugs without patches - but I really don't
> have time to look into the source code atm :(
>
> OK, attached is an example of the problem. Notice how trademarks and
> copyright symbols are being indexed along with the word. This means that if
> someone searches for 'balance' in the above data set, they won't find
> anything.
>
> I'm not sure how this would be handled. In the English language, it'd
> probably be safe to say that high ascii characters would be stripped from
> the index? But you'd want to leave accents and stuff in I guess. Tricky.
Rather tricky. The problem is that we don't know how to get flex to work
with locales. The parser recognizes Latin words ([a-zA-Z]), non-Latin words
([\0200-\0377]) and mixed words ([a-zA-Z\0200-\0377]). Your case (balanceR)
is a mixed word. The right way is to have a locale-aware parser that
recognizes words properly. We are inclined to drop flex.
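
To make the classification concrete, here is a rough sketch in Python (not
the actual flex rules) of the three character classes above. It assumes the
trailing symbol in your example is a registered-trademark sign stored as the
single high byte 0xAE (ISO-8859-1); the names classify() and is_word_char()
are made up for illustration. The second function uses Python's Unicode
character properties as a stand-in for a locale-aware test, only to show the
kind of distinction such a parser could make.

```python
import re

# Byte classes mirroring the ones described above (octal 0200-0377 == 0x80-0xFF).
LATIN = re.compile(rb"[a-zA-Z]+")
NONLATIN = re.compile(rb"[\x80-\xff]+")
MIXED = re.compile(rb"[a-zA-Z\x80-\xff]+")


def classify(token: bytes) -> str:
    """Return the word class of a single-byte-encoded token (hypothetical helper)."""
    if LATIN.fullmatch(token):
        return "latin"
    if NONLATIN.fullmatch(token):
        return "nonlatin"
    if MIXED.fullmatch(token):
        return "mixed"
    return "other"


def is_word_char(ch: str) -> bool:
    """Stand-in for a locale-aware test: true for letters, false for symbols."""
    return ch.isalpha()


# Assumption: the word is followed by U+00AE, encoded as one byte in ISO-8859-1.
word = "balance\u00ae".encode("latin-1")

print(classify(b"balance"))    # latin
print(classify(word))          # mixed, so it is indexed as one word
print(is_word_char("\u00e9"))  # True: an accented letter stays part of a word
print(is_word_char("\u00ae"))  # False: the trademark sign could end the word
```

With byte ranges alone, every byte above 0177 looks the same, so accented
letters and symbols cannot be told apart; a character-property test is what
would let the parser keep the former and cut the latter.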
>
> Anyway, just bringing it to your attention...
>
> Chris
>
Regards, Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83