Re: BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド' - Mailing list pgsql-bugs

From Pavel Stehule
Subject Re: BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド'
Date
Msg-id CAFj8pRALjAQmCjQ+NiCPpob+dAprBFPb2XqZPeYDHEjdJmYK9A@mail.gmail.com
In response to Re: BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド'  (Francisco Olarte <folarte@peoplecall.com>)
Responses Re: BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド'  (Francisco Olarte <folarte@peoplecall.com>)
List pgsql-bugs
Hi

On Wed, 29 Nov 2023 at 9:13, Francisco Olarte <folarte@peoplecall.com> wrote:
Hi Jeff:

On Wed, 29 Nov 2023 at 03:40, Jeff Janes <jeff.janes@gmail.com> wrote:

I am not going to generally discuss this:
> But isn't it generally the case that removing accents might make you land on a different word with a different meaning?

But this one is a bad example,
> 'ano' and  'año' for example mean different things in Spanish (but unaccent removes it anyway, at least in one out of four attempts to get the non-7-bit-ASCII wedged through my terminal and into the function).

N and Ñ are different letters in Spanish. Ñ looks like an accented N,
can be typed as such, and some unaccent rules in some programs may make
them equal, but Ñ is as different from N as it is from Z ( I am Spanish,
and in case you want some authority link see
https://www.rae.es/dpd/%C3%B1 ). It has its own pages in the dictionary
( even on paper, I just checked in case my memory fails ).

We used to have also CH and LL as letters, but they were dropped
"recently" ( that meaning this century, I'm getting old ).

On the other "accents", à,è,ì,ò,ù can generally be unaccented w/o
problem; although they may change meaning in some corner cases, I do
not remember seeing them do that since the special examples in school.
Another thing is ü, which is used in our "special" handling of hard/soft
vowels after g, i.e., you do not pronounce the u in "reguero" ( but it
modifies how you pronounce the g, differently from agente ), but in
"agüero" you do pronounce it.

But Ñ is a proper letter, you cannot break it. Our alphabet goes m-n-ñ-o-p-q.

Some users use unaccent for transformation to 7-bit ASCII.

In the Czech language I can find more examples where removing diacritics means a significant loss, and the meaning of the word can be determined only from context.

Žár (the heat) -> zar
Zář (the shine) -> zar
Být (to be) -> byt
Byt (the flat) -> byt

And with unaccent we expect this loss.
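
To make the collision concrete, here is a minimal Python sketch. It is only an approximation under stated assumptions: it strips Unicode combining marks after NFD decomposition, which is not how PostgreSQL's unaccent works (unaccent uses a rules file), and the helper name strip_diacritics is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Approximate accent removal (hypothetical helper, not PostgreSQL's unaccent).

    Decompose each character into base letter + combining marks (NFD),
    then drop the combining marks.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Word pairs from this thread that differ only by diacritics.
pairs = [("Žár", "Zář"), ("Být", "Byt"), ("año", "ano")]

for first, second in pairs:
    a = strip_diacritics(first).lower()
    b = strip_diacritics(second).lower()
    # Both members of each pair end up as the same string.
    print(f"{first} -> {a}, {second} -> {b}, collide: {a == b}")
```

Each pair collapses to a single string after stripping, so the original word can be recovered only from context.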

So my question is: can the unaccent function be used for transformation to 7-bit ASCII, or is that the wrong usage?

Regards

Pavel
 

Francisco Olarte.

P.S. to really sound Spanish, we would have picked "cono" for the
examples :-p

FO

