Re: BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド' - Mailing list pgsql-bugs

From Jeff Janes
Subject Re: BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド'
Date
Msg-id CAMkU=1xvF9NMPJgXTULGYw-5KqH5xduEPDqOT7gvbH2SRWJK-A@mail.gmail.com
In response to Re: BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド'  (Michael Paquier <michael@paquier.xyz>)
Responses Re: BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド'  (Francisco Olarte <folarte@peoplecall.com>)
List pgsql-bugs


On Tue, Nov 28, 2023 at 8:06 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Nov 28, 2023 at 09:58:35AM -0500, Tom Lane wrote:
>> PG Bug reporting form <noreply@postgresql.org> writes:
>>> PostgreSQL's unaccent module does not use Unicode normalisation, but only a
>>> simple search-and-replace dictionary. The dictionary, unaccent.rules
>>> (https://github.com/postgres/postgres/blob/master/contrib/unaccent/unaccent.rules),
>>> does not contain these Japanese characters, thus its unable to remove
>>> the diacritic signs. Can someone please guide when we can expect these
>>> Japanese characters will be added.
>>
>> unaccent.rules, as distributed, is just an example.  It is not meant
>> to be exhaustive or authoritative.
>
> FWIW, I'm quite fluent in Japanese and was discussing a bit this
> around me and, like me, folks were kind of troubled with the concept
> that these should be considered as "accents", because it would
> entirely change the meaning of what each Hiragana and Katakana means.

But isn't it generally the case that removing accents might make you land on a different word with a different meaning?

'ano' and 'año', for example, mean different things in Spanish (but unaccent removes the tilde anyway, at least in one out of four attempts to get the non-7-bit-ASCII character wedged through my terminal and into the function).

That doesn't mean that unaccent is required to do it, of course. But the possibility of changing the meaning doesn't seem like a reason not to do it.
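
To make the distinction from the original report concrete, here is a minimal Python sketch (not PostgreSQL code) contrasting a search-and-replace table with Unicode normalisation. The two-entry RULES table and both function names are hypothetical and only stand in for the idea of a rules file; the real unaccent.rules is far larger. Escapes are used instead of literal characters so the example does not depend on terminal encoding.

```python
import unicodedata

# Hypothetical miniature rules table: accented character -> replacement.
# Illustrative only; it is not a copy of any part of unaccent.rules.
RULES = {"\u00f1": "n", "\u00e9": "e"}  # n-tilde, e-acute


def strip_by_rules(text, rules=RULES):
    """Search-and-replace: characters missing from the table pass through unchanged."""
    return "".join(rules.get(ch, ch) for ch in text)


def strip_by_normalization(text):
    """Decompose to NFD, then drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


word = "a\u00f1o"   # the Spanish word from the example above
kana = "\u30c9"     # katakana DO, the character from the bug report

print(strip_by_rules(word))           # ano
print(strip_by_rules(kana))           # unchanged: not in the table
print(strip_by_normalization(word))   # ano
print(strip_by_normalization(kana))   # katakana TO (U+30C8): a different syllable
```

Under the table approach the katakana survives only because nobody listed it; under the normalisation approach it is altered just as the Spanish word is, which is the meaning-changing effect being debated here.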

Cheers,

Jeff
