Re: BUG #18362: unaccent rules and Old Greek text - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: BUG #18362: unaccent rules and Old Greek text
Msg-id ZdvLGeJ1BsXRkrdQ@paquier.xyz
In response to Re: BUG #18362: unaccent rules and Old Greek text  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: BUG #18362: unaccent rules and Old Greek text
List pgsql-bugs
On Sun, Feb 25, 2024 at 04:21:36PM +1300, Thomas Munro wrote:
> On Sun, Feb 25, 2024 at 11:14 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> So, there are reasons to keep the current unaccent.rules as it is, but...
>> there are other reasons to add a few lines to it, f.e. after line 955 and
>> insert five greek vowels with Oxia
>> Please add:
>> ά       α
>> έ       ε
>> ή       η
>> ί       ι
>> ό       ο
>> ύ       υ
>> ώ       ω

Correct me if I'm wrong, of course, but from reading a bit on the
matter at [1], letters with Tonos and Oxia have been equivalent since
1986, and we only include the characters with Tonos in our
unaccent.rules.
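
As a quick way to check that equivalence, here is a minimal sketch
using Python's standard unicodedata module.  The helper name
same_after_nfc() is only for illustration and is not something that
exists in the tree; the idea is that the Oxia code point carries a
singleton canonical decomposition to the Tonos one, so both should
normalize to the same string:

```python
import unicodedata

OXIA_ALPHA = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA
TONOS_ALPHA = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS


def same_after_nfc(a: str, b: str) -> bool:
    """Return True if both strings are identical once NFC-normalized."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # The raw code points differ, but their normalized forms should not.
    print(OXIA_ALPHA == TONOS_ALPHA)
    print(same_after_nfc(OXIA_ALPHA, TONOS_ALPHA))
```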

> We don't exactly maintain this list manually, we extract it from
> Unicode source data.  Can you see what needs to be adjusted in here to
> achieve that goal?

See commits like e3dd7c06e627 or 59f47fb98dab for some references.
Unfortunately, our policy has been not to backpatch any changes to
the in-core rules file, though you can plug in your own file.  That
said, these sound like a natural addition seen from here.

> Perhaps a new range or something like that?

It seems to me that it is a bit more complicated than that, because
UnicodeData.txt decomposes the characters with Oxia into characters
with Tonos, and then the characters with Tonos are decomposed into the
"base" alphabet characters + Tonos.  We do a recursive lookup of the
Unicode table in get_plain_letter() and is_letter_with_marks(), so it
seems to me that we're not missing much, and I suspect that there
should be no need for a new custom range of characters.
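
To illustrate the two-step decomposition, here is a rough sketch that
follows the canonical decompositions with Python's unicodedata module.
It is not the code of get_plain_letter(); strip_marks() and OXIA_VOWELS
are names made up for this example, and the sketch simply assumes that
a letter with marks decomposes into exactly one base character plus
combining marks:

```python
import unicodedata

# The seven lowercase Greek vowels with Oxia (U+1F71 .. U+1F7D, odd code points).
OXIA_VOWELS = [chr(cp) for cp in range(0x1F71, 0x1F7E, 2)]


def strip_marks(ch: str) -> str:
    """Recursively follow canonical decompositions down to the base letter."""
    decomp = unicodedata.decomposition(ch)
    # No decomposition, or a compatibility one (tagged with "<...>"): stop here.
    if not decomp or decomp.startswith("<"):
        return ch
    parts = [chr(int(cp, 16)) for cp in decomp.split()]
    # Keep only non-combining characters; marks have a non-zero combining class.
    bases = [p for p in parts if unicodedata.combining(p) == 0]
    if len(bases) != 1:
        return ch
    # Oxia -> Tonos -> base letter takes two levels of recursion.
    return strip_marks(bases[0])


if __name__ == "__main__":
    for vowel in OXIA_VOWELS:
        # Same "source<TAB>target" layout as a line of unaccent.rules.
        print(f"{vowel}\t{strip_marks(vowel)}")
```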

Cees, perhaps you would like to take a shot at that?

[1]: https://en.wikipedia.org/wiki/Greek_diacritics#Unicode
--
Michael
