On Sun, Feb 25, 2024 at 04:21:36PM +1300, Thomas Munro wrote:
> On Sun, Feb 25, 2024 at 11:14 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> So, there are reasons to keep the current unaccent.rules as it is, but...
>> there are other reasons to add a few lines to it, f.e. after line 955 and
>> insert five greek vowels with Oxia
>> Please add:
>> ά α
>> έ ε
>> ή η
>> ί ι
>> ό ο
>> ύ υ
>> ώ ω
Correct me if I'm wrong of course, but reading a bit on the matter at
[1], letters with Tonos or Oxia are actually equivalent since 1986,
and we only include character with Tonos in our unaccent.rules.
> We don't exactly maintain this list manually, we extract it from
> Unicode source data. Can you see what needs to be adjusted in here to
> achieve that goal?
See commits like e3dd7c06e627 or 59f47fb98dab for some references.
Unfortunately, we've been using as policy to not backpatch any changes
to the in-core rules file, and you can plug in your own file. Saying
that, these additions sound like a natural addition seen from here.
> Perhaps a new range or something like that?
It seems to me that it is a bit more complicated than that, because
Unicode.data decomposes the characters with Oxia as characters with
Tonos, and then characters with Tonos are decomposed with the "base"
alphabet characters + Tonos. We do a recursive lookup at the unicode
table in get_plain_letter() and is_letter_with_marks(), so it seems to
me that we're not missing much, and I suspect that there should be no
need for a new custom range of characters..
Cees, perhaps you would like to get a shot at that?
[1]: https://en.wikipedia.org/wiki/Greek_diacritics#Unicode
--
Michael