Jeff Davis wrote:
> While "full" case mapping sounds more complex, there are actually
> very few cases to consider and they are covered in another (small)
> data file. That data file covers ~100 code points that convert to
> multiple code points when the case changes (e.g. "ß" -> "SS"), 7
> code points that have context-sensitive mappings, and three locales
> which have special conversions ("lt", "tr", and "az") for a few code
> points.

But there are CLDR mappings on top of that.

According to the Unicode FAQ
https://unicode.org/faq/casemap_charprop.html#5

  Q: Does the default case mapping work for every language? What
  about the default case folding?

  [...]

  To make case mapping language sensitive, the Unicode Standard
  specifically allows implementations to tailor the mappings for
  each language, but does not provide the necessary data. The file
  SpecialCasing.txt is included in the Standard as a guide to a few
  of the more important individual character mappings needed for
  specific languages, notably the Greek script and the Turkic
  languages. However, for most language-specific mappings and
  tailoring, users should refer to CLDR and other resources.

In particular, "el" (Modern Greek) has case mapping rules that
ICU seems to implement, but "el" is missing from the list of
locales ("lt", "tr", and "az") you identified.
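
To make the gap concrete, here is a rough sketch in Python. It assumes
that Python's str.upper() and str.lower() are a fair stand-in for the
untailored mappings (they take no locale argument); it says nothing about
how PostgreSQL or ICU implement this. The variable names are only for
illustration.

```python
# Sketch only: Python's built-in case methods are used here as a stand-in
# for default (untailored) Unicode case mapping. They take no locale.

# One code point expanding to two when uppercased.
sharp_s_upper = "ß".upper()

# Context-sensitive mapping: a capital sigma at the end of a word
# lowercases to final sigma (U+03C2), elsewhere to U+03C3.
sigma_lower = "ΣΑΣ".lower()

# Greek alpha with tonos: the default mapping keeps the accent and
# returns U+0386. As far as I understand, an "el" tailoring would be
# expected to drop it; that tailored result is not computed here.
greek_upper = "ά".upper()

# No Turkic tailoring: "i" maps to plain "I", not to dotted capital "İ".
i_upper = "i".upper()

for label, value in [
    ("sharp s, upper", sharp_s_upper),
    ("sigma, lower", sigma_lower),
    ("alpha tonos, upper", greek_upper),
    ("i, upper", i_upper),
]:
    # Show the code points so that accents and final sigma are visible.
    print(label, value, [f"U+{ord(c):04X}" for c in value])
```

The third case is the one that matters for "el": the multi-code-point and
context-sensitive cases are handled without any locale, but nothing in the
untailored mapping touches the accent on the Greek capital.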

The CLDR case mappings seem to be in the *-Lower.xml and *-Upper.xml
files under
https://github.com/unicode-org/cldr/tree/main/common/transforms

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite