Re: Built-in CTYPE provider - Mailing list pgsql-hackers

From Daniel Verite
Subject Re: Built-in CTYPE provider
Msg-id d26df384-2fa7-4f50-b703-b0b6706dbeff@manitou-mail.org
In response to Built-in CTYPE provider  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
    Jeff Davis wrote:

> While "full" case mapping sounds more complex, there are actually
> very few cases to consider and they are covered in another (small)
> data file. That data file covers ~100 code points that convert to
> multiple code points when the case changes (e.g. "ß" -> "SS"), 7
> code points that have context-sensitive mappings, and three locales
> which have special conversions ("lt", "tr", and "az") for a few code
> points.

But there are CLDR mappings on top of that.

According to the Unicode FAQ

   https://unicode.org/faq/casemap_charprop.html#5

   Q: Does the default case mapping work for every language? What
   about the default case folding?

   [...]

   To make case mapping language sensitive, the Unicode Standard
   specifically allows implementations to tailor the mappings for
   each language, but does not provide the necessary data. The file
   SpecialCasing.txt is included in the Standard as a guide to a few
   of the more important individual character mappings needed for
   specific languages, notably the Greek script and the Turkic
   languages. However, for most language-specific mappings and
   tailoring, users should refer to CLDR and other resources.

In particular, "el" (Modern Greek) has case mapping rules that
ICU seems to implement, but "el" is missing from the list
("lt", "tr", and "az") you identified.
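
To make the gap concrete, here is a minimal Python sketch. Python's str.upper() and str.lower() apply the default, untailored full case mappings, so they handle "ß" and the final sigma but leave the Greek tonos in place when uppercasing. The helper strip_tonos_upper() is a hypothetical name of mine, and it only approximates what a Greek tailoring does (I am assuming the visible effect is that the tonos is dropped; the real rules are more elaborate, for example around diaeresis), so take it as an illustration rather than as what ICU or CLDR actually implement.

```python
import unicodedata

# Default (locale-independent) full case mapping, as applied by Python's str methods.
eszett_upper = "\u00df".upper()                    # "ß": one code point maps to two
sigma_lower = "\u039f\u0394\u039f\u03a3".lower()   # "ΟΔΟΣ": context-sensitive final sigma

# "άδικος", spelled with escapes so the precomposed U+03AC is unambiguous.
word = "\u03ac\u03b4\u03b9\u03ba\u03bf\u03c2"
default_upper = word.upper()                       # keeps the tonos on the alpha


def strip_tonos_upper(s):
    """Hypothetical helper: uppercase, then drop U+0301 (combining acute).

    This is only a rough approximation of a Greek ("el") tailoring,
    not a reimplementation of the ICU or CLDR rules.
    """
    decomposed = unicodedata.normalize("NFD", s.upper())
    stripped = "".join(c for c in decomposed if c != "\u0301")
    return unicodedata.normalize("NFC", stripped)


tailored_upper = strip_tonos_upper(word)

print(eszett_upper)
print(sigma_lower)
print(default_upper)
print(tailored_upper)
```

The last two lines printed differ only in the first letter: the default mapping yields a capital alpha with tonos, while the approximated tailoring yields a plain capital alpha.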

The CLDR case mappings seem to be found in the *-Lower.xml and
*-Upper.xml files under
https://github.com/unicode-org/cldr/tree/main/common/transforms


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite


