Re: case insensitive collation of Greek's sigma - Mailing list pgsql-general

From Jakub Jedelsky
Subject Re: case insensitive collation of Greek's sigma
Date
Msg-id CAC1JxDQi+z47rdv1szaxyrhAL8-wheZgTggjdj5AQAL4F=xR7w@mail.gmail.com
In response to Re: case insensitive collation of Greek's sigma  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: case insensitive collation of Greek's sigma
List pgsql-general
On Wed, Dec 1, 2021 at 8:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> Running lower() like this is really the wrong thing to do.  We should be
> doing "case folding" instead, which normalizes these differences for the
> purpose of case-insensitive comparisons.

That just begs the question: if tolower (or towlower) isn't the
appropriate API, what is?  Perhaps ICU has something for a more
generalized notion of case-similarity, but I'm not aware of any such
thing in the POSIX API.

BTW, I think it's only accidental that the regex example shown upthread
gets the right answer.  In that example, what's happening is that we
consider a letter in a case-insensitive regex to match itself, or
tolower() of itself, or toupper() of itself.  Both σ and ς have Σ
as toupper() so they both work.  But if you'd written Σ in the regex,
only one of σ and ς would match that as a data character.  (Haven't
actually tested this, but given the way the code works I'm pretty
sure it's so.)  Again, it's hard to see how to do better atop a POSIX
locale library.
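
To make the matching rule described above concrete, here is a minimal sketch in Python. It is not the PostgreSQL regex code: the function name ci_char_match is hypothetical, and Python's str.lower()/str.upper() (simple Unicode mappings for a single character) are assumed stand-ins for tolower()/toupper() in a Greek-aware locale.

```python
# Sketch of the rule "a letter in a case-insensitive regex matches itself,
# tolower() of itself, or toupper() of itself".
# Assumption: Python's per-character Unicode mappings stand in for the
# locale-dependent C library functions.

SIGMA_UPPER = "\u03a3"  # GREEK CAPITAL LETTER SIGMA
SIGMA_LOWER = "\u03c3"  # GREEK SMALL LETTER SIGMA
SIGMA_FINAL = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA


def ci_char_match(pattern_char: str, data_char: str) -> bool:
    """Return True if data_char is pattern_char or its lower/upper mapping."""
    candidates = {pattern_char, pattern_char.lower(), pattern_char.upper()}
    return data_char in candidates


if __name__ == "__main__":
    # Either lowercase sigma in the pattern reaches the capital via upper().
    print(ci_char_match(SIGMA_LOWER, SIGMA_UPPER))
    print(ci_char_match(SIGMA_FINAL, SIGMA_UPPER))
    # The capital in the pattern reaches only one lowercase form via lower().
    print(ci_char_match(SIGMA_UPPER, SIGMA_LOWER))
    print(ci_char_match(SIGMA_UPPER, SIGMA_FINAL))
```

Under these assumptions the asymmetry falls out of the candidate set: the capital's lowercase mapping yields only one of the two small sigmas, so the other never becomes a candidate.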

Thanks for digging into the issue.
 
According to the GNU docs [1], the POSIX APIs are not suited to that. Would it be possible to keep the current lower()-based behaviour as a fallback for POSIX locales and use proper case folding with ICU? I think (I'm not an expert, though) ICU has had working case-folding code for some time.
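
To illustrate the difference I mean, here is a rough sketch in Python rather than PostgreSQL or ICU code. The helper names equal_ci_lower and equal_ci_fold are made up for the example; Python's str.casefold() applies Unicode case folding, which I assume is comparable to what ICU's case-folding functions (for example u_strFoldCase) would give.

```python
# Compare lowercasing with case folding for case-insensitive equality.
# Assumption: Python's Unicode tables stand in for ICU here; this says
# nothing about how PostgreSQL would wire it in.

def equal_ci_lower(a: str, b: str) -> bool:
    """Case-insensitive comparison via lowercasing (the current approach)."""
    return a.lower() == b.lower()


def equal_ci_fold(a: str, b: str) -> bool:
    """Case-insensitive comparison via Unicode case folding."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    upper = "ΟΔΟΣ"       # all capitals
    final = "οδος"       # ends with final sigma (U+03C2)
    non_final = "οδοσ"   # ends with ordinary small sigma (U+03C3)

    # Lowercasing keeps the two small sigmas distinct, so these strings
    # do not all compare equal to each other.
    print(equal_ci_lower(upper, non_final), equal_ci_lower(final, non_final))

    # Case folding maps final sigma to ordinary sigma, so all three agree.
    print(equal_ci_fold(upper, non_final), equal_ci_fold(final, non_final))
```

So folding gives one comparison key for all three spellings, while lowercasing leaves two.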

"""
Text files are nowadays usually encoded in Unicode, and may consist of very different scripts – from Latin letters to Chinese Hanzi –, with many kinds of special characters – accents, right-to-left writing marks, hyphens, Roman numbers, and much more. But the POSIX platform APIs for text do not contain adequate functions for dealing with particular properties of many Unicode characters. In fact, the POSIX APIs for text have several assumptions at their base which don't hold for Unicode text.
"""
