Re: case insensitive collation of Greek's sigma - Mailing list pgsql-general

From	Jakub Jedelsky
Subject	Re: case insensitive collation of Greek's sigma
Date	December 2, 2021 13:26:39
Msg-id	CAC1JxDQi+z47rdv1szaxyrhAL8-wheZgTggjdj5AQAL4F=xR7w@mail.gmail.com
In response to	Re: case insensitive collation of Greek's sigma (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: case insensitive collation of Greek's sigma
List	pgsql-general

On Wed, Dec 1, 2021 at 8:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> > Running lower() like this is really the wrong thing to do. We should be
> > doing "case folding" instead, which normalizes these differences for the
> > purpose of case-insensitive comparisons.
>
> That just begs the question: if tolower (or towlower) isn't the
> appropriate API, what is? Perhaps ICU has something for a more
> generalized notion of case-similarity, but I'm not aware of any such
> thing in the POSIX API.
>
> BTW, I think it's only accidental that the regex example shown upthread
> gets the right answer. In that example, what's happening is that we
> consider a letter in a case-insensitive regex to match itself, or
> tolower() of itself, or toupper() of itself. Both σ and ς have Σ
> as toupper() so they both work. But if you'd written Σ in the regex,
> only one of σ and ς would match that as a data character. (Haven't
> actually tested this, but given the way the code works I'm pretty
> sure it's so.) Again, it's hard to see how to do better atop a POSIX
> locale library.
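
The matching rule described above (a pattern letter accepts itself, its tolower() and its toupper()) can be sketched roughly as follows. This is only an illustration, not PostgreSQL's regex code: Python's per-character str.lower() and str.upper() stand in for towlower()/towupper(), and the names ci_match_set and ci_matches are made up for the example.

```python
# Sketch only: Python's per-character str.lower()/str.upper() are assumed
# stand-ins for towlower()/towupper(); this is not PostgreSQL's regex code.

SIGMA_UPPER = "\u03a3"  # Σ GREEK CAPITAL LETTER SIGMA
SIGMA_LOWER = "\u03c3"  # σ GREEK SMALL LETTER SIGMA
SIGMA_FINAL = "\u03c2"  # ς GREEK SMALL LETTER FINAL SIGMA


def ci_match_set(pattern_char):
    """Characters accepted by one letter of a case-insensitive pattern:
    the letter itself, its lowercase mapping and its uppercase mapping."""
    return {pattern_char, pattern_char.lower(), pattern_char.upper()}


def ci_matches(pattern_char, data_char):
    """True if data_char is accepted by pattern_char under the rule above."""
    return data_char in ci_match_set(pattern_char)


if __name__ == "__main__":
    sigmas = (SIGMA_UPPER, SIGMA_LOWER, SIGMA_FINAL)
    for pattern in sigmas:
        accepted = [d for d in sigmas if ci_matches(pattern, d)]
        print(pattern, "accepts", accepted)
```

Under these assumptions, σ and ς in the pattern each accept Σ as data, but Σ in the pattern accepts only σ, because the lowercase mapping of Σ yields a single character.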

Thanks for digging into the issue.

Based on the GNU docs [1], the POSIX APIs are not ready for that. Anyway, would it be possible to keep the current lowercase-based behaviour as a fallback for POSIX and use proper case folding with ICU? I think (not an expert, though) that working code for case folding has been available there for some time.
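
To show the difference I mean, here is a small sketch comparing a lowercase-based comparison with a case-folding one. It uses Python rather than ICU or PostgreSQL code: str.casefold() stands in for Unicode case folding, lowercasing is applied one character at a time to imitate a towlower()-style API, and the helper names are made up for the example.

```python
# Sketch only: lowercasing one character at a time is an assumed stand-in for
# a towlower()-style API; str.casefold() stands in for Unicode case folding.

def lower_per_char(s):
    """Lowercase each character independently, with no word context."""
    return "".join(ch.lower() for ch in s)


def equal_by_lower(a, b):
    """Case-insensitive comparison built on per-character lowercasing."""
    return lower_per_char(a) == lower_per_char(b)


def equal_by_casefold(a, b):
    """Case-insensitive comparison built on case folding."""
    return a.casefold() == b.casefold()


UPPER_WORD = "\u039f\u0394\u039f\u03a3"  # ΟΔΟΣ
LOWER_WORD = "\u03bf\u03b4\u03bf\u03c2"  # οδος, ending in final sigma ς

if __name__ == "__main__":
    print("lower-based:   ", equal_by_lower(UPPER_WORD, LOWER_WORD))
    print("casefold-based:", equal_by_casefold(UPPER_WORD, LOWER_WORD))
```

With per-character lowercasing, Σ becomes σ while ς stays ς, so the two spellings compare as different; case folding maps both Σ and ς to σ, so they compare as equal.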

[1] https://www.gnu.org/software/libunistring/
"""

Text files are nowadays usually encoded in Unicode, and may consist of very different scripts – from Latin letters to Chinese Hanzi –, with many kinds of special characters – accents, right-to-left writing marks, hyphens, Roman numbers, and much more. But the POSIX platform APIs for text do not contain adequate functions for dealing with particular properties of many Unicode characters. In fact, the POSIX APIs for text have several assumptions at their base which don't hold for Unicode text.

"""
