Re: Collation version tracking for macOS - Mailing list pgsql-hackers

From Jeremy Schneider
Subject Re: Collation version tracking for macOS
Date
Msg-id bb4e241c-c591-b2dd-0283-9e1ee5a6c7da@ardentperf.com
In response to Re: Collation version tracking for macOS  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Collation version tracking for macOS
Re: Collation version tracking for macOS
List pgsql-hackers
On 6/7/22 12:53 PM, Peter Geoghegan wrote:
> 
> Collations by their very nature are unlikely to change all that much.
> Obviously they can and do change, but the details are presumably
> pretty insignificant to a native speaker. 


This idea does seem to persist. Collation changes are not as frequent as
time zone changes, but collation rules reflect local dialects and
customs, and they change quite regularly for a variety of reasons. A
brief perusal of the CLDR changelogs and CLDR Jira tickets gives some
insight here:

https://github.com/unicode-org/cldr


https://unicode-org.atlassian.net/jira/software/c/projects/CLDR/issues/?jql=project%20%3D%20%22CLDR%22%20AND%20text%20~%20%22collation%22%20ORDER%20BY%20created%20DESC

The difference between the Unicode Consortium and the GNU C Library is
that Unicode is maintained by people who are specifically interested in
working with language and internationalization challenges. I've spoken
to a glibc maintainer who directly told me that they dislike working
with the collation code, and try to avoid it. It's not even ISO 14651
anymore, with so many custom glibc-specific changes layered on top. I
looked at the first few commits in the glibc source that were
responsible for the big 2.28 changes: there was a series of quite a
few commits, and some were so large they wouldn't even load in the
GitHub API.

Here's one such commit:

https://github.com/bminor/glibc/commit/9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4

It's reasonable to expect that Red Hat and Debian will keep things
stable within one particular major version, and to expect that every new
major OS version will update to the latest collation algorithms and
locale data for glibc.

Another misunderstanding that seems to persist is that this only relates
to exotic locales, or that it only affects glibc 2.28.

My GitHub repo is out of date (I know of more cases that I still need to
publish), but the old data already demonstrates changes to the root/DUCET
collation rules (evident in en_US without any tailoring) for glibc
versions 2.13, 2.21 and 2.26:

https://github.com/ardentperf/glibc-unicode-sorting/
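
To illustrate what "a change in collation rules" means in practice, here
is a minimal Python sketch that takes the same strings as sorted by two
library versions and reports which pairs swapped places. This is not
necessarily how the repository above produces its data; the function
name `changed_pairs` and the sample strings are made up for the example.

```python
from itertools import combinations

def changed_pairs(old_order, new_order):
    """Return (a, b) pairs where a sorted before b in old_order
    but sorts after b in new_order. Both lists must contain the
    same distinct strings."""
    new_pos = {s: i for i, s in enumerate(new_order)}
    # combinations() yields pairs in old_order position order,
    # so a always precedes b under the old rules.
    return [(a, b) for a, b in combinations(old_order, 2)
            if new_pos[a] > new_pos[b]]

# Made-up sample data, not real glibc output.
old_sorted = ["a-c", "ab", "ac"]
new_sorted = ["ab", "a-c", "ac"]
swapped = changed_pairs(old_sorted, new_sorted)
```

The pairwise comparison is quadratic, so this is only suitable for small
samples; any non-empty result means data sorted under the old rules is
no longer in order under the new ones.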

If a PostgreSQL user is unlucky enough to have one of those Unicode
characters stored in a table, they can get broken indexes even if they
only use the default US English locale, and without touching glibc 2.28.
All you need is an index on a field where end users can type any
string input.
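
The failure mode can be sketched in a few lines of Python. This is a toy
model, not PostgreSQL's btree code: the "index" is just a sorted list
searched by bisection, and `old_key` / `new_key` are hypothetical
stand-ins for two library versions' comparison rules (they are not real
glibc rules; they differ only in how a hyphen is weighted).

```python
import bisect

def old_key(s):
    # Hypothetical "old" rules: compare raw code points.
    return [ord(c) for c in s]

def new_key(s):
    # Hypothetical "new" rules: ignore hyphens at the first level,
    # then fall back to the raw string as a tie-breaker.
    return ([ord(c) for c in s if c != "-"], s)

def build_index(values, key):
    # The "index" is simply the values in sorted order.
    return sorted(values, key=key)

def index_lookup(index, value, key):
    # Binary search that trusts the index to be ordered by `key`.
    keys = [key(v) for v in index]
    i = bisect.bisect_left(keys, key(value))
    return i < len(index) and index[i] == value

values = ["a-c", "ab", "ac"]
index = build_index(values, old_key)            # built before the upgrade

found_before = index_lookup(index, "a-c", old_key)
found_after = index_lookup(index, "a-c", new_key)   # after the upgrade

reindexed = build_index(values, new_key)        # analogous to a REINDEX
found_reindexed = index_lookup(reindexed, "a-c", new_key)
```

The row is still in the structure after the "upgrade", but the search
walks the wrong way and misses it; rebuilding under the new rules makes
it findable again.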


> It's pretty clear that glibc as a project doesn't take the issue very
> seriously, because they see it as a problem of the GUI sorting a table
> in a way that seems slightly suboptimal to scholars of a natural
> language. 

I disagree that glibc maintainers are doing anything wrong.

While the quality of glibc collations isn't great when compared with
CLDR, I think the glibc maintainers have done versioning exactly right:
they are clear about which patches are allowed to contain collation
updates, and the OS distributions are able to ensure stability within a
major OS release. I haven't yet found a Red Hat minor release that
changed glibc collation.

-Jeremy


-- 
http://about.me/jeremy_schneider


