Re: Collation version tracking for macOS - Mailing list pgsql-hackers
From | Jeremy Schneider |
---|---|
Subject | Re: Collation version tracking for macOS |
Date | |
Msg-id | bb4e241c-c591-b2dd-0283-9e1ee5a6c7da@ardentperf.com |
In response to | Re: Collation version tracking for macOS (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: Collation version tracking for macOS
Re: Collation version tracking for macOS |
List | pgsql-hackers |
On 6/7/22 12:53 PM, Peter Geoghegan wrote:
>
> Collations by their very nature are unlikely to change all that much.
> Obviously they can and do change, but the details are presumably
> pretty insignificant to a native speaker.

This idea does seem to persist. Collation changes are not as frequent as timezone changes, but collation rules reflect local dialects and customs, and there are changes quite regularly for a variety of reasons. A brief perusal of CLDR changelogs and CLDR jiras can give some insight here:

https://github.com/unicode-org/cldr
https://unicode-org.atlassian.net/jira/software/c/projects/CLDR/issues/?jql=project%20%3D%20%22CLDR%22%20AND%20text%20~%20%22collation%22%20ORDER%20BY%20created%20DESC

The difference between the Unicode Consortium and the GNU C Library is that Unicode is maintained by people who are specifically interested in working with language and internationalization challenges. I've spoken to a glibc maintainer who directly told me that they dislike working with the collation code, and try to avoid it. It's not even ISO 14651 anymore with so many custom glibc-specific changes layered on top. I looked at the first few commits in the glibc source that were responsible for the big 2.28 changes - there was a series of quite a few commits and some were so large they wouldn't even load in the GitHub API. Here's one such commit:

https://github.com/bminor/glibc/commit/9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4

It's reasonable to expect that Red Hat and Debian will keep things stable on one particular major, and to expect that every new major OS version will update to the latest collation algorithms and locale data for glibc.

Another misunderstanding that seems to persist is that this only relates to exotic locales or that it's only the 2.28 version.
My GitHub repo is out-of-date (I know of more cases that I still need to publish) but the old data already demonstrates changes to the root/DUCET collation rules (evident in en_US without any tailoring) for glibc versions 2.13, 2.21 and 2.26:

https://github.com/ardentperf/glibc-unicode-sorting/

If a PostgreSQL user is unlucky enough to have one of those Unicode characters stored in a table, they can get broken indexes even if they only use the default US English locale, and without touching glibc 2.28 - and all you need is an index on a field where end users can type any string input.

> It's pretty clear that glibc as a project doesn't take the issue very
> seriously, because they see it as a problem of the GUI sorting a table
> in a way that seems slightly suboptimal to scholars of a natural
> language.

I disagree that glibc maintainers are doing anything wrong. While the quality of glibc collations isn't great when compared with CLDR, I think the glibc maintainers have done versioning exactly right: they are clear about which patches are allowed to contain collation updates, and the OS distributions are able to ensure stability within a major OS release. I haven't yet found a Red Hat minor release that changed glibc collation.

-Jeremy

-- 
http://about.me/jeremy_schneider
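
To illustrate the broken-index point in the message above, here is a minimal Python sketch, not PostgreSQL code. It treats a sorted list as a stand-in for a B-tree index and a binary search as a stand-in for an index lookup. The two key functions (`old_key`, `new_key`) are hypothetical collations invented for this example; real glibc changes affected specific Unicode characters rather than letter case, so the data here is an assumption chosen only to make the effect visible.

```python
import bisect

def old_key(s):
    # Hypothetical "old" collation: plain code point order.
    return s

def new_key(s):
    # Hypothetical "new" collation: case-insensitive first,
    # with the original string as a tiebreak.
    return (s.casefold(), s)

def build_index(values, key):
    # Stand-in for building an index: store values in collation order.
    return sorted(values, key=key)

def is_sorted(index, key):
    # True if the stored order still agrees with the given collation.
    return all(key(a) <= key(b) for a, b in zip(index, index[1:]))

def index_lookup(index, value, key):
    # Stand-in for an index lookup: binary search that trusts the
    # stored order to match the collation used for comparisons.
    keys = [key(v) for v in index]
    i = bisect.bisect_left(keys, key(value))
    return i < len(index) and index[i] == value

values = ["apple", "Banana", "cherry", "Date", "elder"]

# The index is built while the old collation is in effect.
index = build_index(values, old_key)

# Lookups work as long as comparisons use the same collation.
print(index_lookup(index, "apple", old_key))

# After a collation change the stored order is stale, and the
# binary search can miss a value that is present in the list.
print(is_sorted(index, new_key))
print(index_lookup(index, "apple", new_key))

# Rebuilding under the new collation restores correct lookups.
print(index_lookup(build_index(values, new_key), "apple", new_key))
```

In this sketch the value "apple" never leaves the list, yet the lookup under the new key fails to find it because the search goes down the wrong half. Rebuilding the list under the new key fixes it, which is the analog of running REINDEX after a collation change.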