Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values - Mailing list pgsql-bugs

From Peter Geoghegan
Subject Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values
Date
Msg-id CAH2-Wzn0idkTAqz5xpSC_AiiyBVaZTKMQfzqsyQPkxh8TSP0yA@mail.gmail.com
In response to Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values (Peter Geoghegan <pg@bowt.ie>)
List pgsql-bugs
On Thu, Aug 17, 2017 at 6:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> My argument for doing this is very simple: ICU/CLDR/BCP 47 provides
> stability guarantees for locales, not collations [1]. For example, as
> we discussed, de_BE didn't actually go away -- it just stopped being a
> distinct collation within ICU, for reasons that are implementation
> defined.

I have data to back this up. I attach 2 files: one is a listing of
locale XML files from within CLDR 1.9's ./common/main/, dating from
December 2010, and the other is a similar listing for CLDR 31, dating
from April 2017. This roughly covers every ICU version we'll support
on day 1. The listing is sorted alphabetically, to ease comparison.

Summary:

$ cat locale_list_cldr-19.txt | wc -l
605
$ cat locale_list_cldr-31.txt | wc -l
722
$ diff -d -u locale_list_cldr-19.txt locale_list_cldr-31.txt | grep "^-[a-zA-Z]" | wc -l
144
$ diff -d -u locale_list_cldr-19.txt locale_list_cldr-31.txt | grep "^+[a-zA-Z]" | wc -l
261

So, there have been 144 locales removed in that time, and 261 added.
My proposal to standardize on using all locales ICU makes available,
rather than all behaviorally distinct collations, clearly does not
ensure perfect stability. It does actually work pretty well in
practice, though. The number 144 is misleadingly high. If you actually
look at what went away in detail, it looks like there are a lot of
script variants of the same language/country code. Plus, the changes
themselves are non-technical in nature.
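
As a sanity check, the figures above are at least self-consistent:
605 - 144 + 261 = 722. Below is a minimal Python sketch of the same
comparison done with sets instead of diff. It assumes each listing has
one unique name per line; the function names and the sample names are
made up for illustration and are not taken from the attached files.

```python
def compare_locale_lists(old, new):
    """Return (removed, added) as sorted lists of locale names."""
    old_set, new_set = set(old), set(new)
    removed = sorted(old_set - new_set)  # present only in the older listing
    added = sorted(new_set - old_set)    # present only in the newer listing
    return removed, added


def counts_are_consistent(old_total, removed, added, new_total):
    """Check that old_total - removed + added equals new_total."""
    return old_total - removed + added == new_total


# Illustrative input only (hypothetical names, not the real listings).
old_names = ["de", "de_BE", "sr_Cyrl_CS"]
new_names = ["de", "de_BE", "sr_Cyrl_RS"]
removed, added = compare_locale_lists(old_names, new_names)
print(removed, added)

# The four counts reported in the summary above.
print(counts_are_consistent(605, 144, 261, 722))
```

With the real listings, the two lists would be read from
locale_list_cldr-19.txt and locale_list_cldr-31.txt, one name per line.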

The churn seems to be in part due to geopolitical changes, such as 5
years [1] passing after the dissolution of Serbia and Montenegro.
However, it is mostly due to switching from ISO 639-1 to ISO 639-3
codes in cases where a finer distinction about cultural preferences
needed to be made (note that they still only list *macro*
language/region/script combinations as distinct collations). For
example, Kurdish went from being "ku-" to 3 different macro languages:
"ckb-" (Central Kurdish), "kmr-" (Northern Kurdish), and "sdh-"
(Southern Kurdish). Wikipedia says of ISO 639-3: "Because it provides
comprehensive language coverage, giving equal opportunity for all
languages, and because of its wide adoption in information
technologies, ISO 639-3 provides an important technology component
addressing the digital divide problem". We can hope that it will be
the last such revision ever needed, because this digital divide
problem is solved once and for all, at least as far as these standards
go.

CLDR prefers to use ISO 639-1 language codes for compatibility [2],
which is why the language codes are mostly still 2 letters (ISO
639-1). "en" did not change to "eng", because there was no cultural
reason to do so, and thus there was a 1:1 mapping between "en" and
"eng" anyway. Regions/countries will only change due to rare
geopolitical events.
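
To see how a listing breaks down by language code length and by
script, the names can be split into subtags. This is only a rough
sketch: it assumes names of the form language[_Script][_REGION], as in
the XML file names, it deliberately does not handle variants, and the
helper names are hypothetical.

```python
import re

# language: 2 or 3 letters; script: 4 letters; region: 2 letters or 3 digits.
# Names with variants or other extra subtags are not matched.
_LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:[_-](?P<script>[A-Za-z]{4}))?"
    r"(?:[_-](?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def split_locale(name):
    """Return (language, script, region), or None if the name does not fit."""
    match = _LOCALE_RE.match(name)
    if match is None:
        return None
    return (match.group("language").lower(),
            match.group("script"),
            match.group("region"))


def language_code_kind(name):
    """Classify a name by the length of its language subtag."""
    parts = split_locale(name)
    if parts is None:
        return "unparsed"
    return "2-letter" if len(parts[0]) == 2 else "3-letter"


def script_insensitive_key(name):
    """Return (language, region), so script variants compare as equal."""
    parts = split_locale(name)
    if parts is None:
        return None
    return (parts[0], parts[2])


for sample in ["de_BE", "sr_Cyrl_RS", "ckb", "en_US_POSIX"]:
    print(sample, split_locale(sample), language_code_kind(sample))
```

Grouping removed and added names by script_insensitive_key would show
which removals are only script variants of a language/region pair that
is still present.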

In summary, I think that these changes are fairly low impact in
practice, and are entirely explainable by political changes and
cultural controversies. They really are minimal, because CLDR/ICU
really does take the stability of collation names seriously. We can
and should ensure that locales like "de_BE" are available in every ICU
version, because their absence would be an inexcusable technical
oversight, not the result of a cultural or political issue.

[1] http://cldr.unicode.org/index/process/cldr-data-retention-policy
[2] http://www.unicode.org/reports/tr35/#unicode_language_subtag_validity
-- 
Peter Geoghegan
