Re: Collation version tracking for macOS - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Collation version tracking for macOS
Date
Msg-id CA+hUKGJtmxV43_zjRdJxxEzpAZoQ5BUhzM2N9_Njh85oTt564g@mail.gmail.com
In response to Re: Collation version tracking for macOS  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Collation version tracking for macOS
List pgsql-hackers
On Wed, Nov 30, 2022 at 1:32 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2022-11-30 at 10:29 +1300, Thomas Munro wrote:
> > On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis <pgsql@j-davis.com> wrote:
> > > Here's what I found for the 'ar' locale (firstminor/lastminor are the
> > > icu library versions, firstcollversion/lastcollversion are their
> > > respective collation versions for the given locale):
> > >
> > >  firstminor | lastminor | firstcollversion | lastcollversion
> > > ------------+-----------+------------------+-----------------
> > >  60.1       | 60.3      | 153.80.32        | 153.80.32.1
> > >  64.1       | 64.2      | 153.96.35        | 153.97.35.8
> > >  68.1       | 68.2      | 153.14.38        | 153.14.38.8
> > > (3 rows)
> >
> > Right, this fits with what I said earlier: the third component is CLDR
> > major, fourth component is CLDR minor except from ICU 61 on the CLDR
> > minor is << 3'd (X.X.38.8 means CLDR 38.1).
>
> What about 64.1 -> 64.2? That changed the *second* component from 96 ->
> 97. Are we agreed that collations can materially change in minor ICU
> releases?

That means that the Unicode/UCA version switched from 12 to 12.1, so
that's a confirmed sighting of a UCA minor version bump within one ICU
major version.  Let's see what the purpose of that Unicode minor
release was[1]:

"Unicode 12.1 adds exactly one character, for a total of 137,929 characters.

The new character added to Version 12.1 is:

U+32FF SQUARE ERA NAME REIWA

Version 12.1 adds that single character to enable software to be
rapidly updated to support the new Japanese era name in calendrical
systems and date formatting. The new Japanese era name was officially
announced on April 1, 2019, and is effective as of May 1, 2019."

Wow!
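
For what it's worth, here is a rough Python sketch of the decoding discussed above. It assumes the second component packs the Unicode/UCA version as (major << 3) | minor, which is only inferred from the 80, 96 and 97 values in the quoted table (the 14 in the ICU 68 row does not fit that pattern, so don't trust the output for it), and that the CLDR minor is shifted left by 3 from ICU 61 on. The name decode_collversion and its icu_major parameter are made up for illustration; this is not an ICU or PostgreSQL API.

```python
def decode_collversion(version, icu_major):
    """Split an ICU collation version string into guessed UCA/CLDR versions.

    Assumption (inferred from the thread, not from ICU documentation):
      - 2nd component = (UCA major << 3) | UCA minor
      - 3rd component = CLDR major
      - 4th component = CLDR minor, stored << 3 from ICU 61 onwards
    """
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (4 - len(parts))          # "153.96.35" has no 4th component
    _, uca_packed, cldr_major, cldr_minor_raw = parts

    uca = (uca_packed >> 3, uca_packed & 0x7)
    cldr_minor = cldr_minor_raw >> 3 if icu_major >= 61 else cldr_minor_raw
    return {"uca": uca, "cldr": (cldr_major, cldr_minor)}


if __name__ == "__main__":
    # The ICU 64.1 -> 64.2 rows from the quoted table.
    print(decode_collversion("153.96.35", 64))
    print(decode_collversion("153.97.35.8", 64))
```

Under those assumptions the 64.1 row decodes to UCA 12.0 and the 64.2 row to UCA 12.1, which matches the reading above.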

Wikipedia says[2] "the 'rei' character 令 has never appeared before".

The sort order of characters that didn't previously exist is a special
topic.  In theory they can't hurt you, because you shouldn't have been
using them.  But PostgreSQL doesn't enforce that (other systems do), so
you could be exposed to a change: from whatever default ordering the
non-existent codepoint had for random implementation reasons, to some
deliberate ordering that may or may not be the same.
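
To make "shouldn't have been using them" concrete, here is a minimal Python sketch of what rejecting unassigned codepoints could look like. It is only an illustration, not how PostgreSQL or any other system does it. The helper name unassigned is hypothetical; unicodedata and its frozen Unicode 3.2 database ucd_3_2_0 are part of the Python standard library. The default database version depends on the Python release in use.

```python
import unicodedata


def unassigned(text, ucd=unicodedata):
    """Return codepoints in text that the given Unicode database does not
    assign (general category 'Cn')."""
    return [f"U+{ord(ch):04X}" for ch in text if ucd.category(ch) == "Cn"]


if __name__ == "__main__":
    sample = "a\u32ff"  # U+32FF SQUARE ERA NAME REIWA, added in Unicode 12.1

    # Under the old Unicode 3.2 database the codepoint does not exist yet.
    print(unassigned(sample, unicodedata.ucd_3_2_0))

    # Under the interpreter's own database the result depends on which
    # Unicode version that Python release ships with.
    print(unicodedata.unidata_version, unassigned(sample))
```

A system that stored the string while running the older database would have indexed a codepoint with no deliberate sort order, which is the exposure described above.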

Are all Unicode/UCA minor versions of that type?  I dunno.  Something
to research, but [3] is far too vague and [4] is about other problems.

[1] https://unicode.org/versions/Unicode12.1.0/
[2] https://en.wikipedia.org/wiki/Reiwa
[3] https://www.unicode.org/versions/#major_minor
[4] https://www.unicode.org/policies/stability_policy.html


