Re: pg_collation.collversion for C.UTF-8 - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: pg_collation.collversion for C.UTF-8
Msg-id CA+hUKGKr-b33uw_3nUEa80afT0RKy0D+oo41ztRLyuby4oQX8g@mail.gmail.com
In response to Re: pg_collation.collversion for C.UTF-8  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Sat, Jun 17, 2023 at 10:03 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2023-06-15 at 19:15 +1200, Thomas Munro wrote:
> > Hmm, OK let's explore that.  What could we do that would be helpful
> > here, without affecting users of the "true" C.UTF-8 for the rest of
> > time?
>
> Where is the "true" C.UTF-8 defined?

By "true" I just meant glibc's official one, in contrast to the
imposter from Debian oldstable's patches.  It's not defined by any
standard, but we only know how to record versions for glibc, FreeBSD
and Windows, and we know what the first two of those do for that
locale because they tell us (see below).  For Windows, the manual's
BNF-style description of acceptable strings doesn't appear to accept
C.UTF-8 (but I haven't tried it).

> I assume you mean that the collation order can't (shouldn't, anyway)
> change. But what about the ctype (upper/lower/initcap) behavior? Is
> that also locked down for all time, or could it change if some new
> unicode characters are added?

Fair point.  Considering that our collversion effectively functions as
a proxy for ctype version too, Daniel's patch makes a certain amount
of sense.

Our versioning is nominally based only on the collation category, not
locales more generally or any other category they contain (nominally,
as in: we named it collversion, and our code and comments and
discussions so far only contemplated collations in this context).
But, clearly, changes to the underlying ctype data could also
invalidate a constraint such as CHECK (x ~ '[[:digit:]]') or corrupt a
partial index with WHERE (upper(x) <> 'ẞ').  I'd considered that a
separate topic, but Daniel's patch would cover it with the same
mechanism.  (Actually I just learned that [[:digit:]] is a bad example
on a glibc system, because they appear to have hardcoded a test for
[0-9] into their iswdigit_l() implementation, but FreeBSD gives the
Unicode answer, which is subject to change, and other classes may work
better on glibc.)
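
To illustrate the difference between a hardcoded [0-9] test and a
table-driven one, here is a rough sketch in Python.  It is only an
analogy: Python's unicodedata tables stand in for a libc's ctype data,
and the function names is_ascii_digit and is_unicode_digit are made up
for this example, not taken from glibc or FreeBSD.

```python
import unicodedata


def is_ascii_digit(ch: str) -> bool:
    """Hardcoded [0-9] test: the answer never depends on Unicode data."""
    return "0" <= ch <= "9"


def is_unicode_digit(ch: str) -> bool:
    """Table-driven test: the answer depends on the Unicode data in use."""
    return unicodedata.category(ch) == "Nd"


# ASCII 7, ARABIC-INDIC DIGIT THREE (U+0663), and a plain letter.
samples = ["7", "\u0663", "x"]

# Map each sample to (hardcoded answer, table-driven answer).
results = {ch: (is_ascii_digit(ch), is_unicode_digit(ch)) for ch in samples}

# The table-driven answers are tied to this Unicode data version.
print("Unicode data version:", unicodedata.unidata_version)
for ch, (ascii_answer, unicode_answer) in results.items():
    print(f"U+{ord(ch):04X} ascii={ascii_answer} unicode={unicode_answer}")
```

Only the table-driven answer can change when the Unicode data is
updated, which is the kind of change a stored version would need to
catch.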

> Would it be correct to interpret LC_COLLATE=C.UTF-8 as LC_COLLATE=C,
> but leave LC_CTYPE=C.UTF-8 as-is?

Yes.  The basic idea, at least for these two OSes, is that every
category behaves as if set to C, except LC_CTYPE.  For implementation
reasons the glibc people don't quite describe it that way[1]: for
LC_COLLATE, they decode to codepoints first and then compare those
using a new codepath they had to write for release 2.35, while FreeBSD
skips that useless step and compares raw UTF-8 bytes like
LC_COLLATE=C[2].  Which is the same, as RFC 3629 tells us:

   o  The byte-value lexicographic sorting order of UTF-8 strings is the
      same as if ordered by character numbers.  Of course this is of
      limited interest since a sort order based on character numbers is
      almost never culturally valid.
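
That property is easy to check for a handful of strings.  The
following is a small sketch, not a proof: the sample words are
arbitrary, and the helper names codepoint_key and utf8_key are invented
for illustration.

```python
def codepoint_key(s: str) -> list[int]:
    """Sort key: the sequence of Unicode code points (character numbers)."""
    return [ord(c) for c in s]


def utf8_key(s: str) -> bytes:
    """Sort key: the raw UTF-8 bytes, compared like memcmp() would."""
    return s.encode("utf-8")


# Mix of 1-, 2-, 3- and 4-byte UTF-8 sequences, plus a shared prefix.
words = ["z", "\u00e9", "a", "\u20ac", "ab", "\U00010348", "Z"]

by_codepoint = sorted(words, key=codepoint_key)
by_utf8_bytes = sorted(words, key=utf8_key)

# Both orderings agree, so decoding first gains nothing for sorting.
assert by_codepoint == by_utf8_bytes
print([f"{w!r}" for w in by_utf8_bytes])
```

The same comparison would not hold for an encoding such as UTF-16,
where surrogate pairs sort below some BMP characters, so the shortcut is
specific to UTF-8.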

It is interesting to note that LC_COLLATE=C, LC_CTYPE=C.UTF-8 is
equivalent to using C.UTF-8 for both, but would not get version
warnings with Daniel's patch, which reveals that collversion is only a
proxy.  But recording the ctype version separately would be excessive.

For completeness, Solaris also has C.UTF-8.  I can't read about what
it does because the release notes are behind a signup wall.  *shrug*
I can't find any other systems that have it.

[1] https://sourceware.org/glibc/wiki/Proposals/C.UTF-8
[2] https://reviews.freebsd.org/D17833


