Re: pg_collation.collversion for C.UTF-8 - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: pg_collation.collversion for C.UTF-8 |
Date | |
Msg-id | CA+hUKGKr-b33uw_3nUEa80afT0RKy0D+oo41ztRLyuby4oQX8g@mail.gmail.com |
In response to | Re: pg_collation.collversion for C.UTF-8 (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: pg_collation.collversion for C.UTF-8
|
List | pgsql-hackers |
On Sat, Jun 17, 2023 at 10:03 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2023-06-15 at 19:15 +1200, Thomas Munro wrote:
> > Hmm, OK let's explore that.  What could we do that would be helpful
> > here, without affecting users of the "true" C.UTF-8 for the rest of
> > time?
>
> Where is the "true" C.UTF-8 defined?

By "true" I just meant glibc's official one, in contrast to the
imposter from Debian oldstable's patches.  It's not defined by any
standard, but we only know how to record versions for glibc, FreeBSD
and Windows, and we know what the first two of those do for that
locale because they tell us (see below).  For Windows, the manual's
BNF-style description of acceptable strings doesn't appear to accept
C.UTF-8 (but I haven't tried it).

> I assume you mean that the collation order can't (shouldn't, anyway)
> change. But what about the ctype (upper/lower/initcap) behavior? Is
> that also locked down for all time, or could it change if some new
> unicode characters are added?

Fair point.  Considering that our collversion effectively functions as
a proxy for ctype version too, Daniel's patch makes a certain amount
of sense.

Our versioning is nominally based only on the collation category, not
locales more generally or any other category they contain (nominally,
as in: we named it collversion, and our code and comments and
discussions so far only contemplated collations in this context).
But, clearly, changes to underlying ctype data could also cause a
constraint CHECK (x ~ '[[:digit:]]') or a partial index with WHERE
(upper(x) <> 'ẞ') to be corrupted, which I'd considered to be a
separate topic, but Daniel's patch would cover with the same
mechanism.  (Actually I just learned that [[:digit:]] is a bad example
on a glibc system, because they appear to have hardcoded a test for
[0-9] into their iswdigit_l() implementation, but FreeBSD gives the
Unicode answer, which is subject to change, and other classes may work
better on glibc.)
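
To make the [[:digit:]] point concrete, here is a minimal sketch in
Python (not PostgreSQL or libc code) that contrasts the two behaviours
described above: a hardcoded [0-9] test versus a classification driven
by Unicode data.  The function names are hypothetical, and Python's
str.isdecimal() is only standing in for a Unicode-aware iswdigit_l();
its answers depend on the Unicode tables shipped with the interpreter,
which is exactly the kind of data that can change between releases.

```python
import re

# Hypothetical stand-in for a hardcoded [0-9] test.
ASCII_DIGITS = re.compile(r"[0-9]+")


def is_ascii_digits(s):
    """True if s is non-empty and consists only of ASCII 0-9."""
    return ASCII_DIGITS.fullmatch(s) is not None


def is_unicode_digits(s):
    """True if s is non-empty and every character is a Unicode decimal digit.

    Assumption: str.isdecimal() models a Unicode-driven digit class; it is
    not the same code as any libc implementation.
    """
    return s != "" and all(ch.isdecimal() for ch in s)


# ARABIC-INDIC DIGITS ONE, TWO, THREE (U+0661..U+0663)
arabic_indic = "\u0661\u0662\u0663"

ascii_answer = is_ascii_digits(arabic_indic)      # hardcoded-range view
unicode_answer = is_unicode_digits(arabic_indic)  # Unicode-data view
```

A value like the one above would pass a digit check under one
definition and fail it under the other, which is how a CHECK
constraint or partial index predicate could silently stop matching the
stored data.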

> Would it be correct to interpret LC_COLLATE=C.UTF-8 as LC_COLLATE=C,
> but leave LC_CTYPE=C.UTF-8 as-is?

Yes.  The basic idea, at least for these two OSes, is that every
category behaves as if set to C, except LC_CTYPE.  For implementation
reasons the glibc people don't quite describe it that way[1]: for
LC_COLLATE, they decode to codepoints first and then compare those
using a new codepath they had to write for release 2.35, while FreeBSD
skips that useless step and compares raw UTF-8 bytes like
LC_COLLATE=C[2].  Which is the same, as RFC 3629 tells us:

   o  The byte-value lexicographic sorting order of UTF-8 strings is the
      same as if ordered by character numbers.  Of course this is of
      limited interest since a sort order based on character numbers is
      almost never culturally valid.

It is interesting to note that LC_COLLATE=C, LC_CTYPE=C.UTF-8 is
equivalent, but would not get version warnings with Daniel's patch,
revealing that it's only a proxy.  But recording ctype version
separately would be excessive.

For completeness, Solaris also has C.UTF-8.  I can't read about what
it does, the release notes are behind a signup thing.  *shrug*  I
can't find any other systems that have it.

[1] https://sourceware.org/glibc/wiki/Proposals/C.UTF-8
[2] https://reviews.freebsd.org/D17833
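
The claim that comparing raw UTF-8 bytes gives the same order as
comparing code points can be spot-checked with a small sketch.  This
is Python, not the glibc or FreeBSD code; the helper names are
hypothetical, and it relies on Python comparing str values by code
point and bytes values byte by byte.

```python
import random


def random_scalar(rng):
    """Pick a random Unicode scalar value.

    Surrogates (U+D800..U+DFFF) are skipped because they cannot be
    encoded as UTF-8.
    """
    while True:
        cp = rng.randrange(0x110000)
        if not 0xD800 <= cp <= 0xDFFF:
            return cp


def random_string(rng, max_len=6):
    """Build a short random string, possibly empty."""
    length = rng.randrange(max_len + 1)
    return "".join(chr(random_scalar(rng)) for _ in range(length))


def orders_agree(strings):
    """Compare code point order with UTF-8 byte order for the given strings."""
    by_codepoint = sorted(strings)  # str comparison is by code point
    by_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
    return by_codepoint == by_bytes


rng = random.Random(0)  # fixed seed so the run is repeatable
sample = [random_string(rng) for _ in range(2000)]
agree = orders_agree(sample)
```

A randomized check like this is not a proof, only a sanity test of the
property the RFC states; the property itself follows from how UTF-8
assigns lead bytes and continuation bytes.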