Thread: pg_collation.collversion for C.UTF-8
Hi, get_collation_actual_version() in pg_locale.c currently excludes C.UTF-8 (and more generally C.*) from versioning, which makes pg_collation.collversion being empty for these collations. char * get_collation_actual_version(char collprovider, const char *collcollate) { .... if (collprovider == COLLPROVIDER_LIBC && pg_strcasecmp("C", collcollate) != 0 && pg_strncasecmp("C.", collcollate, 2) != 0 && pg_strcasecmp("POSIX", collcollate) != 0) This seems to be based on the idea that C.* collations provide an immutable sort like "C", but it appears that it's not the case. For instance, consider how these C.UTF-8 comparisons differ between recent linux systems: U+1D400 = Mathematical Bold Capital A Debian 9.13 (glibc 2.24) => select 'A' < E'\U0001D400' collate "C.UTF-8"; ?column? ---------- t Debian 10.13 (glibc 2.28) => select 'A' < E'\U0001D400' collate "C.UTF-8"; ?column? ---------- f Debian 11.6 (glibc 2.31) => select 'A' < E'\U0001D400' collate "C.UTF-8"; ?column? ---------- f Ubuntu 22.04 (glibc 2.35) => select 'A' < E'\U0001D400' collate "C.UTF-8"; ?column? ---------- t So I suggest the attached patch to no longer exclude these collations from the generic versioning. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Attachment
On Wed, Apr 19, 2023 at 12:36 AM Daniel Verite <daniel@manitou-mail.org> wrote: > This seems to be based on the idea that C.* collations provide an > immutable sort like "C", but it appears that it's not the case. Hmm. It seems I added that exemption initially for FreeBSD only in ca051d8b101, and then merged the cases for several OSes in beb4480c853. It's extremely surprising to me that the sort order changed. I expected the sort order to be code point order: https://sourceware.org/glibc/wiki/Proposals/C.UTF-8 One interesting thing is that it seems that it might have been independently invented by Debian (?) and then harmonised with glibc 2.35: https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1871363.html Was the earlier Debian version buggy, or did it simply have a different idea of what the sort order should be, intentionally? Ugh. From your examples, we can see that the older Debian system did not have A < [some 4 digit code point], while the later version did (as expected). If so then it might be tempting to *not* do what you're suggesting, since the stated goal of the thing is to be stable from now on. But it changed once in the early years of its existence! Annoying. Many OSes have a locale with this name. I don't know this history, who did it first etc, but now I am wondering if they all took the "obvious" interpretation, that it should be code-point based, extrapolating from "C" (really memcmp order): https://unix.stackexchange.com/questions/597962/how-widespread-is-the-c-utf-8-locale
On Wed, 2023-04-19 at 07:48 +1200, Thomas Munro wrote: > Many OSes have a locale with this name. I don't know this history, > who did it first etc, but now I am wondering if they all took the > "obvious" interpretation, that it should be code-point based, > extrapolating from "C" (really memcmp order): memcmp() is not the same as code-point order in all encodings, right? I've been thinking that we should have a "provider=none" for the special cases that use memcmp(). It's not using libc as a collation provider; it's really postgres in control of the semantics. That would clean up the documentation and the code a bit, and make it more clear which locales are being passed to the provider and which ones aren't. If we are passing it to a provider (e.g. "C.UTF-8"), we shouldn't make unnecessary assumptions about what the provider will do with it. For what it's worth, in my recent ICU language tag work, I special- cased ICU locales with language "C" or "POSIX" to map to "en-US-u-va- posix", disregarding everything else (collation attributes, etc.). I believe that's the right thing based on the behavior I observed: for the POSIX variant of en-US, ICU seems to disregard other things such as case insensitivity. But it still ultimately goes to the provider and ICU has particular rules for that locale -- I don't assume memcpy-like semantics or code point order. Regards, Jeff Davis
On Wed, Apr 19, 2023 at 1:30 PM Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2023-04-19 at 07:48 +1200, Thomas Munro wrote: > > Many OSes have a locale with this name. I don't know this history, > > who did it first etc, but now I am wondering if they all took the > > "obvious" interpretation, that it should be code-point based, > > extrapolating from "C" (really memcmp order): > > memcmp() is not the same as code-point order in all encodings, right? Right. I wasn't trying to suggest that *we* should assume that, I was just thinking out loud about how a libc implementor would surely think that a "C.encoding" should work, in the spirit of "C", given that the standard doesn't tell us IIUC. It looks like for technical reasons inside glibc, that couldn't be done before 2.35: https://sourceware.org/bugzilla/show_bug.cgi?id=17318 That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied by the glibc project) isn't supposed to be versioned, but it's extremely unfortunate that a bunch of OSes (Debian and maybe more) have been sorting text in some other order under that name for years. > I've been thinking that we should have a "provider=none" for the > special cases that use memcmp(). It's not using libc as a collation > provider; it's really postgres in control of the semantics. Yeah, interesting idea.
Thomas Munro wrote: > It looks like for technical reasons > inside glibc, that couldn't be done before 2.35: > > https://sourceware.org/bugzilla/show_bug.cgi?id=17318 > > That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied > by the glibc project) isn't supposed to be versioned, but it's > extremely unfortunate that a bunch of OSes (Debian and maybe more) > have been sorting text in some other order under that name for > years. Yes. This is consistent with Debian/Ubuntu patches in glibc/localedata/locales/C glibc-2.35 is not patched, and upstream has this: LC_COLLATE % The keyword 'codepoint_collation' in any part of any LC_COLLATE % immediately discards all collation information and causes the % locale to use strcmp/wcscmp for collation comparison. This is % exactly what is needed for C (ASCII) or C.UTF-8. codepoint_collation END LC_COLLATE But in older versions, glibc doesn't have the locales/C data file. Debian adds it in debian/patches/localedata/C with that kind of content: * glibc 2.31 Debian 11 LC_COLLATE order_start forward <U0000> .. <U007F> <U0080> .. <U00FF> etc... But as explained in the above-linked bugzilla entry, that did not result in true byte-comparison semantics, for several reasons that got fixed in 2.35. So this looks like a solved problem for anyone starting to use these collation with glibc 2.35 or newer (or other OSes that don't have a compatibility issue with them in the first place). But Debian/Ubuntu users upgrading from the older C.* to 2.35+ will not be having the normal warning about the need to reindex. I understand that my proposal to version C.* like any other collation might be erring on the side of caution, but ignoring these collation changes on at least one major OS does not feel right either. Maybe we should consider doing platform-dependent checks? Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Wed, 2023-04-19 at 14:07 +1200, Thomas Munro wrote: > That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied > by > the glibc project) isn't supposed to be versioned, but it's extremely > unfortunate that a bunch of OSes (Debian and maybe more) have been > sorting text in some other order under that name for years. What should we do with locales like C.UTF-8 in both libc and ICU? We either need to capture it and use the memcmp/pg_ascii code paths so it doesn't use the provider at all (like C); or if we send it to the provider, we can't have too many expectations about what will be done with it (even if we know what "should" happen). If we capture it like the C locale, then where do we draw the line? Any locale that begins with "C."? What if the language part is C but there is some other part to the locale? What about lower case? Should all of these apply the same way except with POSIX? What about backwards compatibility? If we pass it to the provider: * ICU: Recent versions of ICU don't recognize C.UTF-8 at all, and if you try to open it, you'll get the root collator (with warning or error, which is not great for such a common locale name). ICU versions 63 and earlier recognize C.UTF-8 as en-US-u-va-posix (a.k.a. en_US_POSIX), which has some adjustments to match expectations of C sorting (e.g. upper case first). * libc: problems as raised in this thread. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > What should we do with locales like C.UTF-8 in both libc and ICU? I vote for passing those to the existing C-specific code paths, whereever we have any (not sure that we do for <ctype.h> functionality). The semantics are quite well-defined and I can see no good coming of allowing either provider to mess with them. > If we capture it like the C locale, then where do we draw the line? Any > locale that begins with "C."? What if the language part is C but there > is some other part to the locale? What about lower case? Should all of > these apply the same way except with POSIX? What about backwards > compatibility? Probably "C", or "C.anything", or "POSIX", or "POSIX.anything". Case-independent might be good, but we haven't accepted such in the past, so I don't feel strongly about it. (Arguably, passing lower case "c" to the provider would provide an "out" to anybody who dislikes our choices here.) regards, tom lane
On Thu, 2023-05-25 at 14:48 -0400, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > What should we do with locales like C.UTF-8 in both libc and ICU? > > I vote for passing those to the existing C-specific code paths, Great, this would be a big step toward solving the ICU usability issues in this thread: https://postgr.es/m/000b01d97465%24c34bbd60%2449e33820%24%40pcorp.us > Probably "C", or "C.anything", or "POSIX", or "POSIX.anything". > Case-independent might be good, but we haven't accepted such in > the past, so I don't feel strongly about it. (Arguably, passing > lower case "c" to the provider would provide an "out" to anybody > who dislikes our choices here.) Patch attached with your suggestions. It's based on the first patch in the series I posted here: https://postgr.es/m/a4388fa3acabf7794ac39fdb471ad97eebdfbe11.camel@j-davis.com We still need to consider backwards compatibility. If someone has a collation with locale name C.UTF-8 in an earlier version, any change to the interpretation of that locale name after an upgrade carries a corruption risk. The risks are different in ICU vs libc: For ICU: iculocale=C in an earlier version was a mistake that must have been explicitly requested by the user. However, if such a mistake was made, the indexes would have been created using the ICU root locale, which is very different from the C locale. So reinterpreting iculocale=C as memcmp() would be likely to result in index corruption. Patch 0002 (also based on a patch from the series linked above) solves this with a pg_upgrade check for iculocale=C in versions 15 and earlier. The upgrade check is not likely to affect many users, and those it does affect have a mis-defined collation and would benefit from the check. For libc: this change may affect any user who happened to have LANG=C.UTF-8 in their environment at initdb time, which is probably a lot of users, and some buildfarm members. However, the average risk seems to be much lower, because we've gone a long time with the assumption that C.UTF-8 has the same behavior as C, and this only recently came up. Also, I'm not sure how obscure the cases are even if there is a difference; perhaps they don't often occur in practice? It's not clear to me how we mitigate this risk further, though. Regards, Jeff Davis
Attachment
On Fri, 2023-05-26 at 10:43 -0700, Jeff Davis wrote: > We still need to consider backwards compatibility. If someone has a > collation with locale name C.UTF-8 in an earlier version, any change > to > the interpretation of that locale name after an upgrade carries a > corruption risk. The risks are different in ICU vs libc: ... > For libc: this change may affect any user who happened to have > LANG=C.UTF-8 in their environment at initdb time, which is probably a > lot of users, and some buildfarm members. However, the average risk > seems to be much lower, because we've gone a long time with the > assumption that C.UTF-8 has the same behavior as C, and this only > recently came up. Also, I'm not sure how obscure the cases are even > if > there is a difference; perhaps they don't often occur in practice? > It's > not clear to me how we mitigate this risk further, though. We can avoid this risk by converting C.anything or POSIX.anything to plain "C" or "POSIX", respectively, for new collations before storing the string in the catalog. For upgraded collations, we can preserve the existing locale name. When opening the locale, we would still only recognize plain "C" and "POSIX" as the C locale. That would be more consistent behavior for new users, without creating a backwards compatibility problem for existing users who happened to create a collation with C.UTF-8. For ICU users, we'd still need the upgrade check, because even the "C" locale was not implemented with memcmp in prior versions. But I think that's fine and should be done anyway, as the behavior in that case was incorrect and was almost certainly a mistake by the user. Regards, Jeff Davis
Jeff Davis wrote: > > For libc: this change may affect any user who happened to have > > LANG=C.UTF-8 in their environment at initdb time, which is probably a > > lot of users, and some buildfarm members. However, the average risk > > seems to be much lower, because we've gone a long time with the > > assumption that C.UTF-8 has the same behavior as C, and this only > > recently came up. Currently, neither lc_collate_is_c() nor lookup_collation_cache() think that C.UTF-8 is a C collation, since they do that kind of test: if (strcmp(localeptr, "C") == 0) result = true; else if (strcmp(localeptr, "POSIX") == 0) result = true; else result = false; What is relatively new (v15) is that we compute a version for libc collations in get_collation_actual_version(), with code that assumes that C.* does not need a version, implying that it's immune to Unicode changes. What came up in this thread is that this assumption is not true for at least one major platform: Debian/Ubuntu for releases occurring before 2022 (glibc < 2.35). > We can avoid this risk by converting C.anything or POSIX.anything to > plain "C" or "POSIX", respectively, for new collations before storing > the string in the catalog. For upgraded collations, we can preserve the > existing locale name. When opening the locale, we would still only > recognize plain "C" and "POSIX" as the C locale. Then Postgres would not sort the same as the operating system with the same locale, at least on some OS. Concerning glibc, after waiting a few years, glibc<2.35 will be obsolete, and C.UTF-8 sorting like C will happen by itself. But in the meantime, personally I don't quite see why Postgres should start forcing C.UTF-8 to sort differently in the database than in the OS. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Mon, 2023-06-05 at 19:43 +0200, Daniel Verite wrote: > But in the meantime, personally I don't quite see why Postgres should > start forcing C.UTF-8 to sort differently in the database than in the > OS. I can see both points of view. It could be surprising to users if C.UTF-8 does not sort like C/memcmp, or surprising if it changes out from under them. It could also be surprising that it wouldn't sort like the current OS's libc interpretation of C.UTF-8. What about ICU? How should provider=icu locale=C.UTF-8 behave? We could: a. Just pass it to the provider and see what happens (older versions of ICU would interpret it as en-US-u-va-posix; newer versions would give the root locale). b. Consistently interpret it as en-US-u-va-posix. c. Don't pass it to the provider at all and treat it with memcmp semantics. Regards, Jeff Davis
On 6/6/23 15:23, Jeff Davis wrote: > On Mon, 2023-06-05 at 19:43 +0200, Daniel Verite wrote: >> But in the meantime, personally I don't quite see why Postgres should >> start forcing C.UTF-8 to sort differently in the database than in the >> OS. > > I can see both points of view. It could be surprising to users if > C.UTF-8 does not sort like C/memcmp, or surprising if it changes out > from under them. It could also be surprising that it wouldn't sort like > the current OS's libc interpretation of C.UTF-8. > > What about ICU? How should provider=icu locale=C.UTF-8 behave? We > could: > > a. Just pass it to the provider and see what happens (older versions of > ICU would interpret it as en-US-u-va-posix; newer versions would give > the root locale). > > b. Consistently interpret it as en-US-u-va-posix. > > c. Don't pass it to the provider at all and treat it with memcmp > semantics. Personally I think this should be (a). However we should also clearly document that the semantics of such is provider/OS dependent and therefore may not be what is expected/desired. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Jeff Davis wrote: > What about ICU? How should provider=icu locale=C.UTF-8 behave? We > could: > > a. Just pass it to the provider and see what happens (older versions of > ICU would interpret it as en-US-u-va-posix; newer versions would give > the root locale). > > b. Consistently interpret it as en-US-u-va-posix. > > c. Don't pass it to the provider at all and treat it with memcmp > semantics. I think b) and c) are quite problematic. First, en-US-u-va-posix does not sort like C.UTF-8 in glibc. For one thing it seems that en-US-u-va-posix assigns zero weights to some codepoints, which makes it semantically definitely different. For instance consider ZERO WIDTH SPACE (U+200B): postgres=# select 'ab' < E'a\u200Ba' COLLATE "C.utf8"; ?column? ---------- t postgres=# select 'ab' < E'a\u200Ba' COLLATE "en-US-u-va-posix-x-icu"; ?column? ---------- f Even if ICU folks refer to u-va-posix as approximating POSIX (as in [1]), for our purpose, either it sorts by codepoints or it does not, and it clearly does not. One consequence is that en-US-u-va-posix-x-icu needs to be versioned and indexes depending on it need to be rebuilt on upgrades. OTOH the goal with C.UTF-8, that is achieved in glibc>=2.35, is to not need that. Also it's not just about sorting. The semantics for the ctype-kind functions are also different. Consider matching '\d' in a regexp. With C.UTF-8 (glibc-2.35), we only match ASCII characters 0-9, or 10 codepoints. With "en-US-u-va-posix-x-icu" we match 660 codepoints comprising all the digit characters in all languages, plus a bunch of variants for mathematical symbols. For instance consider U+FF10 (Fullwidth Digit Zero): postgres=# select E'\uff10' collate "C.utf8" ~ '\d'; ?column? ---------- f postgres=# select E'\uff10' collate "en-US-u-va-posix-x-icu" ~ '\d'; ?column? ---------- t If someone dumps their C.UTF-8 database to reload into an ICU/en-US-u-va-posix database, there is no guarantee that it even reloads because of semantic differences occuring in constraints. In general it will surely reload, but the apps might not behave the same with the new database in a way that might be problematic. It's fine if that's what they want and they explicitly ask for this conversion, but it's not fine if it's postgres that has quietly decided that for them. About c) "don't pass it to the operators", it would be doable for sorting (ignoring the "glibc before 2.35 does not sort like that" issue) but not for the ctype-kind functions, where postgres' own code doesn't have the Unicode knowledge. About a) "just pass it to the provider", that seems better than b) or c), but still, when a user asks for provider=icu locale=C.UTF-8, it's a very probably a pilot error. To me the user would be best served by a warning, if not an error, informing them that it's quite probably not the combination they want. [1] https://sourceforge.net/p/icu/mailman/icu-support/thread/CAN49p6pvQKP93j8LMn3zBWhpk-T0qYD0TCuiHMv6Z3UPGFh3QQ%40mail.gmail.com/#msg35638356 Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
I wrote: > Consider matching '\d' in a regexp. With C.UTF-8 (glibc-2.35), we > only match ASCII characters 0-9, or 10 codepoints. With > "en-US-u-va-posix-x-icu" we match 660 codepoints comprising all the > digit characters in all languages, plus a bunch of variants for > mathematical symbols. BTW this not specifically a C.UTF-8 versus "en-US-u-va-posix-x-icu" difference. If think that any glibc-based locale will consider that \d in a regexp means [0-9], and that any ICU locale will make \d match a much larger variety of characters. While moving to ICU by default, we should expect that differences like that will affect apps in a way that might be more or less disruptive. Another known difference it that upper() with ICU does not do a character-by-character conversion, for instance: WITH words(w) as (values('muß'),('final')) SELECT w, length(w), upper(w collate "C.utf8") as "upper (libc)", length(upper(w collate "C.utf8")), upper(w collate "en-x-icu") as "upper (ICU)", length(upper(w collate "en-x-icu")) FROM words; w | length | upper libc | length | upper ICU | length ------+--------+------------+--------+-----------+-------- muß | 3 | MUß | 3 | MUSS | 4 final | 4 | fiNAL | 4 | FINAL | 5 The fact that the resulting string is larger that the original might cause problems. In general, we can't abstract from the fact that ICU semantics are different. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On 06.06.23 21:23, Jeff Davis wrote: > What about ICU? How should provider=icu locale=C.UTF-8 behave? We > could: It should be an error. > a. Just pass it to the provider and see what happens (older versions of > ICU would interpret it as en-US-u-va-posix; newer versions would give > the root locale). This, but with an error instead of a warning.
On Wed, 2023-06-07 at 23:28 +0200, Peter Eisentraut wrote: > On 06.06.23 21:23, Jeff Davis wrote: > > What about ICU? How should provider=icu locale=C.UTF-8 behave? We > > could: > > It should be an error. > > > a. Just pass it to the provider and see what happens (older > > versions of > > ICU would interpret it as en-US-u-va-posix; newer versions would > > give > > the root locale). > > This, but with an error instead of a warning. If we do that, a plain "initdb -D data" will fail if LANG=C.UTF-8. Perhaps that's fine, but certainly some buildfarm members would complain. I'm not sure how many users would be affected. Regards, Jeff Davis
On Sun, Apr 23, 2023 at 5:22 AM Daniel Verite <daniel@manitou-mail.org> wrote: > I understand that my proposal to version C.* like any other collation > might be erring on the side of caution, but ignoring these collation > changes on at least one major OS does not feel right either. > Maybe we should consider doing platform-dependent checks? Hmm, OK let's explore that. What could we do that would be helpful here, without affecting users of the "true" C.UTF-8 for the rest of time? This is a Debian (+ downstream distro) only problem as far as we know so far, and only for Debian 11 and older. Debian 12 was just released the other day as the new stable, so the window of opportunity to actually help anyone with our actions in this area is now beginning to close, because we're really talking about new databases initdb'd and new COLLATIONs CREATEd on Debian old stable after our next (August) release. The window may be much wider for long term Ubuntu releases. You're right that we could do something platform-specific to help with that: we could change that code so that it captures the version for C.* #if __GLIBC_MAJOR__ == 2 && __GLIBC_MINOR__ < 35 (or we could parse the string returned by the runtime version function). I don't recall immediately what our warning code does if it sees a change from versioned to not versioned, but we could make sure it does what you want here. That way we wouldn't burden all future users of C.* with version warnings (because it'll be empty), but we'd catch those doing Debian 11 -> 12 upgrades, and whatever Ubuntu upgrades that corresponds to, etc. Is it worth it?
On Thu, 2023-06-15 at 19:15 +1200, Thomas Munro wrote: > Hmm, OK let's explore that. What could we do that would be helpful > here, without affecting users of the "true" C.UTF-8 for the rest of > time? Where is the "true" C.UTF-8 defined? I assume you mean that the collation order can't (shouldn't, anyway) change. But what about the ctype (upper/lower/initcap) behavior? Is that also locked down for all time, or could it change if some new unicode characters are added? Would it be correct to interpret LC_COLLATE=C.UTF-8 as LC_COLLATE=C, but leave LC_CTYPE=C.UTF-8 as-is? Regards, Jeff Davis
On Sat, Jun 17, 2023 at 10:03 AM Jeff Davis <pgsql@j-davis.com> wrote: > On Thu, 2023-06-15 at 19:15 +1200, Thomas Munro wrote: > > Hmm, OK let's explore that. What could we do that would be helpful > > here, without affecting users of the "true" C.UTF-8 for the rest of > > time? > > Where is the "true" C.UTF-8 defined? By "true" I just meant glibc's official one, in contrast to the imposter from Debian oldstable's patches. It's not defined by any standard, but we only know how to record versions for glibc, FreeBSD and Windows, and we know what the first two of those do for that locale because they tell us (see below). For Windows, the manual's BNF-style description of acceptable strings doesn't appear to accept C.UTF-8 (but I haven't tried it). > I assume you mean that the collation order can't (shouldn't, anyway) > change. But what about the ctype (upper/lower/initcap) behavior? Is > that also locked down for all time, or could it change if some new > unicode characters are added? Fair point. Considering that our collversion effectively functions as a proxy for ctype version too, Daniel's patch makes a certain amount of sense. Our versioning is nominally based only on the collation category, not locales more generally or any other category they contain (nominally, as in: we named it collversion, and our code and comments and discussions so far only contemplated collations in this context). But, clearly, changes to underlying ctype data could also cause a constraint CHECK (x ~ '[[:digit:]]') or a partial index with WHERE (upper(x) <> 'ẞ') to be corrupted, which I'd considered to be a separate topic, but Daniel's patch would cover with the same mechanism. (Actually I just learned that [[:digit:]] is a bad example on a glibc system, because they appear to have hardcoded a test for [0-9] into their iswdigit_l() implementation, but FreeBSD gives the Unicode answer, which is subject to change, and other classes may work better on glibc.) > Would it be correct to interpret LC_COLLATE=C.UTF-8 as LC_COLLATE=C, > but leave LC_CTYPE=C.UTF-8 as-is? Yes. The basic idea, at least for these two OSes, is that every category behaves as if set to C, except LC_CTYPE. For implementation reasons the glibc people don't quite describe it that way[1]: for LC_COLLATE, they decode to codepoints first and then compare those using a new codepath they had to write for release 2.35, while FreeBSD skips that useless step and compares raw UTF-8 bytes like LC_COLLATE=C[2]. Which is the same, as RFC 3692 tells us: o The byte-value lexicographic sorting order of UTF-8 strings is the same as if ordered by character numbers. Of course this is of limited interest since a sort order based on character numbers is almost never culturally valid. It is interesting to note that LC_COLLATE=C, LC_CTYPE=C.UTF-8 is equivalent, but would not get version warnings with Daniel's patch, revealing that it's only a proxy. But recording ctype version separately would be excessive. For completeness, Solaris also has C.UTF-8. I can't read about what it does, the release notes are behind a signup thing. *shrug* I can't find any other systems that have it. [1] https://sourceware.org/glibc/wiki/Proposals/C.UTF-8 [2] https://reviews.freebsd.org/D17833
On Sat, 2023-06-17 at 17:54 +1200, Thomas Munro wrote: > > > Would it be correct to interpret LC_COLLATE=C.UTF-8 as > > LC_COLLATE=C, > > but leave LC_CTYPE=C.UTF-8 as-is? > > Yes. The basic idea, at least for these two OSes, is that every > category behaves as if set to C, except LC_CTYPE. If that's true, and we version C.UTF-8, then users could still get the behavior they want, a stable collation order, and benefit from the optimized code path by setting LC_COLLATE=C and LC_CTYPE=C.UTF-8. The only caveat is to be careful with things that depend on ctype in indexes and constraints. While still a problem, it's a smaller problem than unversioned collation. We should think a little more about solving it, because I think there's a strong case to be made that a default collation of C and a database ctype of something else is a good combination (it makes less sense for a case-insensitive collation, but those aren't allowed as a default collation). In any case, we're better off following the rule "version anything that goes to any external provider, period". And by "version", I really mean a best effort, because we don't always have great information, but I think it's better to record what we do have than not. We have just seen too many examples of weird behavior. On top of that, it's simply inconsistent to assume that C=C.UTF-8 for collation version, but not for the collation implementation. Users might get frustrated that the collation for C.UTF-8 is versioned, of course. But I don't think it will affect anyone for quite some time, because existing users will have a datcollversion=NULL; so they won't get the warnings until they refresh the versions (or create new collations/databases), and then after that upgrade libc. Right? So they should have time to adjust to use LC_COLLATE=C if that's what they want. An alternative would be to define lc_collate_is_c("C.UTF-8") == true while leaving lc_ctype_is_c("C.UTF-8") == false and get_collation_actual_version("C.UTF-8") == NULL. In that case we would not be passing it to an external provider, so we don't have to version it. But that might be a little too magical and I'm not inclined to do that. Another alternative would be to implement C.UTF-8 internally according to the "true" semantics, if they are truly simple and well-defined and stable. But I don't think ctype=C.UTF-8 is actually stable because new characters can be added, right? Regards, Jeff Davis
Thomas Munro wrote: > What could we do that would be helpful here, without affecting users > of the "true" C.UTF-8 for the rest of time? This is a Debian (+ > downstream distro) only problem as far as we know so far, and only > for Debian 11 and older. It seems to include RedHat-based distros as well. According to https://bugzilla.redhat.com/show_bug.cgi?id=902094 C.utf8 was added in 2015 and backported down to Fedora 22. RHEL8 / CentOS 8 / Rocky8 provide glibc 2.28 with a C.utf8 locale. We can reasonably suspect that they've been using the same kind of patches as Debian before version 12, with not all codepoints being sorted bytewise. RHEL9 comes with glibc 2.34 according to distrowatch [1] and the announcement [2], so presumably it also lacks the "real" C.utf8 with bytewise sorting that glibc 2.35 upstream added. [1] https://distrowatch.com/table.php?distribution=redhat [2] https://developers.redhat.com/articles/2022/05/18/whats-new-red-hat-enterprise-linux-9 Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Tue, Jun 20, 2023 at 6:48 AM Jeff Davis <pgsql@j-davis.com> wrote: > On Sat, 2023-06-17 at 17:54 +1200, Thomas Munro wrote: > > > Would it be correct to interpret LC_COLLATE=C.UTF-8 as > > > LC_COLLATE=C, > > > but leave LC_CTYPE=C.UTF-8 as-is? > > > > Yes. The basic idea, at least for these two OSes, is that every > > category behaves as if set to C, except LC_CTYPE. > > If that's true, and we version C.UTF-8, then users could still get the > behavior they want, a stable collation order, and benefit from the > optimized code path by setting LC_COLLATE=C and LC_CTYPE=C.UTF-8. > > The only caveat is to be careful with things that depend on ctype in > indexes and constraints. While still a problem, it's a smaller problem > than unversioned collation. We should think a little more about solving > it, because I think there's a strong case to be made that a default > collation of C and a database ctype of something else is a good > combination (it makes less sense for a case-insensitive collation, but > those aren't allowed as a default collation). > > In any case, we're better off following the rule "version anything that > goes to any external provider, period". And by "version", I really mean > a best effort, because we don't always have great information, but I > think it's better to record what we do have than not. We have just seen > too many examples of weird behavior. On top of that, it's simply > inconsistent to assume that C=C.UTF-8 for collation version, but not > for the collation implementation. Yeah, OK, you're convincing me. It's hard to decide because our model is basically wrong so it's only warning you about potential ctype changes by happy coincidence, but even in respect of sort order it was probably a mistake to start second-guessing what libc is doing, and with that observation about the C/C.UTF-8 combination, at least an end-user has a way to opt in/out of this choice. I'll try to write a concise commit message for Daniel's patch explaining all this and we can see about squeaking it into beta2. > Use rs might get frustrated that the collation for C.UTF-8 is versioned, > of course. But I don't think it will affect anyone for quite some time, > because existing users will have a datcollversion=NULL; so they won't > get the warnings until they refresh the versions (or create new > collations/databases), and then after that upgrade libc. Right? So they > should have time to adjust to use LC_COLLATE=C if that's what they > want. Yeah. > An alternative would be to define lc_collate_is_c("C.UTF-8") == true > while leaving lc_ctype_is_c("C.UTF-8") == false and > get_collation_actual_version("C.UTF-8") == NULL. In that case we would > not be passing it to an external provider, so we don't have to version > it. But that might be a little too magical and I'm not inclined to do > that. Agreed, let's not do any more of that sort of thing. > Another alternative would be to implement C.UTF-8 internally according > to the "true" semantics, if they are truly simple and well-defined and > stable. But I don't think ctype=C.UTF-8 is actually stable because new > characters can be added, right? Correct.