Thread: pg_collation.collversion for C.UTF-8

pg_collation.collversion for C.UTF-8

From

"Daniel Verite"

Date:

18 April 2023, 12:35:50

 Hi,

get_collation_actual_version() in pg_locale.c currently
excludes C.UTF-8 (and more generally C.*) from versioning,
which makes pg_collation.collversion being empty for these
collations.

char *
get_collation_actual_version(char collprovider, const char *collcollate)
{
....
    if (collprovider == COLLPROVIDER_LIBC &&
        pg_strcasecmp("C", collcollate) != 0 &&
        pg_strncasecmp("C.", collcollate, 2) != 0 &&
        pg_strcasecmp("POSIX", collcollate) != 0)

This seems to be based on the idea that C.* collations provide an
immutable sort like "C", but it appears that it's not the case.

For instance, consider how these C.UTF-8 comparisons differ between
recent linux systems:

U+1D400 = Mathematical Bold Capital A

Debian 9.13 (glibc 2.24)
=> select  'A' < E'\U0001D400' collate "C.UTF-8";
 ?column?
----------
 t

Debian 10.13 (glibc 2.28)
=> select  'A' < E'\U0001D400' collate "C.UTF-8";
 ?column?
----------
 f

Debian 11.6 (glibc 2.31)
=> select  'A' < E'\U0001D400' collate "C.UTF-8";
 ?column?
----------
 f

Ubuntu 22.04 (glibc 2.35)
=> select  'A' < E'\U0001D400' collate "C.UTF-8";
 ?column?
----------
 t

So I suggest the attached patch to no longer exclude these collations
from the generic versioning.


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachment

versioning-c-dot-something-collations.patch

Re: pg_collation.collversion for C.UTF-8

From

Thomas Munro

Date:

18 April 2023, 19:48:05

On Wed, Apr 19, 2023 at 12:36 AM Daniel Verite <daniel@manitou-mail.org> wrote:
> This seems to be based on the idea that C.* collations provide an
> immutable sort like "C", but it appears that it's not the case.

Hmm.  It seems I added that exemption initially for FreeBSD only in
ca051d8b101, and then merged the cases for several OSes in
beb4480c853.

It's extremely surprising to me that the sort order changed.  I
expected the sort order to be code point order:

https://sourceware.org/glibc/wiki/Proposals/C.UTF-8

One interesting thing is that it seems that it might have been
independently invented by Debian (?) and then harmonised with glibc
2.35:

https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1871363.html

Was the earlier Debian version buggy, or did it simply have a
different idea of what the sort order should be, intentionally?  Ugh.
From your examples, we can see that the older Debian system did not
have A < [some 4 digit code point], while the later version did (as
expected).  If so then it might be tempting to *not* do what you're
suggesting, since the stated goal of the thing is to be stable from
now on.  But it changed once in the early years of its existence!
Annoying.

Many OSes have a locale with this name.  I don't know this history,
who did it first etc, but now I am wondering if they all took the
"obvious" interpretation, that it should be code-point based,
extrapolating from "C" (really memcmp order):

https://unix.stackexchange.com/questions/597962/how-widespread-is-the-c-utf-8-locale

Re: pg_collation.collversion for C.UTF-8

From

Jeff Davis

Date:

19 April 2023, 01:30:13

On Wed, 2023-04-19 at 07:48 +1200, Thomas Munro wrote:
> Many OSes have a locale with this name.  I don't know this history,
> who did it first etc, but now I am wondering if they all took the
> "obvious" interpretation, that it should be code-point based,
> extrapolating from "C" (really memcmp order):

memcmp() is not the same as code-point order in all encodings, right?

I've been thinking that we should have a "provider=none" for the
special cases that use memcmp(). It's not using libc as a collation
provider; it's really postgres in control of the semantics.

That would clean up the documentation and the code a bit, and make it
more clear which locales are being passed to the provider and which
ones aren't.

If we are passing it to a provider (e.g. "C.UTF-8"), we shouldn't make
unnecessary assumptions about what the provider will do with it.

For what it's worth, in my recent ICU language tag work, I special-
cased ICU locales with language "C" or "POSIX" to map to "en-US-u-va-
posix", disregarding everything else (collation attributes, etc.). I
believe that's the right thing based on the behavior I observed: for
the POSIX variant of en-US, ICU seems to disregard other things such as
case insensitivity. But it still ultimately goes to the provider and
ICU has particular rules for that locale -- I don't assume memcpy-like
semantics or code point order.

Regards,
    Jeff Davis

Re: pg_collation.collversion for C.UTF-8

From

Thomas Munro

Date:

19 April 2023, 02:07:13

On Wed, Apr 19, 2023 at 1:30 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2023-04-19 at 07:48 +1200, Thomas Munro wrote:
> > Many OSes have a locale with this name.  I don't know this history,
> > who did it first etc, but now I am wondering if they all took the
> > "obvious" interpretation, that it should be code-point based,
> > extrapolating from "C" (really memcmp order):
>
> memcmp() is not the same as code-point order in all encodings, right?

Right.  I wasn't trying to suggest that *we* should assume that, I was
just thinking out loud about how a libc implementor would surely think
that a "C.encoding" should work, in the spirit of "C", given that the
standard doesn't tell us IIUC.  It looks like for technical reasons
inside glibc, that couldn't be done before 2.35:

https://sourceware.org/bugzilla/show_bug.cgi?id=17318

That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied by
the glibc project) isn't supposed to be versioned, but it's extremely
unfortunate that a bunch of OSes (Debian and maybe more) have been
sorting text in some other order under that name for years.

> I've been thinking that we should have a "provider=none" for the
> special cases that use memcmp(). It's not using libc as a collation
> provider; it's really postgres in control of the semantics.

Yeah, interesting idea.

Re: pg_collation.collversion for C.UTF-8

From

"Daniel Verite"

Date:

22 April 2023, 17:22:24

    Thomas Munro wrote:

> It looks like for technical reasons
> inside glibc, that couldn't be done before 2.35:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=17318
>
> That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied
> by the glibc project) isn't supposed to be versioned, but it's
> extremely unfortunate that a bunch of OSes (Debian and maybe more)
> have been sorting text in some other order under that name for
> years.

Yes. This is consistent with Debian/Ubuntu patches in
glibc/localedata/locales/C

glibc-2.35 is not patched, and upstream has this:
  LC_COLLATE
  % The keyword 'codepoint_collation' in any part of any LC_COLLATE
  % immediately discards all collation information and causes the
  % locale to use strcmp/wcscmp for collation comparison.  This is
  % exactly what is needed for C (ASCII) or C.UTF-8.
  codepoint_collation
  END LC_COLLATE

But in older versions, glibc doesn't have the locales/C data file.
Debian adds it in debian/patches/localedata/C with that kind of
content:

* glibc 2.31  Debian 11
  LC_COLLATE
  order_start forward
  <U0000>
  ..
  <U007F>
  <U0080>
  ..
  <U00FF>
  etc...

But as explained in the above-linked bugzilla entry, that did not
result in true byte-comparison semantics, for several reasons
that got fixed in 2.35.

So this looks like a solved problem for anyone starting to use these
collation with glibc 2.35 or newer (or other OSes that don't have a
compatibility issue with them in the first place).
But Debian/Ubuntu users upgrading from the older C.* to 2.35+ will not
be having the normal warning about the need to reindex.

I understand that my proposal to version C.* like any other collation
might be erring on the side of caution, but ignoring these collation
changes on at least one major OS does not feel right either.
Maybe we should consider doing platform-dependent checks?

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: pg_collation.collversion for C.UTF-8

From

Jeff Davis

Date:

25 May 2023, 18:30:11

On Wed, 2023-04-19 at 14:07 +1200, Thomas Munro wrote:
> That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied
> by
> the glibc project) isn't supposed to be versioned, but it's extremely
> unfortunate that a bunch of OSes (Debian and maybe more) have been
> sorting text in some other order under that name for years.

What should we do with locales like C.UTF-8 in both libc and ICU?

We either need to capture it and use the memcmp/pg_ascii code paths so
it doesn't use the provider at all (like C); or if we send it to the
provider, we can't have too many expectations about what will be done
with it (even if we know what "should" happen).

If we capture it like the C locale, then where do we draw the line? Any
locale that begins with "C."? What if the language part is C but there
is some other part to the locale? What about lower case? Should all of
these apply the same way except with POSIX? What about backwards
compatibility?

If we pass it to the provider:

* ICU: Recent versions of ICU don't recognize C.UTF-8 at all, and if
you try to open it, you'll get the root collator (with warning or
error, which is not great for such a common locale name). ICU versions
63 and earlier recognize C.UTF-8 as en-US-u-va-posix (a.k.a.
en_US_POSIX), which has some adjustments to match expectations of C
sorting (e.g. upper case first).

* libc: problems as raised in this thread.

Regards,
    Jeff Davis

Re: pg_collation.collversion for C.UTF-8

From

Tom Lane

Date:

25 May 2023, 18:48:56

Jeff Davis <pgsql@j-davis.com> writes:
> What should we do with locales like C.UTF-8 in both libc and ICU? 

I vote for passing those to the existing C-specific code paths,
whereever we have any (not sure that we do for <ctype.h> functionality).
The semantics are quite well-defined and I can see no good coming of
allowing either provider to mess with them.

> If we capture it like the C locale, then where do we draw the line? Any
> locale that begins with "C."? What if the language part is C but there
> is some other part to the locale? What about lower case? Should all of
> these apply the same way except with POSIX? What about backwards
> compatibility?

Probably "C", or "C.anything", or "POSIX", or "POSIX.anything".
Case-independent might be good, but we haven't accepted such in
the past, so I don't feel strongly about it.  (Arguably, passing
lower case "c" to the provider would provide an "out" to anybody
who dislikes our choices here.)

            regards, tom lane

Re: pg_collation.collversion for C.UTF-8

From

Jeff Davis

Date:

26 May 2023, 17:43:09

On Thu, 2023-05-25 at 14:48 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > What should we do with locales like C.UTF-8 in both libc and ICU?
>
> I vote for passing those to the existing C-specific code paths,

Great, this would be a big step toward solving the ICU usability issues
in this thread:

https://postgr.es/m/000b01d97465%24c34bbd60%2449e33820%24%40pcorp.us

> Probably "C", or "C.anything", or "POSIX", or "POSIX.anything".
> Case-independent might be good, but we haven't accepted such in
> the past, so I don't feel strongly about it.  (Arguably, passing
> lower case "c" to the provider would provide an "out" to anybody
> who dislikes our choices here.)

Patch attached with your suggestions. It's based on the first patch in
the series I posted here:

https://postgr.es/m/a4388fa3acabf7794ac39fdb471ad97eebdfbe11.camel@j-davis.com

We still need to consider backwards compatibility. If someone has a
collation with locale name C.UTF-8 in an earlier version, any change to
the interpretation of that locale name after an upgrade carries a
corruption risk. The risks are different in ICU vs libc:

  For ICU: iculocale=C in an earlier version was a mistake that must
have been explicitly requested by the user. However, if such a mistake
was made, the indexes would have been created using the ICU root
locale, which is very different from the C locale. So reinterpreting
iculocale=C as memcmp() would be likely to result in index corruption.
Patch 0002 (also based on a patch from the series linked above) solves
this with a pg_upgrade check for iculocale=C in versions 15 and
earlier. The upgrade check is not likely to affect many users, and
those it does affect have a mis-defined collation and would benefit
from the check.

  For libc: this change may affect any user who happened to have
LANG=C.UTF-8 in their environment at initdb time, which is probably a
lot of users, and some buildfarm members. However, the average risk
seems to be much lower, because we've gone a long time with the
assumption that C.UTF-8 has the same behavior as C, and this only
recently came up. Also, I'm not sure how obscure the cases are even if
there is a difference; perhaps they don't often occur in practice? It's
not clear to me how we mitigate this risk further, though.

Regards,
    Jeff Davis

Attachment

Re: pg_collation.collversion for C.UTF-8

From

Jeff Davis

Date:

05 June 2023, 16:37:07

On Fri, 2023-05-26 at 10:43 -0700, Jeff Davis wrote:
> We still need to consider backwards compatibility. If someone has a
> collation with locale name C.UTF-8 in an earlier version, any change
> to
> the interpretation of that locale name after an upgrade carries a
> corruption risk. The risks are different in ICU vs libc:

...

>   For libc: this change may affect any user who happened to have
> LANG=C.UTF-8 in their environment at initdb time, which is probably a
> lot of users, and some buildfarm members. However, the average risk
> seems to be much lower, because we've gone a long time with the
> assumption that C.UTF-8 has the same behavior as C, and this only
> recently came up. Also, I'm not sure how obscure the cases are even
> if
> there is a difference; perhaps they don't often occur in practice?
> It's
> not clear to me how we mitigate this risk further, though.

We can avoid this risk by converting C.anything or POSIX.anything to
plain "C" or "POSIX", respectively, for new collations before storing
the string in the catalog. For upgraded collations, we can preserve the
existing locale name. When opening the locale, we would still only
recognize plain "C" and "POSIX" as the C locale.

That would be more consistent behavior for new users, without creating
a backwards compatibility problem for existing users who happened to
create a collation with C.UTF-8.

For ICU users, we'd still need the upgrade check, because even the "C"
locale was not implemented with memcmp in prior versions. But I think
that's fine and should be done anyway, as the behavior in that case was
incorrect and was almost certainly a mistake by the user.

Regards,
    Jeff Davis

Re: pg_collation.collversion for C.UTF-8

From

"Daniel Verite"

Date:

05 June 2023, 17:43:26

    Jeff Davis wrote:

> >   For libc: this change may affect any user who happened to have
> > LANG=C.UTF-8 in their environment at initdb time, which is probably a
> > lot of users, and some buildfarm members. However, the average risk
> > seems to be much lower, because we've gone a long time with the
> > assumption that C.UTF-8 has the same behavior as C, and this only
> > recently came up.

Currently, neither lc_collate_is_c() nor lookup_collation_cache()
think that C.UTF-8 is a C collation, since they do that kind of test:

        if (strcmp(localeptr, "C") == 0)
            result = true;
        else if (strcmp(localeptr, "POSIX") == 0)
            result = true;
        else
            result = false;

What is relatively new (v15) is that we compute a version for libc
collations in get_collation_actual_version(), with code that assumes
that C.* does not need a version, implying that it's immune to
Unicode changes. What came up in this thread is that this assumption
is not true for at least one major platform: Debian/Ubuntu for
releases occurring before 2022 (glibc < 2.35).

> We can avoid this risk by converting C.anything or POSIX.anything to
> plain "C" or "POSIX", respectively, for new collations before storing
> the string in the catalog. For upgraded collations, we can preserve the
> existing locale name. When opening the locale, we would still only
> recognize plain "C" and "POSIX" as the C locale.

Then Postgres would not sort the same as the operating system with the
same locale, at least on some OS. Concerning glibc, after waiting a
few years, glibc<2.35 will be obsolete, and C.UTF-8 sorting like C
will happen by itself.
But in the meantime, personally I don't quite see why Postgres should
start forcing C.UTF-8 to sort differently in the database than in the
OS.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: pg_collation.collversion for C.UTF-8

From

Jeff Davis

Date:

06 June 2023, 19:23:30

On Mon, 2023-06-05 at 19:43 +0200, Daniel Verite wrote:
> But in the meantime, personally I don't quite see why Postgres should
> start forcing C.UTF-8 to sort differently in the database than in the
> OS.

I can see both points of view. It could be surprising to users if
C.UTF-8 does not sort like C/memcmp, or surprising if it changes out
from under them. It could also be surprising that it wouldn't sort like
the current OS's libc interpretation of C.UTF-8.

What about ICU? How should provider=icu locale=C.UTF-8 behave? We
could:

a. Just pass it to the provider and see what happens (older versions of
ICU would interpret it as en-US-u-va-posix; newer versions would give
the root locale).

b. Consistently interpret it as en-US-u-va-posix.

c. Don't pass it to the provider at all and treat it with memcmp
semantics.

Regards,
    Jeff Davis

Re: pg_collation.collversion for C.UTF-8

From

Joe Conway

Date:

06 June 2023, 19:27:19

On 6/6/23 15:23, Jeff Davis wrote:
> On Mon, 2023-06-05 at 19:43 +0200, Daniel Verite wrote:
>> But in the meantime, personally I don't quite see why Postgres should
>> start forcing C.UTF-8 to sort differently in the database than in the
>> OS.
> 
> I can see both points of view. It could be surprising to users if
> C.UTF-8 does not sort like C/memcmp, or surprising if it changes out
> from under them. It could also be surprising that it wouldn't sort like
> the current OS's libc interpretation of C.UTF-8.
> 
> What about ICU? How should provider=icu locale=C.UTF-8 behave? We
> could:
> 
> a. Just pass it to the provider and see what happens (older versions of
> ICU would interpret it as en-US-u-va-posix; newer versions would give
> the root locale).
> 
> b. Consistently interpret it as en-US-u-va-posix.
> 
> c. Don't pass it to the provider at all and treat it with memcmp
> semantics.

Personally I think this should be (a). However we should also clearly 
document that the semantics of such is provider/OS dependent and 
therefore may not be what is expected/desired.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: pg_collation.collversion for C.UTF-8

From

"Daniel Verite"

Date:

07 June 2023, 14:08:05

Jeff Davis wrote:

> What about ICU? How should provider=icu locale=C.UTF-8 behave? We
> could:
>
> a. Just pass it to the provider and see what happens (older versions of
> ICU would interpret it as en-US-u-va-posix; newer versions would give
> the root locale).
>
> b. Consistently interpret it as en-US-u-va-posix.
>
> c. Don't pass it to the provider at all and treat it with memcmp
> semantics.

I think b) and c) are quite problematic.

First, en-US-u-va-posix does not sort like C.UTF-8 in glibc.
For one thing it seems that en-US-u-va-posix assigns zero weights to
some codepoints, which makes it semantically definitely different.
For instance consider ZERO WIDTH SPACE (U+200B):

postgres=# select 'ab' < E'a\u200Ba' COLLATE "C.utf8";
 ?column?
----------
 t

postgres=# select 'ab' < E'a\u200Ba' COLLATE "en-US-u-va-posix-x-icu";
 ?column?
----------
 f

Even if ICU folks refer to u-va-posix as approximating POSIX (as in [1]),
for our purpose, either it sorts by codepoints or it does not,
and it clearly does not. One consequence is that
en-US-u-va-posix-x-icu needs to be versioned and indexes
depending on it need to be rebuilt on upgrades.
OTOH the goal with C.UTF-8, that is achieved in glibc>=2.35,
is to not need that.

Also it's not just about sorting. The semantics for the ctype-kind
functions are also different.

Consider matching '\d' in a regexp. With C.UTF-8 (glibc-2.35), we only match
ASCII characters 0-9, or 10 codepoints.
With "en-US-u-va-posix-x-icu" we match 660 codepoints comprising
all the digit characters in all languages, plus a bunch of variants
for mathematical symbols.

For instance consider U+FF10 (Fullwidth Digit Zero):

postgres=# select E'\uff10' collate "C.utf8" ~ '\d';
 ?column?
----------
 f

postgres=# select E'\uff10' collate "en-US-u-va-posix-x-icu" ~ '\d';
 ?column?
----------
 t

If someone dumps their C.UTF-8 database to reload into an
ICU/en-US-u-va-posix database, there is no guarantee that it
even reloads because of semantic differences occuring
in constraints. In general it will surely reload,  but the apps
might not behave the same with the new database
in a way that might be problematic.
It's fine if that's what they want and they explicitly ask for this
conversion, but it's not fine if it's postgres that has quietly
decided that for them.

About c) "don't pass it to the operators", it would be doable for
sorting (ignoring the "glibc before 2.35 does not sort like that" issue)
but not for the ctype-kind functions, where postgres' own code
doesn't have the Unicode knowledge.

About a) "just pass it to the provider", that seems better than b) or
c), but still, when a user asks for provider=icu locale=C.UTF-8,
it's a very probably a pilot error.

To me the user would be best served by a warning, if not an error,
informing them that it's quite probably not the combination they want.

[1]

https://sourceforge.net/p/icu/mailman/icu-support/thread/CAN49p6pvQKP93j8LMn3zBWhpk-T0qYD0TCuiHMv6Z3UPGFh3QQ%40mail.gmail.com/#msg35638356

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: pg_collation.collversion for C.UTF-8

From

"Daniel Verite"

Date:

07 June 2023, 15:08:10

    I wrote:

> Consider matching '\d' in a regexp. With C.UTF-8 (glibc-2.35), we
> only match ASCII characters 0-9, or 10 codepoints.  With
> "en-US-u-va-posix-x-icu" we match 660 codepoints comprising all the
> digit characters in all languages, plus a bunch of variants for
> mathematical symbols.

BTW this not specifically a C.UTF-8 versus "en-US-u-va-posix-x-icu"
difference.
If think that any glibc-based locale will consider that \d
in a regexp means [0-9], and that any ICU locale
will make \d match a much larger variety of characters.

While moving to ICU by default, we should expect that
differences like that will affect apps in a way that might be
more or less disruptive.

Another known difference it that upper() with ICU does not do a
character-by-character conversion, for instance:

WITH words(w) as  (values('muß'),('ﬁnal'))
 SELECT
  w,
  length(w),
  upper(w collate "C.utf8") as "upper (libc)",
  length(upper(w collate "C.utf8")),
  upper(w collate "en-x-icu") as "upper (ICU)",
  length(upper(w collate "en-x-icu"))
FROM words;

  w   | length | upper libc | length | upper ICU | length
------+--------+------------+--------+-----------+--------
 muß  |      3 | MUß        |       3 | MUSS     |    4
 ﬁnal |      4 | ﬁNAL        |       4 | FINAL     |    5

The fact that the resulting string is larger that the original
might cause problems.

In general, we can't abstract from the fact that ICU semantics
are different.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: pg_collation.collversion for C.UTF-8

From

Peter Eisentraut

Date:

07 June 2023, 21:28:48

On 06.06.23 21:23, Jeff Davis wrote:
> What about ICU? How should provider=icu locale=C.UTF-8 behave? We
> could:

It should be an error.

> a. Just pass it to the provider and see what happens (older versions of
> ICU would interpret it as en-US-u-va-posix; newer versions would give
> the root locale).

This, but with an error instead of a warning.

Re: pg_collation.collversion for C.UTF-8

From

Jeff Davis

Date:

07 June 2023, 22:43:14

On Wed, 2023-06-07 at 23:28 +0200, Peter Eisentraut wrote:
> On 06.06.23 21:23, Jeff Davis wrote:
> > What about ICU? How should provider=icu locale=C.UTF-8 behave? We
> > could:
>
> It should be an error.
>
> > a. Just pass it to the provider and see what happens (older
> > versions of
> > ICU would interpret it as en-US-u-va-posix; newer versions would
> > give
> > the root locale).
>
> This, but with an error instead of a warning.

If we do that, a plain "initdb -D data" will fail if LANG=C.UTF-8.

Perhaps that's fine, but certainly some buildfarm members would
complain. I'm not sure how many users would be affected.

Regards,
    Jeff Davis

Re: pg_collation.collversion for C.UTF-8

From

Thomas Munro

Date:

15 June 2023, 07:15:08

On Sun, Apr 23, 2023 at 5:22 AM Daniel Verite <daniel@manitou-mail.org> wrote:
> I understand that my proposal to version C.* like any other collation
> might be erring on the side of caution, but ignoring these collation
> changes on at least one major OS does not feel right either.
> Maybe we should consider doing platform-dependent checks?

Hmm, OK let's explore that.  What could we do that would be helpful
here, without affecting users of the "true" C.UTF-8 for the rest of
time?  This is a Debian (+ downstream distro) only problem as far as
we know so far, and only for Debian 11 and older.  Debian 12 was just
released the other day as the new stable, so the window of opportunity
to actually help anyone with our actions in this area is now beginning
to close, because we're really talking about new databases initdb'd
and new COLLATIONs CREATEd on Debian old stable after our next
(August) release.  The window may be much wider for long term Ubuntu
releases.  You're right that we could do something platform-specific
to help with that: we could change that code so that it captures the
version for C.* #if __GLIBC_MAJOR__ == 2 && __GLIBC_MINOR__ < 35 (or
we could parse the string returned by the runtime version function).
I don't recall immediately what our warning code does if it sees a
change from versioned to not versioned, but we could make sure it does
what you want here.  That way we wouldn't burden all future users of
C.* with version warnings (because it'll be empty), but we'd catch
those doing Debian 11 -> 12 upgrades, and whatever Ubuntu upgrades
that corresponds to, etc.  Is it worth it?

Re: pg_collation.collversion for C.UTF-8

From

Jeff Davis

Date:

16 June 2023, 22:03:34

On Thu, 2023-06-15 at 19:15 +1200, Thomas Munro wrote:
> Hmm, OK let's explore that.  What could we do that would be helpful
> here, without affecting users of the "true" C.UTF-8 for the rest of
> time?

Where is the "true" C.UTF-8 defined?

I assume you mean that the collation order can't (shouldn't, anyway)
change. But what about the ctype (upper/lower/initcap) behavior? Is
that also locked down for all time, or could it change if some new
unicode characters are added?

Would it be correct to interpret LC_COLLATE=C.UTF-8 as LC_COLLATE=C,
but leave LC_CTYPE=C.UTF-8 as-is?

Regards,
    Jeff Davis

Re: pg_collation.collversion for C.UTF-8

From

Thomas Munro

Date:

17 June 2023, 05:54:35

On Sat, Jun 17, 2023 at 10:03 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2023-06-15 at 19:15 +1200, Thomas Munro wrote:
> > Hmm, OK let's explore that.  What could we do that would be helpful
> > here, without affecting users of the "true" C.UTF-8 for the rest of
> > time?
>
> Where is the "true" C.UTF-8 defined?

By "true" I just meant glibc's official one, in contrast to the
imposter from Debian oldstable's patches.  It's not defined by any
standard, but we only know how to record versions for glibc, FreeBSD
and Windows, and we know what the first two of those do for that
locale because they tell us (see below).  For Windows, the manual's
BNF-style description of acceptable strings doesn't appear to accept
C.UTF-8 (but I haven't tried it).

> I assume you mean that the collation order can't (shouldn't, anyway)
> change. But what about the ctype (upper/lower/initcap) behavior? Is
> that also locked down for all time, or could it change if some new
> unicode characters are added?

Fair point.  Considering that our collversion effectively functions as
a proxy for ctype version too, Daniel's patch makes a certain amount
of sense.

Our versioning is nominally based only on the collation category, not
locales more generally or any other category they contain (nominally,
as in: we named it collversion, and our code and comments and
discussions so far only contemplated collations in this context).
But, clearly, changes to underlying ctype data could also cause a
constraint CHECK (x ~ '[[:digit:]]') or a partial index with WHERE
(upper(x) <> 'ẞ') to be corrupted, which I'd considered to be a
separate topic, but Daniel's patch would cover with the same
mechanism.  (Actually I just learned that [[:digit:]] is a bad example
on a glibc system, because they appear to have hardcoded a test for
[0-9] into their iswdigit_l() implementation, but FreeBSD gives the
Unicode answer, which is subject to change, and other classes may work
better on glibc.)

> Would it be correct to interpret LC_COLLATE=C.UTF-8 as LC_COLLATE=C,
> but leave LC_CTYPE=C.UTF-8 as-is?

Yes.  The basic idea, at least for these two OSes, is that every
category behaves as if set to C, except LC_CTYPE.  For implementation
reasons the glibc people don't quite describe it that way[1]: for
LC_COLLATE, they decode to codepoints first and then compare those
using a new codepath they had to write for release 2.35, while FreeBSD
skips that useless step and compares raw UTF-8 bytes like
LC_COLLATE=C[2].  Which is the same, as RFC 3692 tells us:

   o  The byte-value lexicographic sorting order of UTF-8 strings is the
      same as if ordered by character numbers.  Of course this is of
      limited interest since a sort order based on character numbers is
      almost never culturally valid.

It is interesting to note that LC_COLLATE=C, LC_CTYPE=C.UTF-8 is
equivalent, but would not get version warnings with Daniel's patch,
revealing that it's only a proxy.  But recording ctype version
separately would be excessive.

For completeness, Solaris also has C.UTF-8.  I can't read about what
it does, the release notes are behind a signup thing.  *shrug*  I
can't find any other systems that have it.

[1] https://sourceware.org/glibc/wiki/Proposals/C.UTF-8
[2] https://reviews.freebsd.org/D17833

Re: pg_collation.collversion for C.UTF-8

From

Jeff Davis

Date:

19 June 2023, 18:47:56

On Sat, 2023-06-17 at 17:54 +1200, Thomas Munro wrote:
>
> > Would it be correct to interpret LC_COLLATE=C.UTF-8 as
> > LC_COLLATE=C,
> > but leave LC_CTYPE=C.UTF-8 as-is?
>
> Yes.  The basic idea, at least for these two OSes, is that every
> category behaves as if set to C, except LC_CTYPE.

If that's true, and we version C.UTF-8, then users could still get the
behavior they want, a stable collation order, and benefit from the
optimized code path by setting LC_COLLATE=C and LC_CTYPE=C.UTF-8.

The only caveat is to be careful with things that depend on ctype in
indexes and constraints. While still a problem, it's a smaller problem
than unversioned collation. We should think a little more about solving
it, because I think there's a strong case to be made that a default
collation of C and a database ctype of something else is a good
combination (it makes less sense for a case-insensitive collation, but
those aren't allowed as a default collation).

In any case, we're better off following the rule "version anything that
goes to any external provider, period". And by "version", I really mean
a best effort, because we don't always have great information, but I
think it's better to record what we do have than not. We have just seen
too many examples of weird behavior. On top of that, it's simply
inconsistent to assume that C=C.UTF-8 for collation version, but not
for the collation implementation.

Users might get frustrated that the collation for C.UTF-8 is versioned,
of course. But I don't think it will affect anyone for quite some time,
because existing users will have a datcollversion=NULL; so they won't
get the warnings until they refresh the versions (or create new
collations/databases), and then after that upgrade libc. Right? So they
should have time to adjust to use LC_COLLATE=C if that's what they
want.

An alternative would be to define lc_collate_is_c("C.UTF-8") == true
while leaving lc_ctype_is_c("C.UTF-8") == false and
get_collation_actual_version("C.UTF-8") == NULL. In that case we would
not be passing it to an external provider, so we don't have to version
it. But that might be a little too magical and I'm not inclined to do
that.

Another alternative would be to implement C.UTF-8 internally according
to the "true" semantics, if they are truly simple and well-defined and
stable. But I don't think ctype=C.UTF-8 is actually stable because new
characters can be added, right?

Regards,
    Jeff Davis

Re: pg_collation.collversion for C.UTF-8

From

"Daniel Verite"

Date:

21 June 2023, 17:07:14

    Thomas Munro wrote:

> What could we do that would be helpful here, without affecting users
> of the "true" C.UTF-8 for the rest of time?  This is a Debian (+
> downstream distro) only problem as far as we know so far, and only
> for Debian 11 and older.

It seems to include RedHat-based distros as well.

According to https://bugzilla.redhat.com/show_bug.cgi?id=902094
C.utf8 was added in 2015 and backported down to Fedora 22.
RHEL8 / CentOS 8 / Rocky8 provide glibc 2.28 with a C.utf8
locale. We can reasonably suspect that they've been using the same
kind of patches as Debian before version 12, with not all codepoints
being sorted bytewise.

RHEL9 comes with glibc 2.34 according to distrowatch [1] and the
announcement [2], so presumably it also lacks the "real" C.utf8
with bytewise sorting that glibc 2.35 upstream added.

[1] https://distrowatch.com/table.php?distribution=redhat
[2]
https://developers.redhat.com/articles/2022/05/18/whats-new-red-hat-enterprise-linux-9

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: pg_collation.collversion for C.UTF-8

From

Thomas Munro

Date:

22 June 2023, 21:22:19

On Tue, Jun 20, 2023 at 6:48 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Sat, 2023-06-17 at 17:54 +1200, Thomas Munro wrote:
> > > Would it be correct to interpret LC_COLLATE=C.UTF-8 as
> > > LC_COLLATE=C,
> > > but leave LC_CTYPE=C.UTF-8 as-is?
> >
> > Yes.  The basic idea, at least for these two OSes, is that every
> > category behaves as if set to C, except LC_CTYPE.
>
> If that's true, and we version C.UTF-8, then users could still get the
> behavior they want, a stable collation order, and benefit from the
> optimized code path by setting LC_COLLATE=C and LC_CTYPE=C.UTF-8.
>
> The only caveat is to be careful with things that depend on ctype in
> indexes and constraints. While still a problem, it's a smaller problem
> than unversioned collation. We should think a little more about solving
> it, because I think there's a strong case to be made that a default
> collation of C and a database ctype of something else is a good
> combination (it makes less sense for a case-insensitive collation, but
> those aren't allowed as a default collation).
>
> In any case, we're better off following the rule "version anything that
> goes to any external provider, period". And by "version", I really mean
> a best effort, because we don't always have great information, but I
> think it's better to record what we do have than not. We have just seen
> too many examples of weird behavior. On top of that, it's simply
> inconsistent to assume that C=C.UTF-8 for collation version, but not
> for the collation implementation.

Yeah, OK, you're convincing me.  It's hard to decide because our model
is basically wrong so it's only warning you about potential ctype
changes by happy coincidence, but even in respect of sort order it was
probably a mistake to start second-guessing what libc is doing, and
with that observation about the C/C.UTF-8 combination, at least an
end-user has a way to opt in/out of this choice.  I'll try to write a
concise commit message for Daniel's patch explaining all this and we
can see about squeaking it into beta2.

> Use rs might get frustrated that the collation for C.UTF-8 is versioned,
> of course. But I don't think it will affect anyone for quite some time,
> because existing users will have a datcollversion=NULL; so they won't
> get the warnings until they refresh the versions (or create new
> collations/databases), and then after that upgrade libc. Right? So they
> should have time to adjust to use LC_COLLATE=C if that's what they
> want.

Yeah.

> An alternative would be to define lc_collate_is_c("C.UTF-8") == true
> while leaving lc_ctype_is_c("C.UTF-8") == false and
> get_collation_actual_version("C.UTF-8") == NULL. In that case we would
> not be passing it to an external provider, so we don't have to version
> it. But that might be a little too magical and I'm not inclined to do
> that.

Agreed, let's not do any more of that sort of thing.

> Another alternative would be to implement C.UTF-8 internally according
> to the "true" semantics, if they are truly simple and well-defined and
> stable. But I don't think ctype=C.UTF-8 is actually stable because new
> characters can be added, right?

Correct.