Re: pg_collation.collversion for C.UTF-8 - Mailing list pgsql-hackers

From Daniel Verite
Subject Re: pg_collation.collversion for C.UTF-8
Msg-id ac61fb5a-461a-4bdf-9201-68fa67b6242b@manitou-mail.org
In response to Re: pg_collation.collversion for C.UTF-8  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: pg_collation.collversion for C.UTF-8  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers

Thomas Munro wrote:

> It looks like for technical reasons
> inside glibc, that couldn't be done before 2.35:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=17318
>
> That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied
> by the glibc project) isn't supposed to be versioned, but it's
> extremely unfortunate that a bunch of OSes (Debian and maybe more)
> have been sorting text in some other order under that name for
> years.

Yes. This is consistent with the Debian/Ubuntu patches to
glibc/localedata/locales/C.

glibc-2.35 is not patched, and upstream has this:
  LC_COLLATE
  % The keyword 'codepoint_collation' in any part of any LC_COLLATE
  % immediately discards all collation information and causes the
  % locale to use strcmp/wcscmp for collation comparison.  This is
  % exactly what is needed for C (ASCII) or C.UTF-8.
  codepoint_collation
  END LC_COLLATE
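
To illustrate why strcmp is "exactly what is needed" for C.UTF-8, here
is a small Python sketch (the sample strings are arbitrary, chosen only
for illustration): comparing UTF-8 strings byte by byte, as strcmp
does, gives the same order as comparing them by Unicode code point.

```python
# Sketch: byte-wise comparison of UTF-8 text (strcmp semantics)
# orders strings the same way as comparing by Unicode code point.
# The sample strings are arbitrary and only serve as an illustration.
samples = ["b", "a", "é", "z", "Z", "\u20ac", "\U0001F600", "ab", ""]

# Order by the raw UTF-8 bytes, as strcmp would see them.
by_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

# Order by the sequence of code points.
by_codepoint = sorted(samples, key=lambda s: [ord(c) for c in s])

print(by_bytes == by_codepoint)
```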

But in older versions, glibc doesn't have the locales/C data file.
Debian adds it in debian/patches/localedata/C with this kind of
content:

* glibc 2.31  Debian 11
  LC_COLLATE
  order_start forward
  <U0000>
  ..
  <U007F>
  <U0080>
  ..
  <U00FF>
  etc...

But as explained in the above-linked bugzilla entry, that did not
result in true byte-comparison semantics, for several reasons
that got fixed in 2.35.
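
One rough way to probe a given machine is to compare what strcoll does
under C.UTF-8 with plain byte order. The Python sketch below does that;
the helper name c_utf8_matches_byte_order and the sample strings are
made up for this example, and a matching result on a handful of strings
proves nothing about the full code point range. It only shows the kind
of check involved.

```python
import functools
import locale


def c_utf8_matches_byte_order(samples, locale_name="C.UTF-8"):
    """Return True/False, or None if the locale is not installed.

    Hypothetical helper: compares strcoll ordering under the given
    locale with plain UTF-8 byte ordering for the given samples.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            return None  # locale not available on this system
        by_strcoll = sorted(samples, key=functools.cmp_to_key(locale.strcoll))
        by_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))
        return by_strcoll == by_bytes
    finally:
        # Restore whatever LC_COLLATE was in effect before the probe.
        locale.setlocale(locale.LC_COLLATE, saved)


if __name__ == "__main__":
    # Arbitrary sample strings, not a known reproducer for the old behaviour.
    print(c_utf8_matches_byte_order(["a", "B", "é", "\u20ac", "\U0001F600"]))
```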

So this looks like a solved problem for anyone starting to use these
collations with glibc 2.35 or newer (or on other OSes that don't have a
compatibility issue with them in the first place).
But Debian/Ubuntu users upgrading from the older C.* to 2.35+ will not
get the usual warning about the need to reindex.

I understand that my proposal to version C.* like any other collation
might be erring on the side of caution, but ignoring these collation
changes on at least one major OS does not feel right either.
Maybe we should consider doing platform-dependent checks?
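
As a very rough idea of what such a check might look like, here is a
Python sketch. It is not what PostgreSQL does, and every name in it
(CODEPOINT_COLLATION_SINCE, parse_glibc_version,
crosses_codepoint_boundary) is hypothetical. It assumes the only
relevant boundary is glibc 2.35, and it cannot tell whether the older
glibc was a distro-patched one.

```python
import platform
import re

# Assumption taken from this thread: codepoint_collation for C/C.UTF-8
# is available in upstream glibc starting with 2.35.
CODEPOINT_COLLATION_SINCE = (2, 35)


def parse_glibc_version(text):
    """Parse 'major.minor[...]' into an (int, int) tuple."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized glibc version: {text!r}")
    return int(match.group(1)), int(match.group(2))


def crosses_codepoint_boundary(old_version, new_version):
    """True if an upgrade goes from before 2.35 to 2.35 or later.

    Hypothetical policy: on such an upgrade, indexes built under a
    distro-patched C.* locale may need a reindex warning.
    """
    old = parse_glibc_version(old_version)
    new = parse_glibc_version(new_version)
    return old < CODEPOINT_COLLATION_SINCE <= new


if __name__ == "__main__":
    lib, ver = platform.libc_ver()  # e.g. ('glibc', '2.31'); empty elsewhere
    if lib == "glibc" and ver:
        try:
            running = parse_glibc_version(ver)
            print("glibc", ver, "has codepoint_collation:",
                  running >= CODEPOINT_COLLATION_SINCE)
        except ValueError:
            print("could not parse glibc version", ver)
```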



Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite


