Re: pg_collation.collversion for C.UTF-8 - Mailing list pgsql-hackers

From	Jeff Davis
Subject	Re: pg_collation.collversion for C.UTF-8
Date	April 19, 2023 01:30:13
Msg-id	8ad5213175b25f9ae9f9c38563caaf86cec14ec4.camel@j-davis.com
In response to	Re: pg_collation.collversion for C.UTF-8 (Thomas Munro <thomas.munro@gmail.com>)
Responses	Re: pg_collation.collversion for C.UTF-8
List	pgsql-hackers

On Wed, 2023-04-19 at 07:48 +1200, Thomas Munro wrote:
> Many OSes have a locale with this name.  I don't know this history,
> who did it first etc, but now I am wondering if they all took the
> "obvious" interpretation, that it should be code-point based,
> extrapolating from "C" (really memcmp order):

memcmp() is not the same as code-point order in all encodings, right?
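
To illustrate the question, here is a minimal Python sketch (not
PostgreSQL code). It assumes that sorting Python bytes objects is a fair
stand-in for memcmp() ordering and that Python's native str comparison
is a fair stand-in for Unicode code-point order. The helper names and
the sample characters are made up for this example.

```python
def byte_order(strings, encoding):
    """Sort strings by their encoded bytes, roughly what memcmp() sees."""
    return sorted(strings, key=lambda s: s.encode(encoding))


def code_point_order(strings):
    """Sort strings by Unicode code point (Python's native str ordering)."""
    return sorted(strings)


# Sample data (chosen for illustration): U+0104 (A with ogonek),
# U+00C1 (A with acute), and ASCII "z".
words = ["\u0104", "\u00c1", "z"]

by_code_point = code_point_order(words)
by_utf8_bytes = byte_order(words, "utf-8")
by_latin2_bytes = byte_order(words, "iso-8859-2")

# UTF-8 byte order agrees with code-point order for this sample.
print(by_utf8_bytes == by_code_point)
# ISO-8859-2 stores U+0104 as 0xA1 and U+00C1 as 0xC1, so the byte
# order of those two characters is the reverse of their code-point order.
print(by_latin2_bytes == by_code_point)
```

UTF-8 was designed so that byte-wise comparison preserves code-point
order; a legacy single-byte encoding such as ISO-8859-2 gives no such
guarantee relative to Unicode code points.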

I've been thinking that we should have a "provider=none" for the
special cases that use memcmp(). It's not using libc as a collation
provider; it's really postgres in control of the semantics.

That would clean up the documentation and the code a bit, and make it
more clear which locales are being passed to the provider and which
ones aren't.

If we are passing it to a provider (e.g. "C.UTF-8"), we shouldn't make
unnecessary assumptions about what the provider will do with it.

For what it's worth, in my recent ICU language tag work, I
special-cased ICU locales with language "C" or "POSIX" to map to
"en-US-u-va-posix", disregarding everything else (collation attributes,
etc.). I believe that's the right thing based on the behavior I
observed: for the POSIX variant of en-US, ICU seems to disregard other
things such as case insensitivity. But it still ultimately goes to the
provider and ICU has particular rules for that locale -- I don't assume
memcmp-like semantics or code point order.
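
The special case described above could be sketched roughly as follows.
This is a hypothetical Python illustration, not the actual PostgreSQL
implementation: the function name, the set of separators used to split
off the language part, and the case-insensitive match are all
assumptions made for the example.

```python
import re
from typing import Optional

# Target tag taken from the message above.
POSIX_TAG = "en-US-u-va-posix"


def posix_language_tag_sketch(locale_name: str) -> Optional[str]:
    """Return the POSIX-variant tag if the locale's language is C or POSIX.

    Everything after the language part (encoding, attributes, etc.) is
    ignored. Returns None for any other locale, meaning "no special case".
    """
    # Assumption: the language part ends at the first '_', '-', '.' or '@'.
    language = re.split(r"[_\-.@]", locale_name, maxsplit=1)[0]
    # Assumption: the comparison is case-insensitive.
    if language.lower() in ("c", "posix"):
        return POSIX_TAG
    return None


print(posix_language_tag_sketch("C.UTF-8"))
print(posix_language_tag_sketch("en_US"))
```

Note that the result is still handed to ICU as an ordinary locale; the
sketch only normalizes the name and says nothing about how ICU orders
strings for that locale.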

Regards,
    Jeff Davis
