Re: Built-in CTYPE provider - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Built-in CTYPE provider
Date
Msg-id 7451f81ba0cb512222ab759de8ca1cffe44e9acb.camel@j-davis.com
Whole thread Raw
In response to Re: Built-in CTYPE provider  (Peter Eisentraut <peter@eisentraut.org>)
Responses Re: Built-in CTYPE provider
Re: Built-in CTYPE provider
List pgsql-hackers
On Thu, 2024-03-14 at 09:54 +0100, Peter Eisentraut wrote:
> * doc/src/sgml/charset.sgml
>
> I don't understand the purpose of this sentence:
>
> "When using this locale, the behavior may depend on the database
> encoding."

The "C" locale (in either the builtin or libc provider) can sort
differently in different encodings, because it's based on memcmp. For
instance:

   select U&'\20AC' > U&'\201A' collate "C";

Returns true in UTF-8 and false in WIN1252. That's why UCS_BASIC is
only available in UTF-8, because (at least for some encodings) we'd
have to decode before comparison to get the code-point-order semantics
right.

In other words, the "C" collation is not a well-defined order, but
UCS_BASIC and C.UTF-8 are well-defined.

Suggestions for better wording are welcome.

> * doc/src/sgml/ref/create_database.sgml
>
> The new parameter builtin_locale is not documented.

Thank you, fixed in 0001 (review fixup).

> * src/backend/commands/collationcmds.c
>
> I think DefineCollation() should set collencoding = -1 for the
> COLLPROVIDER_BUILTIN case.  -1 stands for any encoding.  Or at least
> explain why not?

In the attached v25-0001 (review fixup) I have made it the
responsibility of a function, and then extended that for the C.UTF-8
(0002) and PG_UNICODE_FAST locales (0007).

> * src/backend/utils/adt/pg_locale.c
>
> This part is a bit confusing:
>
> +           cache_entry->collate_is_c = true;
> +           cache_entry->ctype_is_c = (strcmp(colllocale, "C") == 0);
>
> Is collate always C but ctype only sometimes?  Does this anticipate
> future patches in this series?  Maybe in this patch it should always
> be true?

Made it a constant in v25-0001, and changed it in 0002

>
> * src/bin/initdb/initdb.c
>
> +   printf(_("      --builtin-locale=LOCALE   set builtin locale name
> for new databases\n"));
>
> Put in a line break so that the right "column" lines up.

Fixed in 0001

> This output should line up better:
>
> The database cluster will be initialized with this locale
> configuration:
>    default collation provider:  icu
>    default collation locale:    en
>    LC_COLLATE:  C
>    LC_CTYPE:    C
>    ...
>
> Also, why are there two spaces after "provider:  "?
>
> Also we call these locale provider on input, why are they collation
> providers on output?  What is a "collation locale"?

I tried to fix these things in 0001.

> * src/bin/pg_upgrade/t/002_pg_upgrade.pl
>
> +if ($oldnode->pg_version >= '17devel')
>
> This is weird.  >= is a numeric comparison, so providing a string
> with
> non-digits is misleading at best.

It's actually not a numeric comparison, it's an overloaded comparison
op for the Version class.

See 32dd2c1eff and:
https://www.postgresql.org/message-id/1738174.1710274577%40sss.pgh.pa.us

> * src/test/icu/t/010_database.pl
>
> -# Test that LOCALE works for ICU locales if LC_COLLATE and LC_CTYPE
> -# are specified
>
> Why remove this test?

It must have been lost during a rebase, fixed in 0001.

> Change the name of the new database to be different from the name of
> the template database.

Fixed in 0001.

New series attached.

Regards,
    Jeff Davis

Attachment

pgsql-hackers by date:

Previous
From: David Steele
Date:
Subject: Re: Add basic tests for the low-level backup method.
Next
From: Jeff Davis
Date:
Subject: Re: Built-in CTYPE provider