Thread: Does UCS_BASIC have the right CTYPE?

Does UCS_BASIC have the right CTYPE?

From

Jeff Davis

Date:

25 October 2023, 21:32:02

UCS_BASIC is defined in the standard as a collation based on comparing
the code point values, and in UTF8 that is satisfied with memcmp(), so
the collation locale for UCS_BASIC in Postgres is simply "C".
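
As a quick sanity check of the claim that byte-wise comparison of UTF-8 agrees with code point order, here is a minimal sketch in Python (used only for illustration). The sample strings and helper names are made up for this example and are not taken from Postgres.

```python
# Illustrative only: sorting by UTF-8 bytes (what memcmp() sees)
# gives the same order as sorting by code point values.
samples = ["z", "a", "\u00e1", "\u00c1", "\u20ac", "\U0001F600", "ab", "a\u0301"]

def by_code_points(s):
    # Compare strings code point by code point.
    return [ord(ch) for ch in s]

def by_utf8_bytes(s):
    # Compare the UTF-8 encoding byte by byte, as memcmp() would.
    return s.encode("utf-8")

sorted_by_cp = sorted(samples, key=by_code_points)
sorted_by_bytes = sorted(samples, key=by_utf8_bytes)
print(sorted_by_cp == sorted_by_bytes)
```

The property holds for UTF-8 specifically; other encodings of the same code points are not guaranteed to preserve it.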

But what should the result of UPPER('á' COLLATE UCS_BASIC) be? In
Postgres, the answer is 'á', but intuitively, one could reasonably
expect the answer to be 'Á'.
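
To make the difference concrete, the following Python sketch contrasts an ASCII-only uppercase (a hypothetical stand-in for "C" ctype semantics, not the Postgres implementation) with Python's built-in Unicode-aware str.upper().

```python
def upper_c_ctype(s):
    # Hypothetical stand-in for ASCII-only ctype: only a-z are mapped,
    # every other character is passed through unchanged.
    return "".join(chr(ord(ch) - 32) if "a" <= ch <= "z" else ch for ch in s)

word = "\u00e1b"               # 'á' followed by 'b'
print(upper_c_ctype(word))     # 'á' is left alone, 'b' becomes 'B'
print(word.upper())            # Unicode case mapping also maps 'á' to 'Á'
```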

Reading the standard, it seems that LOWER()/UPPER() are defined in
terms of the Unicode General Category (Section 4.2, "<fold> is a pair
of functions..."). It is somewhat ambiguous about the case mappings,
but I could guess that it means the Default Case Algorithm[1].

That seems to suggest the standard answer should be 'Á' regardless of
any COLLATE clause (though I could be misreading). I'm a bit confused
by that... what's the standard-compatible way to specify the locale for
UPPER()/LOWER()? If there is none, then it makes sense that Postgres
overloads the COLLATE clause for that purpose so that users can use a
different locale if they want.

But given that UCS_BASIC is a collation specified in the standard,
shouldn't it have ctype behavior that's as close to the standard as
possible?

Regards,
    Jeff Davis

[1] https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf#G33992

Re: Does UCS_BASIC have the right CTYPE?

From

Peter Eisentraut

Date:

26 October 2023, 17:49:55

On 25.10.23 20:32, Jeff Davis wrote:
> But what should the result of UPPER('á' COLLATE UCS_BASIC) be? In
> Postgres, the answer is 'á', but intuitively, one could reasonably
> expect the answer to be 'Á'.

I think that's right.  But what would you put into ctype to make that 
happen?

> That seems to suggest the standard answer should be 'Á' regardless of
> any COLLATE clause (though I could be misreading). I'm a bit confused
> by that... what's the standard-compatible way to specify the locale for
> UPPER()/LOWER()? If there is none, then it makes sense that Postgres
> overloads the COLLATE clause for that purpose so that users can use a
> different locale if they want.

The standard doesn't have the notion of locale-dependent case conversion.

Re: Does UCS_BASIC have the right CTYPE?

From

Jeff Davis

Date:

26 October 2023, 19:21:40

On Thu, 2023-10-26 at 16:49 +0200, Peter Eisentraut wrote:
> On 25.10.23 20:32, Jeff Davis wrote:
> > But what should the result of UPPER('á' COLLATE UCS_BASIC) be? In
> > Postgres, the answer is 'á', but intuitively, one could reasonably
> > expect the answer to be 'Á'.
>
> I think that's right.  But what would you put into ctype to make that
> happen?

It looks like using Unicode files to implement
upper()/lower()/initcap() behavior would not be very hard. The Default
Case Algorithm only needs a simple mapping for toUppercase() and
toLowercase().
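
A rough sketch of what "a simple mapping" could look like, in Python for illustration only. A real implementation would presumably generate the table from the Unicode data files; here the table is approximated by keeping only the one-to-one results of Python's str.upper(), which is an assumption of this example and may omit some simple mappings. The names SIMPLE_UPPER and simple_upper are hypothetical.

```python
import sys

def build_simple_upper_table():
    # Approximation: keep only code points whose uppercase form is a
    # single, different code point. Multi-character results (such as
    # U+00DF mapping to "SS") are skipped, so the character is left as-is.
    table = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        up = ch.upper()
        if len(up) == 1 and up != ch:
            table[cp] = ord(up)
    return table

SIMPLE_UPPER = build_simple_upper_table()

def simple_upper(s):
    # One table lookup per code point; the length never changes.
    return s.translate(SIMPLE_UPPER)

print(simple_upper("\u00e1"))
print(simple_upper("stra\u00dfe"))
```

Because each code point maps to at most one code point, the result always has the same number of code points as the input, which keeps the implementation simple.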

Our initcap() is not defined in the standard, and we document that it
only differentiates between alphanumeric and non-alphanumeric
characters, so we could get that behavior pretty easily as well. If we
wanted to do it the Unicode way instead, we can follow the
toTitlecase() part of the Default Case Algorithm, which is based on
word breaks and would require another lookup table for that.

I've already posted a patch that includes a lookup table for the
General Category, so that could be used for the rest of the ctype
behavior.
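
For illustration, character classification derived from the General Category might look like the Python sketch below. It uses Python's unicodedata module rather than the lookup table from the patch, and the predicate names are hypothetical. Treating "uppercase" as exactly category Lu is a simplification made for this example.

```python
import unicodedata

def is_alpha(ch):
    # Any letter category: Lu, Ll, Lt, Lm, Lo.
    return unicodedata.category(ch).startswith("L")

def is_digit(ch):
    # Decimal digits only.
    return unicodedata.category(ch) == "Nd"

def is_alnum(ch):
    return is_alpha(ch) or is_digit(ch)

def is_upper(ch):
    # Simplification: uppercase letters only.
    return unicodedata.category(ch) == "Lu"

def is_lower(ch):
    # Simplification: lowercase letters only.
    return unicodedata.category(ch) == "Ll"

for ch in ["\u00e1", "\u00c1", "5", "\u0663", " "]:
    print(repr(ch), unicodedata.category(ch), is_alnum(ch))
```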

Doing ctype based on built-in Unicode tables would be a good use case
for the "builtin" provider that I had previously proposed.

> The standard doesn't have the notion of locale-dependent case
> conversion.

That explains it. Interesting.

Regards,
    Jeff Davis

Re: Does UCS_BASIC have the right CTYPE?

From

Jeff Davis

Date:

26 October 2023, 21:42:27

On Thu, 2023-10-26 at 09:21 -0700, Jeff Davis wrote:
> Our initcap() is not defined in the standard, and we document that it
> only differentiates between alphanumeric and non-alphanumeric
> characters, so we could get that behavior pretty easily as well. If we
> wanted to do it the Unicode way instead, we can follow the
> toTitlecase() part of the Default Case Algorithm, which is based on
> word breaks and would require another lookup table for that.

Correction: the rules for word breaks are fairly complex, so it would
not be worth it to try to replicate that just to support initcap(). We
could just use the simple, existing, and documented rules for initcap()
which only differentiate between alphanumeric and not. Anyone who wants
the more sophisticated rules can just use an ICU collation with
initcap().
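
The simple rule described above can be sketched as follows, in Python for illustration. This is not the Postgres implementation; the function name is hypothetical, and str.isalnum(), str.upper() and str.lower() stand in for whatever ctype lookups a built-in collation would use.

```python
def initcap_simple(s):
    # Uppercase the first character of each alphanumeric run and
    # lowercase the rest; non-alphanumeric characters end the run.
    out = []
    prev_alnum = False
    for ch in s:
        if ch.isalnum():
            out.append(ch.lower() if prev_alnum else ch.upper())
            prev_alnum = True
        else:
            out.append(ch)
            prev_alnum = False
    return "".join(out)

print(initcap_simple("hello wORLD, foo-bar"))
```

No word-break tables are needed: the only per-character question is whether the character is alphanumeric.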

The point stands that it would be pretty simple to have a collation
that handles upper() and lower() in a standards-compliant way without
relying on libc or ICU. Unfortunately it's too late to call that
collation UCS_BASIC, but it would still be useful.

Regards,
    Jeff Davis

Re: Does UCS_BASIC have the right CTYPE?

From

"Daniel Verite"

Date:

27 October 2023, 00:22:24

Peter Eisentraut wrote:

> > That seems to suggest the standard answer should be 'Á' regardless of
> > any COLLATE clause (though I could be misreading). I'm a bit confused
> > by that... what's the standard-compatible way to specify the locale for
> > UPPER()/LOWER()? If there is none, then it makes sense that Postgres
> > overloads the COLLATE clause for that purpose so that users can use a
> > different locale if they want.
>
> The standard doesn't have the notion of locale-dependent case conversion.

Neither does Unicode, which is why the ICU functions like u_isupper()
or u_toupper() don't take a locale argument.

With libc, isupper_l() and the other ctype functions need a locale
argument, but given a locale's value of
"language[_territory][.codeset]", in theory only the codeset part is
actually useful.

To me the question of what we should put in pg_collation.collctype
for the "ucs_basic" collation leads to another question which is:
why do we even consider collctype in the first place?

Within a database, there's only one "codeset", which corresponds
to pg_database.encoding, and there's a value in pg_database.lc_ctype
that is normally compatible with that encoding.
ISTM that UPPER(string COLLATE "whatever") should always give
the same result as UPPER(string COLLATE pg_catalog.default). And
likewise all functions that depend on character categories could
basically ignore the COLLATE specification, given that our
database-wide properties are sufficient to characterize the strings
within.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Does UCS_BASIC have the right CTYPE?

From

Tom Lane

Date:

27 October 2023, 00:32:14

"Daniel Verite" <daniel@manitou-mail.org> writes:
> To me the question of what we should put in pg_collation.collctype
> for the "ucs_basic" collation leads to another question which is:
> why do we even consider collctype in the first place?

For starters, C locale should certainly act different from others.

I'm not sold that arguing from Unicode's behavior to other encodings
makes sense, either.  Unicode can get away with defining that there's
only one case-folding rule because they have the luxury of inventing
new code points when the "same" glyph should act differently according
to different languages' rules.  Encodings with a small number of code
points don't have that luxury.  In particular see the mess around dotted
and dotless I/J in Turkish vs. everywhere else.

            regards, tom lane

Re: Does UCS_BASIC have the right CTYPE?

From

Jeff Davis

Date:

27 October 2023, 01:48:26

On Thu, 2023-10-26 at 23:22 +0200, Daniel Verite wrote:
> Neither does Unicode, which is why the ICU functions like u_isupper()
> or u_toupper() don't take a locale argument.

u_strToUpper() accepts a locale argument:
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ustring_8h.html#aa64fbd4ad23af84d01c931d7cfa25f89

See also the part about tailorings here:
https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf#G33992
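
As a small illustration of what a tailoring changes, the Python sketch below compares default uppercasing with a hand-written approximation of the Turkish rule for dotted i. It does not call ICU; upper_turkish is a hypothetical helper covering only this one rule.

```python
def upper_default(s):
    # Untailored Unicode case mapping: 'i' maps to 'I'.
    return s.upper()

def upper_turkish(s):
    # Hand-written approximation of the Turkish tailoring:
    # dotted 'i' maps to U+0130 (capital I with dot above).
    # Dotless U+0131 already maps to plain 'I' by default.
    return s.replace("i", "\u0130").upper()

print(upper_default("istanbul"))
print(upper_turkish("istanbul"))
```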

Regards,
    Jeff Davis

Re: Does UCS_BASIC have the right CTYPE?

From

Jeff Davis

Date:

27 October 2023, 02:27:10

On Thu, 2023-10-26 at 17:32 -0400, Tom Lane wrote:
> For starters, C locale should certainly act different from others.

Agreed. ctype of "C" is 100% stable (as implemented in Postgres with
special ASCII-only semantics) and simple.

I'm looking for a way to offer a new middle ground between plain "C"
and buying into all of the problems with collation providers and
localization. We don't need to remove functionality to do so.

Providing Unicode ctype behavior doesn't look very hard. Collations
could select it either with a special name or by using the "builtin"
provider I proposed earlier. If the behavior does change with a new
Unicode version it would be easier to see and less likely to affect
on-disk structures than a collation change.

Regards,
    Jeff Davis