Re: Order changes in PG16 since ICU introduction - Mailing list pgsql-hackers

From Daniel Verite
Subject Re: Order changes in PG16 since ICU introduction
Date
Msg-id beda0794-1d72-4584-8578-cf7d95fda396@manitou-mail.org
Whole thread Raw
In response to Re: Order changes in PG16 since ICU introduction  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
    Jeff Davis wrote:

> I guess where I'm confused is: why would a user actually want their
> database collation to be C.UTF-8? It's slower than C, our
> implementation doesn't properly version it (as you pointed out), and
> the semantics don't seem great ('Z' < 'a').

Because when LC_CTYPE=C, characters outside of US ASCII are not
categorized properly. upper/lower/regexp matching/... produce
incorrect results.

> But if they don't specify the provider, isn't it much more likely they
> just don't care much about the locale, and would be happier with C?

Consider a pre-existing script doing initdb --locale=C.UTF-8
Surely it does care about the locale, otherwise it would not specify
it.
Assuming that it would be better off with C is assuming that a
non-Unicode aware locale is better than the Unicode-aware locale
they're asking. I don't think it's reasonable.

> The user can easily get libc behavior by specifying --locale-
> provider=libc, so I don't see how you reached this conclusion.

What would be user hostile is forcing users that don't need an ICU
locale to change their invocations of initdb/createdb to avoid
regressions with v16. Most people would discover this after
it breaks their apps.

> It looks like you are fine with 0003 applying LOCALE to whatever
> provider is chosen, but you'd like to be smarter about choosing the
> provider and to choose libc in at least some cases.
>
> That is actually very much like option #2 in the list I presented
> here[2], and has the same problems. How should the following behave?
>
>   initdb --locale=C --lc-collate=fr_FR.utf8
>   initdb --locale=en --lc-collate=fr_FR.utf8

The same as v15.

> If we switch to libc in the first case, then --locale will be ignored
> and the collation will be fr_FR.utf8.

$ initdb --locale=C --lc-collate=fr_FR.utf8
v15 does that:

  The database cluster will be initialized with this locale configuration:
    provider:     libc
    LC_COLLATE:  fr_FR.utf8
    LC_CTYPE:     C
    LC_MESSAGES: C
    LC_MONETARY: C
    LC_NUMERIC:  C
    LC_TIME:     C
  The default database encoding has accordingly been set to "SQL_ASCII".

--locale is not ignored, it's overriden for LC_COLLATE only.

> But we will leave the second case as ICU and the collation will be
> "en".

Yes. To me the rule for "ICU is the default" in v16 should be: if the
--locale argument points to a locale that we know ICU does not provide,
we fall back to the v15 behavior down to every detail, otherwise we let
ICU be the provider.

> You also suggested that we consider switching the provider to libc any
> time ICU doesn't support something. I'm not sure whether you meant a
> static list (C, C.UTF-8, POSIX, ...?) or some kind of dynamic test.

C, C.*, POSIX. I'm not sure if there are other cases.

> I'm also not clear whether you think we should abandon the built-in
> provider, or still select it for C/POSIX.

I see it as going in v17, because it came after feature freeze and
is not strictly necessary in v16.



Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



pgsql-hackers by date:

Previous
From: David Steele
Date:
Subject: Re: Large files for relations
Next
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: Time delayed LR (WAS Re: logical replication restrictions)