Re: Order changes in PG16 since ICU introduction - Mailing list pgsql-hackers
From | Daniel Verite |
---|---|
Subject | Re: Order changes in PG16 since ICU introduction |
Date | |
Msg-id | beda0794-1d72-4584-8578-cf7d95fda396@manitou-mail.org |
In response to | Re: Order changes in PG16 since ICU introduction (Jeff Davis <pgsql@j-davis.com>) |
List | pgsql-hackers |

Jeff Davis wrote:

> I guess where I'm confused is: why would a user actually want their
> database collation to be C.UTF-8? It's slower than C, our
> implementation doesn't properly version it (as you pointed out), and
> the semantics don't seem great ('Z' < 'a').

Because when LC_CTYPE=C, characters outside of US ASCII are not
categorized properly. upper/lower/regexp matching/... produce
incorrect results.

> But if they don't specify the provider, isn't it much more likely they
> just don't care much about the locale, and would be happier with C?

Consider a pre-existing script doing initdb --locale=C.UTF-8

Surely it does care about the locale, otherwise it would not
specify it. Assuming that it would be better off with C is assuming
that a non-Unicode aware locale is better than the Unicode-aware
locale they're asking. I don't think it's reasonable.

> The user can easily get libc behavior by specifying
> --locale-provider=libc, so I don't see how you reached this conclusion.

What would be user hostile is forcing users that don't need an ICU
locale to change their invocations of initdb/createdb to avoid
regressions with v16. Most people would discover this after it
breaks their apps.

> It looks like you are fine with 0003 applying LOCALE to whatever
> provider is chosen, but you'd like to be smarter about choosing the
> provider and to choose libc in at least some cases.
>
> That is actually very much like option #2 in the list I presented
> here[2], and has the same problems. How should the following behave?
>
>   initdb --locale=C --lc-collate=fr_FR.utf8
>   initdb --locale=en --lc-collate=fr_FR.utf8

The same as v15.

> If we switch to libc in the first case, then --locale will be ignored
> and the collation will be fr_FR.utf8.

$ initdb --locale=C --lc-collate=fr_FR.utf8

v15 does that:

  The database cluster will be initialized with this locale configuration:
    provider:    libc
    LC_COLLATE:  fr_FR.utf8
    LC_CTYPE:    C
    LC_MESSAGES: C
    LC_MONETARY: C
    LC_NUMERIC:  C
    LC_TIME:     C
  The default database encoding has accordingly been set to "SQL_ASCII".

--locale is not ignored, it's overridden for LC_COLLATE only.

> But we will leave the second case as ICU and the collation will be
> "en".

Yes. To me the rule for "ICU is the default" in v16 should be:
if the --locale argument points to a locale that we know ICU does
not provide, we fall back to the v15 behavior down to every detail,
otherwise we let ICU be the provider.

> You also suggested that we consider switching the provider to libc any
> time ICU doesn't support something. I'm not sure whether you meant a
> static list (C, C.UTF-8, POSIX, ...?) or some kind of dynamic test.

C, C.*, POSIX. I'm not sure if there are other cases.

> I'm also not clear whether you think we should abandon the built-in
> provider, or still select it for C/POSIX.

I see it as going in v17, because it came after feature freeze and
is not strictly necessary in v16.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
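
For illustration only, the provider-selection rule proposed in this message (fall back to libc when --locale names C, C.* or POSIX, otherwise let ICU be the provider) could be sketched roughly as below. This is not PostgreSQL code: the function name choose_default_provider and the constant LIBC_ONLY_LOCALES are hypothetical, the static list is just the three cases named in the message, and exact case-sensitive matching is an assumption.

```python
# Hypothetical sketch of the proposed rule; not taken from initdb's source.
# The list holds only the cases named in the message: C, C.*, POSIX.
LIBC_ONLY_LOCALES = ("C", "POSIX")


def choose_default_provider(locale_name: str) -> str:
    """Return "libc" when the locale is one ICU is assumed not to provide,
    otherwise "icu". Matching is exact and case-sensitive (an assumption)."""
    if locale_name in LIBC_ONLY_LOCALES:
        return "libc"
    # "C.*" covers names such as C.UTF-8.
    if locale_name.startswith("C."):
        return "libc"
    return "icu"


if __name__ == "__main__":
    for name in ("C", "C.UTF-8", "POSIX", "en", "fr_FR.utf8"):
        print(name, "->", choose_default_provider(name))
```

Under this sketch, the two initdb invocations quoted in the thread would resolve to libc for --locale=C and to ICU for --locale=en, which matches the answers given above. How --lc-collate and the other per-category options then apply is left out here.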