Jeff Davis wrote:
> But there are a lot of users for whom neither of those things are true,
> and it makes zero sense to order all of the text indexes in the
> database according to any one particular locale. I think these users
> would prioritize stability and performance for the database collation,
> and then use COLLATE clauses with ICU collations where necessary.
+1
> I am also still concerned that we have the wrong defaults. Almost
> nobody thinks libc is a great provider, but that's the default, and
> there were problems trying to change that default to ICU in 16. If we
> had a builtin provider, that might be a better basis for a default
> (safe, fast, always available, and documentable). Then, at least if
> someone picks a different locale at initdb time, they would be doing so
> intentionally, rather than implicitly accepting index corruption risks
> based on an environment variable.
Yes. The introduction of the bytewise-sorting, locale-agnostic
C.UTF-8 in glibc is also a step in the direction of providing better
defaults for apps like Postgres that need both long-term stability
in sorts and Unicode coverage for ctype-dependent functions.
But C.UTF-8 is not available everywhere, and there's still the
problem that Unicode updates through libc are not aligned
with Postgres releases.
ICU has the advantage of cross-OS compatibility,
but it does not provide any collation with bytewise sorting
like C or C.UTF-8, and we don't allow a combination like
"C" for sorting and ICU for ctype operations. When opting
for a locale provider, it has to be for both sorting
and ctype, so an installation that needs cross-OS
compatibility, good Unicode support and long-term stability
of indexes cannot get that with ICU as we expose it
today.
If the Postgres default were bytewise sorting plus locale-agnostic
ctype functions directly derived from Unicode data files,
as opposed to libc/$LANG at initdb time, the main
annoyance would be that "ORDER BY textcol" would no
longer be the human-favored sort.
For the presentation layer, we would have to write, for instance,
ORDER BY textcol COLLATE "unicode" for the root collation,
or name a language- and region-specific collation if needed.
But all the rest seems better, especially cross-OS compatibility,
truly immutable and faster indexes for fields that
don't require linguistic ordering, and alignment between Unicode
updates and Postgres updates.
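To illustrate the ORDER BY ... COLLATE point, here is a rough sketch
of what queries could look like under such a default. The table "words"
and its contents are hypothetical. It assumes an ICU-enabled build where
the predefined "unicode" collation (Postgres 16 or later) and the
initdb-imported "fr-CA-x-icu" collation are present, and it uses an
explicit COLLATE "C" on the column to stand in for a bytewise database
default.

```sql
-- Hypothetical table for illustration only.
-- COLLATE "C" on the column stands in for a bytewise database default.
CREATE TABLE words (textcol text COLLATE "C");
INSERT INTO words VALUES ('banana'), ('Zebra'), ('apple'), ('Éclair');

-- Bytewise order from the column collation: uppercase ASCII sorts
-- before lowercase, and non-ASCII characters sort last.
-- Zebra, apple, banana, Éclair
SELECT textcol FROM words ORDER BY textcol;

-- The presentation layer asks for linguistic order explicitly,
-- here with the root collation.
-- apple, banana, Éclair, Zebra
SELECT textcol FROM words ORDER BY textcol COLLATE "unicode";

-- Or with a language- and region-specific ICU collation, if available.
SELECT textcol FROM words ORDER BY textcol COLLATE "fr-CA-x-icu";
```

An index on textcol would follow the "C" ordering and so would not
depend on libc or ICU versions; only the queries that name a linguistic
collation pay for it.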
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite