Re: Built-in CTYPE provider - Mailing list pgsql-hackers

From	Daniel Verite
Subject	Re: Built-in CTYPE provider
Date	December 20, 2023 12:49:20
Msg-id	dd9261f4-7a98-4565-93ec-336c1c110d90@manitou-mail.org
In response to	Re: Built-in CTYPE provider (Jeff Davis <pgsql@j-davis.com>)
Responses	Re: Built-in CTYPE provider Re: Built-in CTYPE provider
List	pgsql-hackers

    Jeff Davis wrote:

> But there are a lot of users for whom neither of those things are true,
> and it makes zero sense to order all of the text indexes in the
> database according to any one particular locale. I think these users
> would prioritize stability and performance for the database collation,
> and then use COLLATE clauses with ICU collations where necessary.

+1

> I am also still concerned that we have the wrong defaults. Almost
> nobody thinks libc is a great provider, but that's the default, and
> there were problems trying to change that default to ICU in 16. If we
> had a builtin provider, that might be a better basis for a default
> (safe, fast, always available, and documentable). Then, at least if
> someone picks a different locale at initdb time, they would be doing so
> intentionally, rather than implicitly accepting index corruption risks
> based on an environment variable.

Yes. The introduction of the bytewise-sorting, locale-agnostic
C.UTF-8 in glibc is also a step in the direction of providing better
defaults for apps like Postgres that need both long-term stability
in sorts and Unicode coverage for ctype-dependent functions.

But C.UTF-8 is not available everywhere, and there's still the
problem that Unicode updates through libc are not aligned
with Postgres releases.

ICU has the advantage of cross-OS compatibility,
but it does not provide any collation with bytewise sorting
like C or C.UTF-8, and we don't allow a combination like
"C" for sorting and ICU for ctype operations. When opting
for a locale provider, it has to be for both sorting
and ctype, so an installation that needs cross-OS
compatibility, good Unicode support and long-term stability
of indexes cannot get that with ICU as we expose it
today.

If the Postgres default were bytewise sorting plus locale-agnostic
ctype functions derived directly from the Unicode data files,
as opposed to libc/$LANG at initdb time, the main
annoyance would be that "ORDER BY textcol" would no
longer produce the human-favored sort order.
For the presentation layer, we would have to write, for instance,
 ORDER BY textcol COLLATE "unicode" for the root collation,
or name a collation for a specific language and country if needed.
But all the rest seems better, especially cross-OS compatibility,
truly immutable and faster indexes for fields that
don't require linguistic ordering, and alignment between Unicode
updates and Postgres updates.
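
To make the trade-off concrete, here is a rough sketch of what queries
might look like under such a default. The table name "words" is made up
for illustration, and the column is declared COLLATE "C" only to stand in
for a bytewise database default. It assumes a UTF-8 database on a
server built with ICU; the "unicode" collation is predefined since
Postgres 16, and "fr-FR-x-icu" is assumed to be among the ICU
collations imported at initdb time.

```sql
-- Hypothetical table; COLLATE "C" stands in for a bytewise default.
CREATE TEMP TABLE words (textcol text COLLATE "C");
INSERT INTO words VALUES ('b'), ('a'), ('B'), ('é');

-- Storage/index order: plain byte comparison, so uppercase ASCII
-- sorts before lowercase, and non-ASCII characters sort last.
SELECT textcol FROM words ORDER BY textcol;

-- Presentation layer: ask explicitly for the Unicode root collation.
SELECT textcol FROM words ORDER BY textcol COLLATE "unicode";

-- Or for a language/country-specific ICU collation where it matters
-- (assumed to exist in this installation).
SELECT textcol FROM words ORDER BY textcol COLLATE "fr-FR-x-icu";
```

An index on textcol would serve only the first query's ordering; the
other two sort at query time, which is the cost being accepted in
exchange for index stability.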

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
