Re: Change initdb default to the builtin collation provider - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Change initdb default to the builtin collation provider
Date
Msg-id 4309879ac305b1cf6b4d7b5fb85bc7b62c6ab768.camel@j-davis.com
In response to Re: Change initdb default to the builtin collation provider  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, 2026-03-10 at 11:12 -0400, Robert Haas wrote:
> I don't know if this is exactly the right proposal, but I think it's
> probably appropriate to start gently pushing people towards UTF-8
> rather than anything else. Unicode has largely won, AFAICT, and the
> use cases for anything else are increasingly narrow. I don't think we
> should try to be coercive, but there's a reasonable presumption that
> people who haven't said what they want probably want UTF8.

If their environment's LC_CTYPE is UTF8-based, they already get UTF-8.
If it isn't, we can either:

(a) Fall back to LC_CTYPE=C, which is the only UTF8-compatible locale
available everywhere. C is not a terrible fallback: it affects few
things, because I have moved almost everything to use the database
default locale.

(b) Warn or error unless they explicitly specify the encoding with -E.
But the former is likely to be ignored and the latter is not what I'd
call "gentle".

Which of these do you think is the right approach?

There's a narrower question about what we do with LC_CTYPE=C. Currently
we use SQL_ASCII encoding, which doesn't seem like a great default, and
we could change that to default to UTF8. There's also the question of
whether we change the meaning of --no-locale.
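
As a rough illustration of the choices above (not initdb's actual code), the decision could be modeled as below. The function names, the `policy` parameter, and the return shape are all hypothetical; the sketch also skips the encoding/locale compatibility checks a real implementation would need, and leaves out the "warn" variant of (b).

```python
from typing import Optional, Tuple


def codeset_of(lc_ctype: str) -> Optional[str]:
    """Return the codeset of a POSIX-style locale name, or None if absent.

    Locale names look like language[_territory][.codeset][@modifier].
    """
    name = lc_ctype.split("@", 1)[0]      # drop any @modifier
    if "." not in name:
        return None                        # e.g. "C" or "en_US"
    return name.split(".", 1)[1]


def is_utf8(codeset: Optional[str]) -> bool:
    """True for spellings such as 'UTF-8', 'utf8', 'UTF8'."""
    return codeset is not None and codeset.replace("-", "").upper() == "UTF8"


def choose_defaults(lc_ctype: str,
                    explicit_encoding: Optional[str] = None,
                    policy: str = "fallback") -> Tuple[str, str]:
    """Hypothetical model: return (encoding, lc_ctype) for a new cluster."""
    if explicit_encoding is not None:
        # The user said what they want (-E); nothing to decide here.
        return (explicit_encoding, lc_ctype)
    if is_utf8(codeset_of(lc_ctype)):
        # A UTF8-based environment already yields UTF-8.
        return ("UTF8", lc_ctype)
    if policy == "fallback":
        # Option (a): use UTF8 with LC_CTYPE=C. This also covers the
        # narrower LC_CTYPE=C case if its default moved off SQL_ASCII.
        return ("UTF8", "C")
    if policy == "error":
        # Option (b), error variant: require an explicit encoding.
        raise ValueError("LC_CTYPE is not UTF8-based; specify an encoding")
    raise ValueError("unknown policy: " + policy)


print(choose_defaults("en_US.UTF-8"))
print(choose_defaults("de_DE.ISO-8859-1"))
```
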

> I'm much less convinced about this idea. I think the number of people
> who will be unhappy about the less-user-friendly sort order changes is
> probably quite high. It's reasonable to want something more stable and
> better version-controlled than libc, but switching to a simple
> code-point sort seems like a high price to pay for that.

Surely inconsistent indexes and poor performance are also a high price,
so how do you weigh the prices against each other?
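
For readers less familiar with the term, the "simple code-point sort" in the quoted text compares strings by Unicode code point, which for UTF-8 data gives the same order as a bytewise (memcmp-style) comparison. A minimal Python sketch of what that looks like; the word list is made up, and `str.casefold` is only a crude stand-in for the linguistic rules a libc or ICU collation would apply:

```python
# Made-up sample data; "\u00e9mile" is "emile" with an acute accent on the e.
words = ["apple", "Banana", "cherry", "Zebra", "\u00e9mile"]

# Python compares str values by code point, so this is a code-point sort.
code_point_order = sorted(words)

# Sorting by the UTF-8 bytes gives the same order, because UTF-8
# preserves code-point order under bytewise comparison.
utf8_byte_order = sorted(words, key=lambda s: s.encode("utf-8"))

# Crude approximation of a friendlier order: ignore case only.
# Real collations also handle accents, digits, punctuation, and more.
casefold_order = sorted(words, key=str.casefold)

print(code_point_order)   # uppercase first, then lowercase, then accented
print(utf8_byte_order)
print(casefold_order)
```

The code-point result puts "Banana" and "Zebra" ahead of "apple", which is the kind of ordering the quoted concern is about; the trade-off is that it needs no locale data and cannot change when libc is upgraded.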

We sweat over single-digit performance regressions in fairly specific
cases all the time, but here we're 3X slower for index builds:

https://www.depesz.com/2024/06/11/how-much-speed-youre-leaving-at-the-table-if-you-use-default-locale/

and 2-5X slower for Sort:

https://www.postgresql.org/message-id/64039a2dbcba6f42ed2f32bb5f0371870a70afda.camel@j-davis.com

and others don't seem very concerned, so I feel like I'm missing
something.

Regards,
    Jeff Davis



