Re: Trying out native UTF-8 locales on Windows - Mailing list pgsql-hackers
| From | Thomas Munro |
|---|---|
| Subject | Re: Trying out native UTF-8 locales on Windows |
| Date | |
| Msg-id | CA+hUKG+W_=qQa0Zt=ajuvzEoup2VAp+8Evhr6EkpMtKkXJXm_g@mail.gmail.com |
| In response to | Re: Trying out native UTF-8 locales on Windows (Bryan Green <dbryan.green@gmail.com>) |
| List | pgsql-hackers |
On Fri, Jan 2, 2026 at 5:25 PM Bryan Green <dbryan.green@gmail.com> wrote:
> The patch is correct, and the new strcoll_l() path is 10-25% faster than
> the current wcscoll_l() conversion path. Whether UTF-8 locale is faster
> or slower than WIN1252 depends on string length and content - but users
> choosing UTF-8 locales presumably want Unicode support, not WIN1252
> compatibility.
Thanks Bryan! This all sounds quite promising!
I can see three future pathways for this line of work:
1. We just do this opportunistically, ie when locale encoding happens
to be UTF-8, as in this patch, and call it a day, keeping the
mismatched encoding support indefinitely.
2. We additionally formalise it: there could be a build option
PG_ALLOW_MISMATCHED_ENCODINGS, true for Windows and false for Unix by
default, but a Unix developer could enable it to test that mode of
operation. We could use our own transcoding functions instead of the
Windows ones so that the code is identical on all platforms, ie
testable by all for general project sanity.
3. We decide we want to kill support for mismatched encodings. I
guess that means that the upgrade path for existing clusters would
involve switching your existing locale names eg in pg_database and
pg_collation from eg "English_United States.1252" or "en-US" to
"en-US.UTF-8", for databases that are using UTF-8 encoding. I don't
know how exactly that should be done, ie manually before pg_upgrade,
by pg_upgrade itself, or something else. On-the-fly translation would
also be possible but probably a bit too magical. In this pathway we
get to delete these code differences entirely, and I think our general
encoding and portability situation would be greatly improved.
I am in favour of #3 at least eventually, and I posted patches to try
that out a few years ago[1]. If we can't agree to do that, my next
preference would be #2. What do you think? #3 would be inflicting
one-time pain on users. Would it be worth it? (And if you're
interested in the general topic of encodings and portability, there
are many more problems to solve, working on those[2]...).
One small thing that would help reduce the pain of #3, if we think
there's any chance we're going to go that way, would be to ask the EDB
installer team to stop listing locales with mismatched encodings in
their GUI for initdb, for the benefit of new clusters being born in
the wild today.
An approach for existing clusters could be to produce a file
upgrade-windows-locale-encodings.sql that could be run against a
cluster, containing UPDATE pg_database SET datlocale = 'en-US.UTF-8'
WHERE datencoding = 6 AND datlocale IN ('English_United States.1252',
'en-US') and so on, extracted from some official source, or something
like that? Or maybe it would just spit out the UPDATE statements for
human review and execution. I assume/hope those locales are really
the "same" in every important respect. One minor problem to think
about is that the historical "English_..." names don't report a
version, so that field is empty. I'm less sure what the situation is
with pg_collation. It is initially populated by
pg_import_system_collations() which iterates over the system locales
(it doesn't actually seem that wise to me that it translates "-" to
"_" in those names, what is the point of messing with the true
system-defined names?!) and I don't know off the top of my head if it
finishes up with both "en-US" (presumably implying .1252 encoding) and
"en-US.UTF-8" entries. If so it's less obvious how to upgrade
automatically, perhaps by ALTERing any columns to use the one that
points to "en-US.UTF-8" instead of the one that points to "en-US"
where appropriate, but perhaps a simple SQL query could find simple
cases and recommend the ALTER statements that would achieve that?
More complicated places where COLLATE might be hiding would of course
be a Turing tarpit beyond analysis. Presumably pg_upgrade would
simply fail if you don't upgrade the source database in this way
first, and tell you why.
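
To make the "spit out the UPDATE statements for human review" idea concrete, here is a minimal sketch in Python. Everything named in it is an assumption for illustration: the LEGACY_TO_UTF8 table is hypothetical and contains only the one pair mentioned above (a real table would have to be extracted from some official source), and the column name datlocale is taken from this message as written. Depending on the server version, libc locale names may instead live in datcollate and datctype, so the column is a parameter. The script only prints statements; it does not connect to or modify anything.

```python
# Hypothetical mapping from a UTF-8 locale name to the legacy Windows
# names it would replace. Only the pair named in the message is listed;
# a real table would need an authoritative source.
LEGACY_TO_UTF8 = {
    "en-US.UTF-8": ("English_United States.1252", "en-US"),
}

PG_UTF8 = 6  # datencoding value for UTF8, as used in the message


def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def generate_updates(mapping=LEGACY_TO_UTF8, column="datlocale"):
    """Return one UPDATE statement per target locale, for human review."""
    statements = []
    for new_name, old_names in sorted(mapping.items()):
        old_list = ", ".join(quote_literal(name) for name in old_names)
        statements.append(
            f"UPDATE pg_database SET {column} = {quote_literal(new_name)} "
            f"WHERE datencoding = {PG_UTF8} AND {column} IN ({old_list});"
        )
    return statements


if __name__ == "__main__":
    # Print the statements rather than executing them, so a human can
    # review them before running anything against a cluster.
    for statement in generate_updates():
        print(statement)
```

Restricting each statement to datencoding = 6 keeps it from touching databases whose encoding really is WIN1252, which is the point of only rewriting the mismatched cases.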
[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGJ%3Dca39Cg%3Dy%3DS89EaCYvvCF8NrZRO%3Duog-cnz0VzC6Kfg%40mail.gmail.com#6158915417859e029d60456312fd1cc7
[2]
https://www.postgresql.org/message-id/flat/CA%2BhUKGLrx02%3DwT647oYz7x0GW%3DPyQdma0s4U6w9h%3DsuFjcciTw%40mail.gmail.com