On Tue, Aug 6, 2024 at 11:44 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> I assume that "1254" here is the code page.
> But you specified --encoding=UTF-8 above, so your default locale uses a
> different encoding than the template databases. I would expect that to
> cause problems if the template databases contain any characters where
> the encodings differ (such as "ü" in the locale name).
It's weird, but on Windows, PostgreSQL allows UTF-8 encoding with any
locale, which is how apparent contradictions like this one get through:
    /* See notes in createdb() to understand these tests */
    if (!(locale_enc == user_enc ||
          locale_enc == PG_SQL_ASCII ||
          locale_enc == -1 ||
    #ifdef WIN32
          user_enc == PG_UTF8 ||
    #endif
          user_enc == PG_SQL_ASCII))
    {
        pg_log_error("encoding mismatch");
... and createdb's comments say that is acceptable because:
     * 3. selected encoding is UTF8 and platform is win32. This is because
     * UTF8 is a pseudo codepage that is supported in all locales since it's
     * converted to UTF16 before being used.
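To make the effect of that test easier to see, here is a rough standalone sketch of the same rule. The enum values, the function name encoding_accepted() and the is_windows parameter (standing in for #ifdef WIN32) are all made up for illustration; this is not the real PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for PostgreSQL's encoding IDs; values are arbitrary. */
enum sketch_enc
{
    SKETCH_ENC_UNKNOWN = -1,    /* locale's encoding could not be determined */
    SKETCH_ENC_SQL_ASCII = 0,
    SKETCH_ENC_UTF8,
    SKETCH_ENC_WIN1252,
    SKETCH_ENC_WIN1254
};

/*
 * Returns true when the requested encoding would pass the test quoted
 * above.  is_windows replaces the compile-time WIN32 check so that both
 * behaviours can be compared on any platform.
 */
static bool
encoding_accepted(int locale_enc, int user_enc, bool is_windows)
{
    /* Sketch assumption: the requested encoding is already resolved. */
    assert(user_enc != SKETCH_ENC_UNKNOWN);

    if (locale_enc == user_enc)
        return true;            /* the ordinary matching case */
    if (locale_enc == SKETCH_ENC_SQL_ASCII)
        return true;
    if (locale_enc == SKETCH_ENC_UNKNOWN)
        return true;
    if (is_windows && user_enc == SKETCH_ENC_UTF8)
        return true;            /* the Windows-only special case */
    if (user_enc == SKETCH_ENC_SQL_ASCII)
        return true;
    return false;
}
```

With this sketch, a WIN1254 locale combined with UTF-8 is accepted when is_windows is true and rejected otherwise, which is exactly the contradiction being discussed.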
At the time PostgreSQL was ported to Windows, UTF-8 was not a
supported encoding in "char"-based system interfaces like strcoll_l(),
and the port had to convert to "wchar_t" interfaces and call (in that
example) wcscoll_l(). On modern Windows UTF-8 is supported by those
interfaces, and each locale has two names, with and without a ".UTF-8"
suffix (cf. glibc systems that
have "en_US" and "en_US.UTF-8" where the suffix-less version uses
whatever traditional encoding was used for that language before UTF-8
ate the world).
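For anyone who hasn't seen the pattern, the "convert to wchar_t, then call the wide function" approach looks roughly like the sketch below. It uses the portable mbstowcs() and wcscoll() under the current locale rather than the Windows-specific conversion and the _l variants, and the name compare_via_wide() and the fixed buffer size are invented for the example, so treat it as an illustration of the shape only.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

#define SKETCH_MAX 256          /* arbitrary buffer size for the sketch */

/*
 * Compare two multibyte strings by first converting each one to wide
 * characters and then using the wide collation function.  The extra
 * conversion on every comparison is the overhead in question.
 */
static int
compare_via_wide(const char *a, const char *b)
{
    wchar_t wa[SKETCH_MAX];
    wchar_t wb[SKETCH_MAX];
    size_t  na;
    size_t  nb;

    assert(a != NULL && b != NULL);

    na = mbstowcs(wa, a, SKETCH_MAX);
    nb = mbstowcs(wb, b, SKETCH_MAX);

    /* Fail loudly on invalid input or on strings too long to terminate. */
    assert(na != (size_t) -1 && na < SKETCH_MAX);
    assert(nb != (size_t) -1 && nb < SKETCH_MAX);

    return wcscoll(wa, wb);
}
```

A "char"-based path would instead hand the original bytes straight to strcoll(), with no intermediate buffers.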
If we were doing the Windows port today, we'd probably not have that
special case for Windows, and we wouldn't have the wchar_t
conversions. Then I think we'd allow only:
--locale=tr-TR (defaults to --encoding=WIN1254)
--locale=tr-TR --encoding=WIN1254
--locale=tr-TR.UTF-8
--locale=tr-TR.UTF-8 --encoding=UTF-8
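As a sketch of what that stricter rule could look like, assuming the encoding is implied by the locale name: a ".UTF-8" suffix means UTF-8, and otherwise a lookup gives the traditional code page. The function names and the two-entry lookup are hypothetical, and only cover the locales mentioned in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the locale name ends in ".UTF-8". */
static bool
has_utf8_suffix(const char *locale)
{
    static const char suffix[] = ".UTF-8";
    size_t  slen = sizeof(suffix) - 1;
    size_t  len;

    assert(locale != NULL);
    len = strlen(locale);
    return len > slen && strcmp(locale + len - slen, suffix) == 0;
}

/*
 * Encoding implied by a locale name, or NULL if this sketch does not
 * know the locale.  Only the examples from this thread are listed.
 */
static const char *
default_encoding(const char *locale)
{
    if (has_utf8_suffix(locale))
        return "UTF-8";
    if (strcmp(locale, "tr-TR") == 0)
        return "WIN1254";
    if (strcmp(locale, "en-US") == 0)
        return "WIN1252";
    return NULL;
}

/* encoding == NULL means --encoding was not given on the command line. */
static bool
combination_allowed(const char *locale, const char *encoding)
{
    const char *implied = default_encoding(locale);

    if (implied == NULL)
        return false;
    return encoding == NULL || strcmp(encoding, implied) == 0;
}
```

Under this rule the four combinations listed above pass, while --locale=tr-TR --encoding=UTF-8 does not.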
If we come up with an automated (or even manual but documented) way to
perform the "Turkish_Türkiye.1254" -> "tr-TR" upgrade as Dave was
suggesting upthread, we'll probably want to be careful to tidy up
these contradictory settings. For example I guess that American
databases initialised by EDB's installer must be using
--locale="English_United States.1252" and --encoding=UTF-8, and should
be changed to "en-US.UTF-8", while those initialised by letting
initdb.exe pick the encoding must be using --locale="English_United
States.1252" and --encoding=WIN1252 (implicit) and should be changed
to "en-US" to match the WIN1252 encoding.
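A rough sketch of that renaming step, under the assumption that the new name depends only on the old name's language/country part and on the database encoding. The mapping table, the function name upgrade_locale_name() and its signature are invented here; a real upgrade path would need a complete table and would have to verify that a non-UTF-8 encoding really matches the locale's code page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mapping from old-style base names to new-style names. */
struct locale_map
{
    const char *old_base;
    const char *new_base;
};

static const struct locale_map sketch_map[] = {
    {"English_United States", "en-US"},
    {"Turkish_Türkiye", "tr-TR"},   /* assumes a UTF-8 source file */
};

/*
 * Write the new-style locale name for old_name into out.  The code page
 * suffix (e.g. ".1252") is dropped, and ".UTF-8" is appended only when
 * the database encoding is UTF-8.  Returns false for unknown names or
 * if out is too small.
 */
static bool
upgrade_locale_name(const char *old_name, const char *encoding,
                    char *out, size_t outlen)
{
    const char *dot;
    size_t      base_len;
    size_t      i;

    assert(old_name != NULL && encoding != NULL && out != NULL);

    dot = strrchr(old_name, '.');
    base_len = dot ? (size_t) (dot - old_name) : strlen(old_name);

    for (i = 0; i < sizeof(sketch_map) / sizeof(sketch_map[0]); i++)
    {
        const char *base = sketch_map[i].old_base;

        if (strlen(base) == base_len &&
            strncmp(base, old_name, base_len) == 0)
        {
            bool    utf8 = strcmp(encoding, "UTF-8") == 0;
            int     n = snprintf(out, outlen, "%s%s",
                                 sketch_map[i].new_base,
                                 utf8 ? ".UTF-8" : "");

            return n >= 0 && (size_t) n < outlen;
        }
    }
    return false;
}
```

So the EDB-installer case ("English_United States.1252" with UTF-8) and the initdb-default case (same locale with WIN1252) end up with different new names, as described above.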
Only if we did that update would we be able to consider removing the
extra UTF-16 conversions that are happening very frequently inside
PostgreSQL code, which is a waste of CPU cycles and programmer sanity.
(But that's all just speculation from studying the locale code -- I've
never really used Windows.)