Thread: Windows default locale vs initdb
Hi,

Moving this topic into its own thread from the one about collation versions, because it concerns pre-existing problems, and that thread is long.

Currently initdb sets up template databases with old-style Windows locale names reported by the OS, and they seem to have caused us quite a few problems over the years:

db29620d "Work around Windows locale name with non-ASCII character."
aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."

... and probably more, and also various threads about, for example, "German_German.1252" vs "German_Switzerland.1252", which seem to get confused or badly canonicalised or rejected somewhere in the mix.

I hadn't focused on any of that before, being a non-Windows-user, but the entire contents of win32setlocale.c supports the theory that Windows' manual meant what it said when it said[1]:

"We do not recommend this form for locale strings embedded in code or serialized to storage, because these strings are more likely to be changed by an operating system update than the locale name form."

I suppose that was the only form available at the time the code was written, so there was no choice. The question we asked ourselves multiple times in the other thread was how we're supposed to get to the modern BCP 47 form when creating the template databases. It looks like one possibility, since Vista, is to call GetUserDefaultLocaleName()[2], which doesn't appear to have been discussed before on this list. That doesn't allow you to ask for the default for each individual category, but I don't know if that is even a concept for Windows user settings. It may be that some of the other nearby functions give a better answer for some reason. But one thing is clear from a test that someone kindly ran for me: it reports standardised strings like "en-NZ", not strings like "English_New Zealand.1252".

No patch, but I wondered if any Windows hackers have any feedback on relative sanity of trying to fix all these problems this way.

[1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-160
[2] https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getuserdefaultlocalename
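To make the proposal above concrete, here is a minimal sketch in C of asking Windows for the user's default locale name in BCP 47 form. It is not taken from any patch in this thread: the helper name get_default_bcp47_locale() is hypothetical, and the non-Windows branch exists only so the file compiles elsewhere. The two Windows calls used, GetUserDefaultLocaleName() and WideCharToMultiByte(), are the documented Win32 APIs.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical helper (not PostgreSQL code): copy the user's default locale
 * name in BCP 47 form (for example "en-NZ") into buf, encoded as UTF-8.
 * Returns 1 on success, 0 on failure or when built for a non-Windows target.
 */
int
get_default_bcp47_locale(char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';

#ifdef _WIN32
    WCHAR wname[LOCALE_NAME_MAX_LENGTH];

    /* Returns the length including the terminator, or 0 on failure. */
    if (GetUserDefaultLocaleName(wname, LOCALE_NAME_MAX_LENGTH) == 0)
        return 0;

    /* Convert UTF-16 to UTF-8; a return value of 0 means failure. */
    if (WideCharToMultiByte(CP_UTF8, 0, wname, -1, buf, (int) buflen,
                            NULL, NULL) == 0)
    {
        buf[0] = '\0';
        return 0;
    }
    return 1;
#else
    /* GetUserDefaultLocaleName() exists only on Windows (Vista and later). */
    return 0;
#endif
}
```

A caller would pass a buffer of at least LOCALE_NAME_MAX_LENGTH bytes and fall back to some other default when the function returns 0.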
On Mon, Apr 19, 2021 at 7:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> No patch, but I wondered if any Windows hackers have any feedback on
> relative sanity of trying to fix all these problems this way.
Last weekend I talked with a user about an interesting (and messy) issue. They needed to create a new database with Czech collation on Azure SAS. There was no entry in pg_collation for the Czech language. The reply from Microsoft support was to use CREATE DATABASE xxx TEMPLATE 'template0' ENCODING 'utf8' LOCALE 'cs_CZ.UTF8', and that worked.
Regards
Pavel
On Mon, Apr 19, 2021 at 4:53 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Last weekend I talked with a user about an interesting (and messy) issue. They needed to create a new database with Czech collation on Azure SAS. There was no entry in pg_collation for the Czech language. The reply from Microsoft support was to use CREATE DATABASE xxx TEMPLATE 'template0' ENCODING 'utf8' LOCALE 'cs_CZ.UTF8', and that worked.
My understanding from Microsoft staff at conferences is that Azure's PostgreSQL SAS runs on Linux, not Windows.
cheers
andrew
On Mon, Apr 19, 2021 at 12:52 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> My understanding from Microsoft staff at conferences is that Azure's PostgreSQL SAS runs on Linux, not Windows.
I had different information, but something was still wrong, because there were no Czech locales in pg_collation.
On Mon, Apr 19, 2021 at 11:52 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> My understanding from Microsoft staff at conferences is that Azure's PostgreSQL SAS runs on Linux, not Windows.
This is from a regular Azure Database for PostgreSQL single server:
postgres=> select version();
version
------------------------------------------------------------
PostgreSQL 11.6, compiled by Visual C++ build 1800, 64-bit
(1 row)
And this is from the new Flexible Server preview:
postgres=> select version();
version
-----------------------------------------------------------------------------------------------------------------
PostgreSQL 12.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
(1 row)
So I guess it's a case of "it depends".
On 4/19/21 10:26 AM, Dave Page wrote:
> So I guess it's a case of "it depends".

Good to know. A year or two back at more than one conference I tried to enlist some of these folks in helping us with Windows PostgreSQL, and their reply was that they knew nothing about it because they were on Linux :-) I guess things change over time.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 19.04.21 07:42, Thomas Munro wrote:
> It looks
> like one possibility, since Vista, is to call
> GetUserDefaultLocaleName()[2], which doesn't appear to have been
> discussed before on this list. That doesn't allow you to ask for the
> default for each individual category, but I don't know if that is even
> a concept for Windows user settings.

pg_newlocale_from_collation() doesn't support collcollate != collctype on Windows anyway, so that wouldn't be an issue.
On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:
> Currently initdb sets up template databases with old-style Windows
> locale names reported by the OS, and they seem to have caused us quite
> a few problems over the years:
>
> db29620d "Work around Windows locale name with non-ASCII character."
> aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
> db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
> 9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."

> I suppose that was the only form available at the time the code was
> written, so there was no choice.

Right.

> The question we asked ourselves
> multiple times in the other thread was how we're supposed to get to
> the modern BCP 47 form when creating the template databases. It looks
> like one possibility, since Vista, is to call
> GetUserDefaultLocaleName()[2]

> No patch, but I wondered if any Windows hackers have any feedback on
> relative sanity of trying to fix all these problems this way.

Sounds reasonable. If PostgreSQL v15 would otherwise run on Windows Server 2003 R2, this is a good time to let that support end.
On Sun, May 16, 2021 at 6:29 AM Noah Misch <noah@leadboat.com> wrote:
> On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:
>> The question we asked ourselves
>> multiple times in the other thread was how we're supposed to get to
>> the modern BCP 47 form when creating the template databases. It looks
>> like one possibility, since Vista, is to call
>> GetUserDefaultLocaleName()[2]
>> No patch, but I wondered if any Windows hackers have any feedback on
>> relative sanity of trying to fix all these problems this way.
> Sounds reasonable. If PostgreSQL v15 would otherwise run on Windows Server
> 2003 R2, this is a good time to let that support end.
The value returned by GetUserDefaultLocaleName() is a system-configured parameter, independent of what you set with setlocale(). It might be reasonable for initdb, but in most cases not for a backend.
You can get the POSIX-ish locale name using GetLocaleInfoEx(), but this is no longer recommended, because using LCIDs is no longer recommended [1]. It would still work for legacy locales, though. Please find attached a POC patch showing this approach.
Regards,
Juan José Santamaría Flecha
Attachment
On Wed, Dec 15, 2021 at 11:32 PM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:
> The value returned by GetUserDefaultLocaleName() is a system configured parameter, independent of what you set with setlocale(). It might be reasonable for initdb but not for a backend in most cases.

Agreed. Only for initdb, and only if you didn't specify a locale name on the command line.

> You can get the locale POSIX-ish name using GetLocaleInfoEx(), but this is no longer recommended, because using LCIDs is no longer recommended [1]. Although, this would work for legacy locales. Please find attached a POC patch showing this approach.

Now that museum-grade Windows has been defenestrated, we are free to call GetUserDefaultLocaleName(). Here's a patch.

One thing you did in your patch that I disagree with, I think, was to convert a BCP 47 name to a POSIX name early, that is, s/-/_/. I think we should use the locale name exactly as Windows (really, under the covers, ICU) spells it. There is only one place in the tree today that really wants a POSIX locale name, and that's LC_MESSAGES, accessed by GNU gettext, not Windows. We already had code to cope with that.

I think we should also convert to POSIX format when making the collname in your pg_import_system_collations() proposal, so that COLLATE "en_US" works (= a SQL identifier), but that's another thread[1]. I don't think we should do it in collcollate or datcollate, which is a string for the OS to interpret.

With my garbage collector hat on, I would like to rip out all of the support for traditional locale names, eventually. Deleting kludgy code is easy and fun -- 0002 is a first swing at that -- but there remains an important unanswered question. How should someone pg_upgrade a "English_Canada.1521" cluster if we now reject that name? We'd need to do a conversion to "en-CA", or somehow tell the user to. Hmmmm.

[1] https://www.postgresql.org/message-id/flat/CAC%2BAXB0WFjJGL1n33bRv8wsnV-3PZD0A7kkjJ2KjPH0dOWqQdg%40mail.gmail.com
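The s/-/_/ conversion mentioned above, which the message argues should happen only where a POSIX-style name is really needed (LC_MESSAGES for gettext, or a collname), can be sketched as follows. This is an illustration under stated assumptions, not the code from either patch: the function name bcp47_to_posix_name() is hypothetical, and it handles only plain language-region tags such as "en-US". Tags with script subtags or an encoding suffix would need more than a character substitution.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy a BCP 47 name such as "en-US" into dst with every
 * '-' replaced by '_', giving the POSIX-style spelling "en_US".
 * Returns 1 on success, 0 if dst is too small to hold the result.
 */
int
bcp47_to_posix_name(const char *src, char *dst, size_t dstlen)
{
    size_t      len;

    assert(src != NULL && dst != NULL);

    len = strlen(src);
    if (len + 1 > dstlen)
        return 0;

    /* i <= len so that the terminating NUL is copied as well. */
    for (size_t i = 0; i <= len; i++)
        dst[i] = (src[i] == '-') ? '_' : src[i];

    return 1;
}
```

Keeping the conversion in a helper that is called only at the point of use leaves the stored name (collcollate, datcollate) in the spelling the OS reported.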
Attachment
On Tue, Jul 19, 2022 at 10:58 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a patch.

I added this to the next commitfest, and cfbot promptly told me about some warnings I needed to fix. That'll teach me to post a patch tested with "ci-os-only: windows". Looking more closely at some error messages that report GetLastError() where I'd mixed up %d and %lu, I see also that I didn't quite follow existing conventions for wording when reporting Windows error numbers, so I fixed that too.

In the "startcreate" step on CI you can see that it says:

The database cluster will be initialized with locale "en-US".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".

As for whether "accordingly" still applies, by the logic of win32_langinfo()... Windows still considers WIN1252 to be the default ANSI code page for "en-US", though it'd work with UTF-8 too. I'm not sure what to make of that. The goal here was to give Windows users good defaults, but WIN1252 is probably not what most people actually want. Hmph.
Attachment
On Tue, Jul 19, 2022 at 12:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Now that museum-grade Windows has been defenestrated, we are free to
> call GetUserDefaultLocaleName(). Here's a patch.
This LGTM.
> I think we should also convert to POSIX format when making the
> collname in your pg_import_system_collations() proposal, so that
> COLLATE "en_US" works (= a SQL identifier), but that's another
> thread[1]. I don't think we should do it in collcollate or
> datcollate, which is a string for the OS to interpret.
That thread has been split [1], but that is how the current version behaves.
> With my garbage collector hat on, I would like to rip out all of the
> support for traditional locale names, eventually. Deleting kludgy
> code is easy and fun -- 0002 is a first swing at that -- but there
> remains an important unanswered question. How should someone
> pg_upgrade a "English_Canada.1521" cluster if we now reject that name?
> We'd need to do a conversion to "en-CA", or somehow tell the user to.
> Hmmmm.
Is there a safe way to do that in pg_upgrade or would we be forcing users to pg_dump into the new cluster?
[1] https://www.postgresql.org/message-id/flat/0050ec23-34d9-2765-9015-98c04f0e18ac%40postgrespro.ru
Regards,
Juan José Santamaría Flecha
On Tue, Jul 19, 2022 at 4:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> As for whether "accordingly" still applies, by the logic of
> win32_langinfo()... Windows still considers WIN1252 to be the default
> ANSI code page for "en-US", though it'd work with UTF-8 too. I'm not
> sure what to make of that. The goal here was to give Windows users
> good defaults, but WIN1252 is probably not what most people actually
> want. Hmph.
Still, WIN1252 is not the wrong answer for what we are asking. Even if you enable UTF-8 support [1], the system will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
[1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170
Regards,
Juan José Santamaría Flecha
On Wed, Jul 20, 2022 at 10:27 PM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:
> Still, WIN1252 is not the wrong answer for what we are asking. Even if you enable UTF-8 support [1], the system will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.

I'm still confused about what that means. Suppose we decided to insist by adding a ".UTF-8" suffix to the name, as that page says we can now that we're on Windows 10+, when building the default locale name (see experimental 0002 patch, attached). It initially seemed to have the right effect:

The database cluster will be initialized with locale "en-US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

But then the Turkish i test in contrib/citext/sql/citext_utf8.sql failed[1]:

SELECT 'i'::citext = 'İ'::citext AS t;
t
---
- t
+ f
(1 row)

About the pg_upgrade problem, maybe it's OK ... existing old format names should continue to work, but we can still remove the weird code that does locale name tweaking, right? pg_upgraded databases should contain fixed names (ie that were fixed by old initdb so should continue to work), and new clusters will get BCP 47 names.

I don't really know, I was just playing with rough ideas by sending patches to CI here...

[1] https://cirrus-ci.com/task/6423238052937728
Attachment
On Wed, Jul 20, 2022 at 1:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Jul 20, 2022 at 10:27 PM Juan José Santamaría Flecha
> <juanjo.santamaria@gmail.com> wrote:
>> Still, WIN1252 is not the wrong answer for what we are asking. Even if you enable UTF-8 support [1], the system will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
> I'm still confused about what that means. Suppose we decided to
> insist by adding a ".UTF-8" suffix to the name, as that page says we
> can now that we're on Windows 10+, when building the default locale
> name (see experimental 0002 patch, attached). It initially seemed to
> have the right effect:
> The database cluster will be initialized with locale "en-US.UTF-8".
> The default database encoding has accordingly been set to "UTF8".
> The default text search configuration will be set to "english".
Let me try to explain this using the "Beta: Use Unicode UTF-8 for worldwide language support" option [1].
- Currently in a system with the language settings of "English_United States" and that option disabled, when executing initdb you get:
The database cluster will be initialized with locale "English_United States.1252".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".
And as a test for psql:
SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
ERROR: character with byte sequence 0xc5 0x9f in encoding "UTF8" has no equivalent in encoding "WIN1252"
We get this error even if the database encoding is UTF8; it is caused by the tr_tr locale being encoded in WIN1254. We can discuss this in another thread, and I can propose a patch.
- If we enable the UTF-8 support option, then the same test goes as:
The database cluster will be initialized with locale "English_United States.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
And for psql:
SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
to_char
---------
şubat
(1 row)
In this case the Windows locales are actually UTF8 encoded.
TL;DR: What I want to show with this example is that the Windows ACP is not modified by setlocale(); it can only be changed through the Windows registry, and only in recent releases.
> But then the Turkish i test in contrib/citext/sql/citext_utf8.sql failed[1]:
> SELECT 'i'::citext = 'İ'::citext AS t;
> t
> ---
> - t
> + f
> (1 row)
This is the current state of affairs:
- Windows:
SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
ı | i | I | i | İ | İ
- Linux:
SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
ı | i | I | i | İ | i
latin_capital_dotted (U+0130) doesn't have the same lower() value on both platforms: in the output above it stays "İ" on Windows but becomes "i" on Linux.
Regards,
Juan José Santamaría Flecha
On Fri, Jul 22, 2022 at 11:59 PM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:
> TL;DR; What I want to show through this example is that Windows ACP is not modified by setlocale(), it can only be done through the Windows registry and only in recent releases.

Thanks, that was helpful, and so was that SO link.

So it sounds like I should forget about the v3-0002 patch, but the v3-0001 and v3-0003 patches might have a future. And it sounds like we might need to investigate maybe defending ourselves against the ACP being different than what we expect (ie not matching the database encoding)? Did I understand correctly that you're looking into that?
On Fri, Jul 29, 2022 at 3:33 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> So it sounds like I should forget about the v3-0002 patch, but the
> v3-0001 and v3-0003 patches might have a future. And it sounds like
> we might need to investigate maybe defending ourselves against the ACP
> being different than what we expect (ie not matching the database
> encoding)? Did I understand correctly that you're looking into that?

I'm going to withdraw this entry. The sooner we get something like 0001 into a release, the sooner the world will be rid of PostgreSQL clusters initialised with the bad old locale names that the manual very clearly tells you not to use for databases.... but I don't understand this ACP/registry vs database encoding stuff and how it relates to the use of BCP47 locale names, which puts me off changing anything until we do.
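As a rough illustration of what "defending ourselves against the ACP being different than what we expect" could look like, the sketch below compares the process's active ANSI code page with an expected code page number. Nothing like this appears in the patches discussed here: acp_matches_codepage() is a hypothetical name, the mapping from a database encoding to a code page number is left to the caller, and the non-Windows branch is only there so the file compiles elsewhere. GetACP() is the documented Win32 call.

```c
#include <assert.h>

#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical check: does the active ANSI code page (ACP) equal the code
 * page we expect, for example 1252 for WIN1252 or 65001 for UTF-8?
 * Returns 1 on match, 0 on mismatch, -1 if it cannot be determined
 * (non-Windows build).
 */
int
acp_matches_codepage(unsigned int expected)
{
    assert(expected != 0);

#ifdef _WIN32
    /* GetACP() reports the system setting; setlocale() does not change it. */
    return GetACP() == expected ? 1 : 0;
#else
    (void) expected;
    return -1;
#endif
}
```

Whether a mismatch should be an error, a warning, or ignored is exactly the open question in the message above; the sketch only shows where the two values would be compared.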
Another country has changed its name, and a Windows OS update has again broken every PostgreSQL cluster in that whole country[1] (or at least those that had accepted initdb's default choice of locale, probably most). Let's get to the bottom of this, because otherwise it is simply going to keep happening, causing administrative pain for a lot of people.

Here is a rebase of the basic patch I proposed last time, and a re-statement of what we know:

1. initdb chooses a default locale using a technique that gives you an unstable ("Czech Republic"->"Czechia", "Turkey"->"Türkiye"), non-ASCII ("Norwegian (Bokmål)") string that we are warned we should not store anywhere. We store it, and then later it is not recognised. Instead we should select an IETF BCP 47 locale name, based on stable ISO country and language codes, like "en-US", "tr-TR" etc. Here is the patch to teach initdb to use that, unchanged from v3 except that I tweaked the docs a bit.

2. In Windows 10+ it is now also possible to put ".UTF-8" on the end of locale names. I couldn't figure out whether we should do that, and what effect it has on ctypes -- apparently not the effect I expected (see upthread). Was our UTF-8 support on Windows already broken, and this new ".UTF-8" thing is just a new way to reach that brokenness? Is it OK to continue to choose the "legacy" single byte encodings by default on that OS, and consider that a separate topic for separate research?

3. It is not clear to me how we should deal with pg_upgrade. Eventually we want all of the old-school names to fade away, and pg_upgrade would need to be part of that. Perhaps there is some API that can be used to translate to the new canonical forms without us having to maintain translation tables and other messiness in our tree.

4. Eventually we should probably ban non-ASCII characters from entering the relevant catalogues (they are shared, so their encoding is undefined except that they must be a superset of ASCII), and delete all the old win32setlocale.c kludges, after we reach a point where everyone should be using exclusively BCP 47.

[1] https://www.postgresql.org/message-id/flat/18196-b10f93dfbde3d7db%40postgresql.org
Attachment
I clicked "Trigger" to get a Mingw test run of this, and it failed[1]. I see why: our function win32_langinfo() believes that it shouldn't call GetLocaleInfoEx() on non-MSVC compilers, so we see 'initdb: error: could not find suitable encoding for locale "en-US"'. I think it has fallback code that parses the ".1252" or whatever on the end of the name, but "en-US" hasn't got one.

I don't know the first thing about Mingw but it looks like a declaration for that function arrived 6 years ago[2], and deleting the "#if defined(_MSC_VER)" fixes the problem and the tests pass[3]. As far as I know, we don't support any Mingw but the very latest: it's not a target with real users who have version requirements, it's just a developer [in]convenience, so if it passes on CI and whatever MSYS version "fairywren" runs in the build farm right now, that should be enough.

I could just do that in this patch, but I suppose that also means that someone needs to go through pg_locale.c and other places that test _MSC_VER not because they actually care about the compiler but because they want to detect some crusty old Mingw version, and see what else can be deleted as a result, possibly including a lot of fallback code. It feels like a separate cleanup for a separate patch.

[1] https://cirrus-ci.com/task/5301814774464512
[2] https://github.com/mirror/mingw-w64/blame/eff726c461e09f35eeaed125a3570fa5f807f02b/mingw-w64-tools/widl/include/winnls.h#L931
[3] https://cirrus-ci.com/task/6558569718349824
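The failure described above follows from how a suffix-based fallback behaves: an old-style name carries its code page after the dot, while a BCP 47 name carries nothing to parse. The sketch below shows that behaviour in isolation. It is not the code in win32_langinfo(); codepage_from_locale_name() is a hypothetical helper written only to illustrate why "en-US" yields no encoding.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the numeric code page that follows the last
 * '.' in a locale name, or -1 if the name has no numeric suffix.
 */
long
codepage_from_locale_name(const char *name)
{
    const char *dot;
    char       *end;
    long        cp;

    assert(name != NULL);

    dot = strrchr(name, '.');
    if (dot == NULL || !isdigit((unsigned char) dot[1]))
        return -1;              /* no suffix, or a non-numeric one */

    cp = strtol(dot + 1, &end, 10);
    if (*end != '\0' || cp <= 0)
        return -1;              /* trailing junk after the digits */

    return cp;
}
```

With "English_United States.1252" this returns 1252; with "en-US" it returns -1, which is the situation the Mingw run hit once GetLocaleInfoEx() was compiled out.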
Here is a thought that occurs to me, as I follow along with Jeff Davis's evolving proposals for built-in collations and ctypes: What would stop us from dropping support for the libc (sic) provider on Windows? That may sound radical and likely to cause extra work for people on upgrade, but how does that compare to the pain of keeping this barely maintained code in the tree? Suppose the idea in this thread goes ahead and we get people to transition to the modern locale names: there is non-zero transitional/upgrade pain there too. How delicious it would be to just nuke the whole thing from orbit, and keep only cross-platform code that is maintained with enthusiasm by active hackers. That's probably a little extreme, but it's the direction my thoughts start to go in when confronting the realisation that it's up to us [Unix hackers making drive-by changes], no one is coming to help us [from the Windows user community]. I've even heard others talk about dropping Windows completely, due to the maintenance imbalance. This would be somewhat more fine grained. (One could use a similar argument to drop non-NTFS filesystems and turn on POSIX-mode file links, to end that other locus of struggle.)