Thread: Windows default locale vs initdb

Windows default locale vs initdb

From
Thomas Munro
Date:
Hi,

Moving this topic into its own thread from the one about collation
versions, because it concerns pre-existing problems, and that thread
is long.

Currently initdb sets up template databases with old-style Windows
locale names reported by the OS, and they seem to have caused us quite
a few problems over the years:

db29620d "Work around Windows locale name with non-ASCII character."
aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."

... and probably more, and also various threads about, for example,
"German_German.1252" vs "German_Switzerland.1252" which seem to get
confused or badly canonicalised or rejected somewhere in the mix.

I hadn't focused on any of that before, being a non-Windows-user, but
the entire contents of win32setlocale.c supports the theory that
Windows' manual meant what it said when it said[1]:

"We do not recommend this form for locale strings embedded in
code or serialized to storage, because these strings are more likely
to be changed by an operating system update than the locale name
form."

I suppose that was the only form available at the time the code was
written, so there was no choice.  The question we asked ourselves
multiple times in the other thread was how we're supposed to get to
the modern BCP 47 form when creating the template databases.  It looks
like one possibility, since Vista, is to call
GetUserDefaultLocaleName()[2], which doesn't appear to have been
discussed before on this list.  That doesn't allow you to ask for the
default for each individual category, but I don't know if that is even
a concept for Windows user settings.  It may be that some of the other
nearby functions give a better answer for some reason.  But one thing
is clear from a test that someone kindly ran for me: it reports
standardised strings like "en-NZ", not strings like "English_New
Zealand.1252".
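
As a rough illustration only (not taken from any patch in this thread), a
call to GetUserDefaultLocaleName() could look like the sketch below.  The
helper name get_default_locale_name() is hypothetical, and the non-Windows
branch that reports "C" is an assumption added purely so the sketch compiles
everywhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical helper: copy the user's default locale name (for example
 * "en-NZ") into buf.  Returns false on failure.
 */
bool
get_default_locale_name(char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
#ifdef _WIN32
    wchar_t     wname[LOCALE_NAME_MAX_LENGTH];

    /* Returns the length including the terminator, or 0 on failure. */
    if (GetUserDefaultLocaleName(wname, LOCALE_NAME_MAX_LENGTH) == 0)
        return false;

    /* Convert the wide string to a narrow one for storage or display. */
    if (WideCharToMultiByte(CP_ACP, 0, wname, -1, buf, (int) buflen,
                            NULL, NULL) == 0)
        return false;
    return true;
#else
    /* Assumption for this sketch only: off Windows, report "C". */
    if (buflen < 2)
        return false;
    strcpy(buf, "C");
    return true;
#endif
}
```

A caller would pass a buffer of at least LOCALE_NAME_MAX_LENGTH bytes and
fall back to some other default if the function returns false.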

No patch, but I wondered if any Windows hackers have any feedback on
the relative sanity of trying to fix all these problems this way.

[1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-160
[2] https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getuserdefaultlocalename



Re: Windows default locale vs initdb

From
Pavel Stehule
Date:


Last weekend I talked with a user about an interesting (and messy) issue. They needed to create a new database with a Czech collation on Azure SAS. There was no entry in pg_collation for the Czech language. The reply from Microsoft support was to use CREATE DATABASE xxx TEMPLATE 'template0' ENCODING 'utf8' LOCALE 'cs_CZ.UTF8', and that worked.

Regards

Pavel




Re: Windows default locale vs initdb

From
Andrew Dunstan
Date:


On Mon, Apr 19, 2021 at 4:53 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:


> Last weekend I talked with a user about an interesting (and messy) issue. They needed to create a new database with a Czech collation on Azure SAS. There was no entry in pg_collation for the Czech language. The reply from Microsoft support was to use CREATE DATABASE xxx TEMPLATE 'template0' ENCODING 'utf8' LOCALE 'cs_CZ.UTF8', and that worked.



My understanding from Microsoft staff at conferences is that Azure's PostgreSQL SAS runs on Linux, not Windows.

cheers

andrew

Re: Windows default locale vs initdb

From
Pavel Stehule
Date:


On Mon, Apr 19, 2021 at 12:52, Andrew Dunstan <andrew@dunslane.net> wrote:





> My understanding from Microsoft staff at conferences is that Azure's PostgreSQL SAS runs on Linux, not Windows.

I had different information, but something was still wrong, because there were no Czech locales in pg_collation.

 


Re: Windows default locale vs initdb

From
Dave Page
Date:


On Mon, Apr 19, 2021 at 11:52 AM Andrew Dunstan <andrew@dunslane.net> wrote:

> My understanding from Microsoft staff at conferences is that Azure's PostgreSQL SAS runs on Linux, not Windows.

This is from a regular Azure Database for PostgreSQL single server:

postgres=> select version();
                          version                           
------------------------------------------------------------
 PostgreSQL 11.6, compiled by Visual C++ build 1800, 64-bit
(1 row) 

And this is from the new Flexible Server preview:

postgres=> select version();
                                                     version                                                     
-----------------------------------------------------------------------------------------------------------------
 PostgreSQL 12.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
(1 row)

So I guess it's a case of "it depends".


Re: Windows default locale vs initdb

From
Andrew Dunstan
Date:
On 4/19/21 10:26 AM, Dave Page wrote:
>
>
> On Mon, Apr 19, 2021 at 11:52 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
>     My understanding from Microsoft staff at conferences is that
>     Azure's PostgreSQL SAS runs on Linux, not Windows.
>
>
> This is from a regular Azure Database for PostgreSQL single server:
>
> postgres=> select version();
>                           version                           
> ------------------------------------------------------------
>  PostgreSQL 11.6, compiled by Visual C++ build 1800, 64-bit
> (1 row) 
>
> And this is from the new Flexible Server preview:
>
> postgres=> select version();
>                                                      version
> -----------------------------------------------------------------------------------------------------------------
>  PostgreSQL 12.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
> (1 row)
>
> So I guess it's a case of "it depends".
>

Good to know. A year or two back, at more than one conference, I tried to enlist some of these folks in helping us with
Windows PostgreSQL, and their reply was that they knew nothing about it because they were on Linux :-) I guess things
change over time.
 


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Windows default locale vs initdb

From
Peter Eisentraut
Date:
On 19.04.21 07:42, Thomas Munro wrote:
> It looks
> like one possibility, since Vista, is to call
> GetUserDefaultLocaleName()[2], which doesn't appear to have been
> discussed before on this list.  That doesn't allow you to ask for the
> default for each individual category, but I don't know if that is even
> a concept for Windows user settings.

pg_newlocale_from_collation() doesn't support collcollate != collctype 
on Windows anyway, so that wouldn't be an issue.



Re: Windows default locale vs initdb

From
Noah Misch
Date:
On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:
> Currently initdb sets up template databases with old-style Windows
> locale names reported by the OS, and they seem to have caused us quite
> a few problems over the years:
> 
> db29620d "Work around Windows locale name with non-ASCII character."
> aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
> db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
> 9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."

> I suppose that was the only form available at the time the code was
> written, so there was no choice.

Right.

> The question we asked ourselves
> multiple times in the other thread was how we're supposed to get to
> the modern BCP 47 form when creating the template databases.  It looks
> like one possibility, since Vista, is to call
> GetUserDefaultLocaleName()[2]

> No patch, but I wondered if any Windows hackers have any feedback on
> relative sanity of trying to fix all these problems this way.

Sounds reasonable.  If PostgreSQL v15 would otherwise run on Windows Server
2003 R2, this is a good time to let that support end.



Re: Windows default locale vs initdb

From
Juan José Santamaría Flecha
Date:

On Sun, May 16, 2021 at 6:29 AM Noah Misch <noah@leadboat.com> wrote:
> On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:
> > The question we asked ourselves
> > multiple times in the other thread was how we're supposed to get to
> > the modern BCP 47 form when creating the template databases.  It looks
> > like one possibility, since Vista, is to call
> > GetUserDefaultLocaleName()[2]
> >
> > No patch, but I wondered if any Windows hackers have any feedback on
> > relative sanity of trying to fix all these problems this way.
>
> Sounds reasonable.  If PostgreSQL v15 would otherwise run on Windows Server
> 2003 R2, this is a good time to let that support end.

The value returned by GetUserDefaultLocaleName() is a system configured parameter, independent of what you set with setlocale(). It might be reasonable for initdb but not for a backend in most cases.

You can get the locale POSIX-ish name using GetLocaleInfoEx(), but this is no longer recommended, because using LCIDs is no longer recommended [1]. Although, this would work for legacy locales. Please find attached a POC patch showing this approach.


Regards,

Juan José Santamaría Flecha
Attachment

Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Wed, Dec 15, 2021 at 11:32 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:
> On Sun, May 16, 2021 at 6:29 AM Noah Misch <noah@leadboat.com> wrote:
>> On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:
>> > The question we asked ourselves
>> > multiple times in the other thread was how we're supposed to get to
>> > the modern BCP 47 form when creating the template databases.  It looks
>> > like one possibility, since Vista, is to call
>> > GetUserDefaultLocaleName()[2]
>>
>> > No patch, but I wondered if any Windows hackers have any feedback on
>> > relative sanity of trying to fix all these problems this way.
>>
>> Sounds reasonable.  If PostgreSQL v15 would otherwise run on Windows Server
>> 2003 R2, this is a good time to let that support end.
>>
> The value returned by GetUserDefaultLocaleName() is a system configured parameter, independent of what you set with
> setlocale(). It might be reasonable for initdb but not for a backend in most cases.

Agreed.  Only for initdb, and only if you didn't specify a locale name
on the command line.

> You can get the locale POSIX-ish name using GetLocaleInfoEx(), but this is no longer recommended, because using LCIDs
> is no longer recommended [1]. Although, this would work for legacy locales. Please find attached a POC patch showing
> this approach.

Now that museum-grade Windows has been defenestrated, we are free to
call GetUserDefaultLocaleName().  Here's a patch.

One thing you did in your patch that I disagree with, I think, was to
convert a BCP 47 name to a POSIX name early, that is, s/-/_/.  I think
we should use the locale name exactly as Windows (really, under the
covers, ICU) spells it.  There is only one place in the tree today
that really wants a POSIX locale name, and that's LC_MESSAGES,
accessed by GNU gettext, not Windows.  We already had code to cope
with that.

I think we should also convert to POSIX format when making the
collname in your pg_import_system_collations() proposal, so that
COLLATE "en_US" works (= a SQL identifier), but that's another
thread[1].  I don't think we should do it in collcollate or
datcollate, which is a string for the OS to interpret.
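
To make the s/-/_/ step concrete, here is a minimal sketch of a late
conversion that would be applied only where a POSIX-style spelling is
actually wanted (for a collname or for gettext), leaving the string handed
to the OS untouched.  The function name bcp47_to_posix() is hypothetical and
is not taken from either patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: write a POSIX-style spelling of a BCP 47 name into
 * dst ("en-US" becomes "en_US").  Returns false if dst is too small.
 */
bool
bcp47_to_posix(const char *src, char *dst, size_t dstlen)
{
    size_t      len;

    assert(src != NULL && dst != NULL);

    len = strlen(src);
    if (len + 1 > dstlen)
        return false;

    /* Copy up to and including the terminator, replacing '-' with '_'. */
    for (size_t i = 0; i <= len; i++)
        dst[i] = (src[i] == '-') ? '_' : src[i];
    return true;
}
```

The source string is never modified, so the original BCP 47 name stays
available for collcollate or datcollate.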

With my garbage collector hat on, I would like to rip out all of the
support for traditional locale names, eventually.  Deleting kludgy
code is easy and fun -- 0002 is a first swing at that -- but there
remains an important unanswered question.  How should someone
pg_upgrade an "English_Canada.1252" cluster if we now reject that name?
We'd need to do a conversion to "en-CA", or somehow tell the user to.
Hmmmm.

[1] https://www.postgresql.org/message-id/flat/CAC%2BAXB0WFjJGL1n33bRv8wsnV-3PZD0A7kkjJ2KjPH0dOWqQdg%40mail.gmail.com

Attachment

Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Tue, Jul 19, 2022 at 10:58 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a patch.

I added this to the next commitfest, and cfbot promptly told me about
some warnings I needed to fix.  That'll teach me to post a patch
tested with "ci-os-only: windows".  Looking more closely at some error
messages that report GetLastError() where I'd mixed up %d and %lu, I
see also that I didn't quite follow existing conventions for wording
when reporting Windows error numbers, so I fixed that too.
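
For illustration, the %d versus %lu mix-up comes down to GetLastError()
returning a DWORD, which is an unsigned long on Windows.  The sketch below
uses a hypothetical helper, format_win32_error(), and the "error code %lu"
wording is an assumption about the convention rather than a quote from the
patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a message carrying a Windows error number.
 * The code is passed as unsigned long, so the matching conversion is %lu.
 */
int
format_win32_error(char *buf, size_t buflen, const char *what,
                   unsigned long code)
{
    assert(buf != NULL && what != NULL);
    return snprintf(buf, buflen, "%s: error code %lu", what, code);
}
```

On Windows the last argument would be the value of GetLastError(), captured
immediately after the failing call.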

In the "startcreate" step on CI you can see that it says:

The database cluster will be initialized with locale "en-US".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".

As for whether "accordingly" still applies, by the logic of
win32_langinfo()...  Windows still considers WIN1252 to be the default
ANSI code page for "en-US", though it'd work with UTF-8 too.  I'm not
sure what to make of that.  The goal here was to give Windows users
good defaults, but WIN1252 is probably not what most people actually
want.  Hmph.

Attachment

Re: Windows default locale vs initdb

From
Juan José Santamaría Flecha
Date:

On Tue, Jul 19, 2022 at 12:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Now that museum-grade Windows has been defenestrated, we are free to
> call GetUserDefaultLocaleName().  Here's a patch.

This LGTM. 

> I think we should also convert to POSIX format when making the
> collname in your pg_import_system_collations() proposal, so that
> COLLATE "en_US" works (= a SQL identifier), but that's another
> thread[1].  I don't think we should do it in collcollate or
> datcollate, which is a string for the OS to interpret.

That thread has been split [1], but that is how the current version behaves.

> With my garbage collector hat on, I would like to rip out all of the
> support for traditional locale names, eventually.  Deleting kludgy
> code is easy and fun -- 0002 is a first swing at that -- but there
> remains an important unanswered question.  How should someone
> pg_upgrade an "English_Canada.1252" cluster if we now reject that name?
> We'd need to do a conversion to "en-CA", or somehow tell the user to.
> Hmmmm.
 
Is there a safe way to do that in pg_upgrade or would we be forcing users to pg_dump into the new cluster?
 

Regards,

Juan José Santamaría Flecha

Re: Windows default locale vs initdb

From
Juan José Santamaría Flecha
Date:

On Tue, Jul 19, 2022 at 4:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> As for whether "accordingly" still applies, by the logic of
> win32_langinfo()...  Windows still considers WIN1252 to be the default
> ANSI code page for "en-US", though it'd work with UTF-8 too.  I'm not
> sure what to make of that.  The goal here was to give Windows users
> good defaults, but WIN1252 is probably not what most people actually
> want.  Hmph.

Still, WIN1252 is not the wrong answer for what we are asking. Even if you enable UTF-8 support [1], the system will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.


Regards,

Juan José Santamaría Flecha

Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Wed, Jul 20, 2022 at 10:27 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:
> On Tue, Jul 19, 2022 at 4:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> As for whether "accordingly" still applies, by the logic of
>> win32_langinfo()...  Windows still considers WIN1252 to be the default
>> ANSI code page for "en-US", though it'd work with UTF-8 too.  I'm not
>> sure what to make of that.  The goal here was to give Windows users
>> good defaults, but WIN1252 is probably not what most people actually
>> want.  Hmph.
>
>
> Still, WIN1252 is not the wrong answer for what we are asking. Even if you enable UTF-8 support [1], the system will
> use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.

I'm still confused about what that means.  Suppose we decided to
insist by adding a ".UTF-8" suffix to the name, as that page says we
can now that we're on Windows 10+, when building the default locale
name (see experimental 0002 patch, attached).  It initially seemed to
have the right effect:

The database cluster will be initialized with locale "en-US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

But then the Turkish i test in contrib/citext/sql/citext_utf8.sql failed[1]:

SELECT 'i'::citext = 'İ'::citext AS t;
 t
 ---
- t
+ f
 (1 row)

About the pg_upgrade problem, maybe it's OK ... existing old format
names should continue to work, but we can still remove the weird code
that does locale name tweaking, right?  pg_upgraded databases should
contain fixed names (ie that were fixed by old initdb so should
continue to work), and new clusters will get BCP 47 names.

I don't really know, I was just playing with rough ideas by sending
patches to CI here...

[1] https://cirrus-ci.com/task/6423238052937728

Attachment

Re: Windows default locale vs initdb

From
Juan José Santamaría Flecha
Date:

On Wed, Jul 20, 2022 at 1:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Jul 20, 2022 at 10:27 PM Juan José Santamaría Flecha
> <juanjo.santamaria@gmail.com> wrote:
> > Still, WIN1252 is not the wrong answer for what we are asking. Even if you enable UTF-8 support [1], the system will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
>
> I'm still confused about what that means.  Suppose we decided to
> insist by adding a ".UTF-8" suffix to the name, as that page says we
> can now that we're on Windows 10+, when building the default locale
> name (see experimental 0002 patch, attached).  It initially seemed to
> have the right effect:
>
> The database cluster will be initialized with locale "en-US.UTF-8".
> The default database encoding has accordingly been set to "UTF8".
> The default text search configuration will be set to "english".

Let me try to explain this using the "Beta: Use Unicode UTF-8 for worldwide language support" option [1]. 

- Currently, on a system with the "English_United States" language settings and that option disabled, running initdb gives:

The database cluster will be initialized with locale "English_United States.1252".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".

And as a test for psql:

SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
ERROR:  character with byte sequence 0xc5 0x9f in encoding "UTF8" has no equivalent in encoding "WIN1252"

We get this error even if the database encoding is UTF8; it is caused by the tr_tr locales being encoded in WIN1254. We can discuss this in another thread, and I can propose a patch.

- If we enable the UTF-8 support option, the same test gives:

The database cluster will be initialized with locale "English_United States.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

And for psql:

SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
 to_char
---------
 şubat
(1 row)

In this case the Windows locales are actually UTF8 encoded.

TL;DR: What I want to show through this example is that Windows ACP is not modified by setlocale(), it can only be done through the Windows registry and only in recent releases.
 
> But then the Turkish i test in contrib/citext/sql/citext_utf8.sql failed[1]:
>
> SELECT 'i'::citext = 'İ'::citext AS t;
>  t
>  ---
> - t
> + f
>  (1 row)

This is the current state of affairs:

- Windows:

SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
 latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
 ı                   | i           | I             | i     | İ                    | İ

- Linux:

SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
 latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
 ı                   | i           | I             | i     | İ                    | i

latin_capital_dotted (U+0130) does not lower-case to the same value on the two platforms.
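
One way to probe what the C library itself does, outside PostgreSQL, is a
small sketch along these lines.  The helper lower_in_locale() is
hypothetical, and locale names differ between platforms, so the name passed
in is left to the caller and no particular result for U+0130 is claimed.

```c
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical probe: lower-case one wide character under the named
 * LC_CTYPE locale, then restore the previous locale.  Returns false if the
 * locale is not available on this system.
 */
bool
lower_in_locale(const char *locale, wint_t in, wint_t *out)
{
    char        saved[256];
    const char *cur;

    assert(locale != NULL && out != NULL);

    /* Remember the current LC_CTYPE so it can be restored afterwards. */
    cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL || strlen(cur) >= sizeof(saved))
        return false;
    strcpy(saved, cur);

    if (setlocale(LC_CTYPE, locale) == NULL)
        return false;
    *out = towlower(in);

    setlocale(LC_CTYPE, saved);
    return true;
}
```

Calling it with 0x0130 under a UTF-8 locale on each platform and printing
the result should make any difference in the libc case mapping visible.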
 

Regards,

Juan José Santamaría Flecha

Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Fri, Jul 22, 2022 at 11:59 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:
> TL;DR: What I want to show through this example is that Windows ACP is not modified by setlocale(), it can only be
> done through the Windows registry and only in recent releases.

Thanks, that was helpful, and so was that SO link.

So it sounds like I should forget about the v3-0002 patch, but the
v3-0001 and v3-0003 patches might have a future.  And it sounds like
we might need to investigate maybe defending ourselves against the ACP
being different than what we expect (ie not matching the database
encoding)?  Did I understand correctly that you're looking into that?



Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Fri, Jul 29, 2022 at 3:33 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jul 22, 2022 at 11:59 PM Juan José Santamaría Flecha
> <juanjo.santamaria@gmail.com> wrote:
> > TL;DR: What I want to show through this example is that Windows ACP is not modified by setlocale(), it can only be
> > done through the Windows registry and only in recent releases.
>
> Thanks, that was helpful, and so was that SO link.
>
> So it sounds like I should forget about the v3-0002 patch, but the
> v3-0001 and v3-0003 patches might have a future.  And it sounds like
> we might need to investigate maybe defending ourselves against the ACP
> being different than what we expect (ie not matching the database
> encoding)?  Did I understand correctly that you're looking into that?

I'm going to withdraw this entry.  The sooner we get something like
0001 into a release, the sooner the world will be rid of PostgreSQL
clusters initialised with the bad old locale names that the manual
very clearly tells you not to use for databases.... but I don't
understand this ACP/registry vs database encoding stuff and how it
relates to the use of BCP47 locale names, which puts me off changing
anything until we do.



Re: Windows default locale vs initdb

From
Thomas Munro
Date:
Another country has changed its name, and a Windows OS update has
again broken every PostgreSQL cluster in that whole country[1] (or at
least those that had accepted initdb's default choice of locale,
probably most).  Let's get to the bottom of this, because otherwise it
is simply going to keep happening, causing administrative pain for a
lot of people.

Here is a rebase of the basic patch I proposed last time, and a
re-statement of what we know:

1.  initdb chooses a default locale using a technique that gives you
an unstable ("Czech Republic"->"Czechia", "Turkey"->"Türkiye"),
non-ASCII ("Norwegian (Bokmål)") string that we are warned we should
not store anywhere.  We store it, and then later it is not recognised.
Instead we should select an IETF BCP 47 locale name, based on stable
ISO country and language codes, like "en-US", "tr-TR" etc.  Here is
the patch to teach initdb to use that, unchanged from v3 except that I
tweaked the docs a bit.

2.  In Windows 10+ it is now also possible to put ".UTF-8" on the end
of locale names.  I couldn't figure out whether we should do that, and
what effect it has on ctypes -- apparently not the effect I expected
(see upthread).  Was our UTF-8 support on Windows already broken, and
this new ".UTF-8" thing is just a new way to reach that brokenness?
Is it OK to continue to choose the "legacy" single byte encodings by
default on that OS, and consider that a separate topic for separate
research?

3.  It is not clear to me how we should deal with pg_upgrade.
Eventually we want all of the old-school names to fade away, and
pg_upgrade would need to be part of that.  Perhaps there is some API
that can be used to translate to the new canonical forms without us
having to maintain translation tables and other messiness in our tree.

4.  Eventually we should probably ban non-ASCII characters from
entering the relevant catalogues (they are shared, so their encoding
is undefined except that they must be a superset of ASCII), and delete
all the old win32setlocale.c kludges, after we reach a point where
everyone should be using exclusively BCP 47.

[1] https://www.postgresql.org/message-id/flat/18196-b10f93dfbde3d7db%40postgresql.org

Attachment

Re: Windows default locale vs initdb

From
Thomas Munro
Date:
I clicked "Trigger" to get a Mingw test run of this, and it failed[1].
I see why: our function win32_langinfo() believes that it shouldn't
call GetLocaleInfoEx() on non-MSVC compilers, so we see 'initdb:
error: could not find suitable encoding for locale "en-US"'.  I think
it has fallback code that parses the ".1252" or whatever on the end of
the name, but "en-US" hasn't got one.  I don't know the first thing
about Mingw but it looks like a declaration for that function arrived
6 years ago[2], and deleting the "#if defined(_MSC_VER)" fixes the
problem and the tests pass[3].  As far as I know, we don't support any
Mingw but the very latest: it's not a target with real users who have
version requirements, it's just a developer [in]convenience, so if it
passes on CI and whatever MSYS version "fairywren" runs in the build
farm right now, that should be enough.
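
To show why a suffix-parsing fallback cannot work for "en-US", here is a
simplified sketch of that kind of logic.  It is not the actual
win32_langinfo() code, and codepage_from_suffix() is a hypothetical name; it
only accepts names that end in ".<digits>".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical fallback: derive an encoding name such as "CP1252" from an
 * old-style locale name such as "English_United States.1252".  Returns
 * false when the name carries no purely numeric suffix after the last dot.
 */
bool
codepage_from_suffix(const char *locale, char *buf, size_t buflen)
{
    const char *dot;
    size_t      ndigits;

    assert(locale != NULL && buf != NULL);

    dot = strrchr(locale, '.');
    if (dot == NULL)
        return false;           /* e.g. "en-US": nothing to parse */

    ndigits = strspn(dot + 1, "0123456789");
    if (ndigits == 0 || dot[1 + ndigits] != '\0')
        return false;           /* suffix is empty or not all digits */

    if (ndigits + 3 > buflen)   /* "CP" + digits + terminator */
        return false;

    snprintf(buf, buflen, "CP%s", dot + 1);
    return true;
}
```

With a BCP 47 name there is no suffix at all, so the encoding has to come
from an API lookup instead.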

I could just do that in this patch, but I suppose that also means that
someone needs to go through pg_locale.c and other places that test
_MSC_VER not because they actually care about the compiler but because
they want to detect some crusty old Mingw version, and see what else
can be deleted as a result, possibly including a lot of fallback code.
It feels like a separate cleanup for a separate patch.

[1] https://cirrus-ci.com/task/5301814774464512
[2] https://github.com/mirror/mingw-w64/blame/eff726c461e09f35eeaed125a3570fa5f807f02b/mingw-w64-tools/widl/include/winnls.h#L931
[3] https://cirrus-ci.com/task/6558569718349824



Re: Windows default locale vs initdb

From
Thomas Munro
Date:
Here is a thought that occurs to me, as I follow along with Jeff
Davis's evolving proposals for built-in collations and ctypes:  What
would stop us from dropping support for the libc (sic) provider on
Windows?  That may sound radical and likely to cause extra work for
people on upgrade, but how does that compare to the pain of keeping
this barely maintained code in the tree?  Suppose the idea in this
thread goes ahead and we get people to transition to the modern locale
names: there is non-zero transitional/upgrade pain there too.  How
delicious it would be to just nuke the whole thing from orbit, and
keep only cross-platform code that is maintained with enthusiasm by
active hackers.

That's probably a little extreme, but it's the direction my thoughts
start to go in when confronting the realisation that it's up to us
[Unix hackers making drive-by changes], no one is coming to help us
[from the Windows user community].

I've even heard others talk about dropping Windows completely, due to
the maintenance imbalance.  This would be somewhat more fine grained.
(One could use a similar argument to drop non-NTFS filesystems and
turn on POSIX-mode file links, to end that other locus of struggle.)