Re: [EXTERNAL] Re: Windows Application Issues | PostgreSQL | REF # 48475607 - Mailing list pgsql-bugs
From | Thomas Munro |
---|---|
Subject | Re: [EXTERNAL] Re: Windows Application Issues | PostgreSQL | REF # 48475607 |
Date | |
Msg-id | CA+hUKGKrzGnCW3MoVC7gayHCtJQPEJKgpfQWD6c7eUvFp+ac+A@mail.gmail.com Whole thread Raw |
In response to | RE: [EXTERNAL] Re: Windows Application Issues | PostgreSQL | REF # 48475607 ("Haifang Wang (Centific Technologies Inc)" <v-haiwang@microsoft.com>) |
Responses |
RE: [EXTERNAL] Re: Windows Application Issues | PostgreSQL | REF # 48475607
|
List | pgsql-bugs |
On Tue, May 14, 2024 at 6:51 AM Haifang Wang (Centific Technologies Inc) <v-haiwang@microsoft.com> wrote: > You said the Turkey -> Türkiye issue caused by a recent operating system upgrade, you mean this crash is caused by thechanges made on Windows side? If that is the case, you prefer to leave the bug there? Any plan for the future? First, let me restate the problem: When you create a database cluster (= a PostgreSQL instance) with "initdb", unless you request a default locale with --locale, initdb uses setlocale("") to query the system/user default locale. It then records that string in postgresql.conf, and also in the pg_database catalog. On POSIX systems, that captures something like "tr-TR.UTF-8" or similar. On Windows, that captures something like "Turkish_Turkey.1254". Later, PostgreSQL uses newlocale() or setlocale() functions to access that locale again. In rare cases where a country changes its name, a Windows update *renames* the locale, and then those calls fail, because the old name is not recognised anymore. I proposed a partial solution that should help avoid the problem in future: I think that initdb should instead call GetUserDefaultLocaleName() to discover the user account's default locale, because it returns BCP47 names like "tr-TR". It is probably less likely for a country or language to change its ISO code (but not impossible[1]), than for the English-language names of them to change. Even if you reject this idea because technically they can both change, there are other reasons why we should not be storing this "display"-style names anywhere, including that PostgreSQL needs to store them in a place where the encoding must be ASCII, which Türkiye is not. And finally, the Windows manual explicitly warns us about this[2]: "We don't recommend this form for locale strings embedded in code or serialized to storage: These strings are more likely to be changed by an operating system update than the locale name form." (Note: the BCP47 support in Windows did not exist when that PostgreSQL code was written, so it did what it had to at the time.) There are further problems to resolve: 1. We don't know if we should put encodings (AKA codepages?) on the end of those strings or not. There is some confusion about what exactly it does, and how it interacts with the "ACP". I'd be worried that if you don't put the endings on, perhaps it can change under your feet. (I suspect that part of the discussion on that other thread took some wrong turns, based on citext_utf8 results that were actually probably misleading.) 2. If we do decide to put the encoding suffixes on, which encoding should we be suggesting? It appears from anecdotal reports that most PostgreSQL-on-Windows users are stuck in the past, using the old pre-UTF-8 language-specific encodings. Should we be encouraging UTF-8 use by default, if we can? Maybe that is a separate question. Then there is the practical question of what to do with an already-broken system. One idea would be to introduce a "locale remapping" file, pgdata/pg_locale.map where you can write things like "Turkish_Turkey.1254"="Turkish_Türkiye.1254". I think there might be a GUI tool that allows you to duplicate, rename, etc locales in Windows, so you can re-create the old name. I believe that is how some people have fixed their broken databases. I don't know if there is a good reference/blog/article on that, that we should be pointing people towards if they show up with broken systems. Patches, testing, research are welcome! Even though I put forward that BCP47 idea, it was based on reading the manual, so the "unresolved" questions may in fact be very easy to resolve by people who actually use/know Windows. Even if I had been gung-ho about committing that in 16 without feedback from Windows users, it would have been too late to help Turkish users with existing databases. [1] https://learn.microsoft.com/en-us/globalization/locale/standard-locale-names [2] https://learn.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-160
pgsql-bugs by date: