Thread: Windows UTF8 system locale

Windows UTF8 system locale

From
Noah Misch
Date:
Since ~2019, Windows has had the option "Beta: Use Unicode UTF-8 for worldwide
language support".  That option breaks the appendShellString() assumption that
it can escape every byte except '\0', '\r', and '\n'.  Instead, process
creation injects U+FFFD REPLACEMENT CHARACTER (UTF-8: ef bf bd) for each byte
of the command line not forming valid UTF-8.  Here's the Windows Server 2025
output from a test program that sends bytes 0x80..0xFF in a CreateProcessA()
command line:

argv[1] = 58 ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf
bdef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
efbf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef
bfbd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf
bdef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
efbf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef
bfbd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf
bdef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
efbf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef
bfbd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd 
GetCommandLineA() = 61 20 58 ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
efbf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef
bfbd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf
bdef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
efbf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef
bfbd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf
bdef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
efbf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef
bfbd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf
bdef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd 
GetCommandLineW() = 61 20 58 fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd
fffdfffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd
fffdfffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd
fffdfffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd
fffdfffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd
fffdfffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd 
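
The replacement pattern in that output can be approximated without a Windows
machine.  What follows is only a sketch: it assumes Python's UTF-8 decoder with
errors="replace" is a fair stand-in for the conversion done at process
creation.  For this particular input it produces the same bytes as shown
above, but it is not the Windows code path, and the function names are made up
for illustration.

```python
def simulate_utf8_acp(command_line: bytes) -> bytes:
    """Model the lossy round trip: bytes that are not valid UTF-8 become U+FFFD.

    Assumption: Python's errors="replace" decoder stands in for the Windows
    ANSI -> UTF-16 -> ANSI conversion under the UTF-8 system locale.
    """
    wide = command_line.decode("utf-8", errors="replace")
    return wide.encode("utf-8")


def hex_dump(data: bytes) -> str:
    """Format bytes the same way as the test program output (two hex digits each)."""
    return " ".join(f"{b:02x}" for b in data)


if __name__ == "__main__":
    # Same input as the test program: 'X' followed by bytes 0x80..0xFF.
    arg = b"X" + bytes(range(0x80, 0x100))
    received = simulate_utf8_acp(arg)
    print("argv[1] =", hex_dump(received))
    print("U+FFFD count:", received.count(b"\xef\xbf\xbd"))
```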

For PostgreSQL, I expect the most obvious problems will arise for rolname and
datname containing non-UTF8.  For example, pg_dumpall relies on
appendShellString() to call pg_dump for arbitrary datname.  pg_dumpall would
get "database ... does not exist".  Some ways we might react:

1. Instead of arbitrary bytes in argv[], use a temporary PGSERVICEFILE.  For
   other kinds of appendShellString() input (mostly file paths), we could
   provide other ways to pass them outside argv, or we could just not support
   the full character repertoire in those.  Windows "8.3 filenames" are a fair
   workaround.

2. Just fail if the system option is enabled and we would appendShellString()
   a non-UTF8 value.

3. Fail if we find U+FFFD in arguments.  It's valid Unicode, though.
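
Options 2 and 3 amount to two small checks, one on the sending side and one
on the receiving side.  A minimal Python sketch of the logic follows; the
function names and the utf8_option_enabled flag are hypothetical, and real
code would live in C next to appendShellString().

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def is_valid_utf8(value: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        value.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def check_outgoing_arg(value: bytes, utf8_option_enabled: bool) -> None:
    """Option 2: refuse to put a non-UTF-8 value on the command line."""
    if utf8_option_enabled and not is_valid_utf8(value):
        raise ValueError("argument is not valid UTF-8 and would be altered")


def check_incoming_arg(value: str) -> None:
    """Option 3: refuse arguments that contain U+FFFD."""
    if REPLACEMENT in value:
        raise ValueError("argument contains U+FFFD; it may have been altered")
```

The option 3 check cannot tell a substituted U+FFFD from one the user really
typed, which is the "valid Unicode" caveat noted above.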


I plan not to work on this myself, and I'm not advocating this as a priority
to anyone else.  I'm just sending this to record what I learned, in case it
helps someone for whom it does become a priority.

https://stackoverflow.com/a/57134096/16371536 shows how to enable the option.
I'd be interested to hear test results with that enabled.  My hypothesis is
that 010_dump_connstr.pl and 200_connstr.pl would fail.  (My Windows
development environments are all too old, and I stopped short of building a
new one for this.)  It should also be possible to test this in CI by building
an image with the following https://github.com/anarazel/pg-vm-images.git
modification:

--- a/scripts/windows_install_dbg.ps1
+++ b/scripts/windows_install_dbg.ps1
@@ -9,6 +9,15 @@ mkdir c:\t
 cd c:\t


+echo "enabling UTF8"
+Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' `
+  -Name 'ACP' -Value '65001'
+Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' `
+  -Name 'OEMCP' -Value '65001'
+Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' `
+  -Name 'MACCP' -Value '65001'
+
+
 echo "configuring windows error reporting"

 # prevent windows error handling dialog from causing hangs
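
To confirm that an image built this way really runs with the UTF-8 code page,
one could query the active code page.  A small sketch, assuming Python with
ctypes is available on the machine; GetACP() is a real kernel32 function, but
the helper name is made up for illustration.

```python
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def active_code_page():
    """Return the process ANSI code page on Windows, or None on other systems."""
    if sys.platform != "win32":
        return None
    import ctypes

    return int(ctypes.windll.kernel32.GetACP())


if __name__ == "__main__":
    acp = active_code_page()
    if acp is None:
        print("not running on Windows")
    elif acp == CP_UTF8:
        print("ACP = 65001 (UTF-8 option is in effect)")
    else:
        print("ACP =", acp)
```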



Re: Windows UTF8 system locale

From
Thomas Munro
Date:
On Sun, Dec 15, 2024 at 3:32 PM Noah Misch <noah@leadboat.com> wrote:
> For PostgreSQL, I expect the most obvious problems will arise for rolname and
> datname containing non-UTF8.  For example, pg_dumpall relies on
> appendShellString() to call pg_dump for arbitrary datname.  pg_dumpall would
> get "database ... does not exist".

Right, those catalogues have undefined encoding (the initial problem
my CLUSTER ENCODING proposal started trying to fix) and could even be
different for every row, and Windows wants all strings used in
non-wide environ, argv, file APIs, etc to be valid in the ACP (because
it converts them to UTF-16).  We would get away with it if UTF-8
weren't so picky, but come to think of it, so is SJIS, so maybe this
is not a new problem with $SUBJECT?

Wild guess: 文字化け (= mojibake) when encoded as UTF-8 and then passed in
a command line to CreateProcess() with ACP=SJIS might show the problem
(I just gave that string to iconv -f SJIS -t UTF-8 and it rejected it,
I'm assuming that means it'd do the same sort of thing in that
context).
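
The iconv check can be repeated with Python's cp932 codec.  This is a sketch
under the assumption that the codec rejects the same input that is invalid
for Windows code page 932; it says nothing about what CreateProcess() does
with such input.

```python
def valid_in_codec(data: bytes, codec: str) -> bool:
    """Return True if `data` decodes strictly in the given codec."""
    try:
        data.decode(codec, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    mojibake_utf8 = "文字化け".encode("utf-8")
    print(mojibake_utf8.hex(" "))
    print("valid as UTF-8:", valid_in_codec(mojibake_utf8, "utf-8"))
    print("valid as CP932:", valid_in_codec(mojibake_utf8, "cp932"))
```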

It's a shame the implicit conversion here doesn't fail with EILSEQ.  I
can't imagine how anything good can ever have come from lossy,
non-error-raising implicit conversions anywhere near argv[].  On the
other hand, on Unix we have other problems stemming from the
undefinedness.  What does "copy ... to '/tmp/café.txt'" do inside a
LATIN1 database?  macOS: EILSEQ, can't open that file, Linux: sure,
now you have a file whose name is displayed as caf�.txt in your UTF-8
terminal or other software (U+FFFD REPLACEMENT CHARACTER).
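
The Linux half of that example is easy to reproduce.  A short sketch, assuming
the terminal decodes file names as UTF-8 and substitutes U+FFFD for invalid
bytes (the helper name is made up):

```python
def render_in_utf8_terminal(filename: bytes) -> str:
    """Decode a file name the way a UTF-8 terminal would, replacing bad bytes."""
    return filename.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # The name as a LATIN1 database would hand it to the file system.
    latin1_name = "café.txt".encode("latin-1")
    print(latin1_name)
    print(render_in_utf8_terminal(latin1_name))
```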

> 2. Just fail if the system option is enabled and we would appendShellString()
>    a non-UTF8 value.

I guess the general version is just: fail if the string is not valid
in the ACP (MB_ERR_INVALID_CHARS).

With the ACP-matching idea for CLUSTER ENCODING, I *think* it should
become unreachable in the two recommended modes: either those strings
would be pure ASCII, or they'd be in database encoding (same encoding
for all databases enforced) and the ACP would match, so it would all
be aligned without any new conversions being required.  It also has an
UNDEFINED mode so a failed encoding validation there would still be
reachable that way.  Still thinking about it all though.



Re: Windows UTF8 system locale

From
Noah Misch
Date:
On Sun, Dec 15, 2024 at 06:43:35PM +0100, Michail Nikolaev wrote:
> I have Win 11 with that feature enabled, 200_connstr.pl passes without any
> issues, but 010_dump_connstr.pl fails, yes.
> All other tests seem to be passing, at least without ICU enabled.
> 
>  010_dump_connstr.pl log is attached.

Thanks.  I'll guess 200_connstr passed but created different names than
intended.  It likely got "will be truncated" messages like 010_dump_connstr
did.  Unlike 010_dump_connstr, nothing in 200_connstr fails if the resulting
object name doesn't match the intended name.

010_dump_connstr failed earlier than I hypothesized.  It failed when
pg_authid.rolname for $username4 got U+FFFD characters, which didn't match
what config_sspi_auth() wrote into pg_ident.conf.  CI w/ CP_UTF8 might evade
that particular failure via its use of PG_TEST_USE_UNIX_SOCKETS=1.

Reaching the hypothesized failure at the pg_dumpall->pg_dump call would need a
modified test.  These tests use the createdb executable.  Replacing a createdb
run with safe_psql() of a CREATE DATABASE command should reproduce the
pg_dumpall->pg_dump failure.



Re: Windows UTF8 system locale

From
Noah Misch
Date:
On Tue, Dec 17, 2024 at 02:29:59AM +1300, Thomas Munro wrote:
> On Sun, Dec 15, 2024 at 3:32 PM Noah Misch <noah@leadboat.com> wrote:
> > For PostgreSQL, I expect the most obvious problems will arise for rolname and
> > datname containing non-UTF8.  For example, pg_dumpall relies on
> > appendShellString() to call pg_dump for arbitrary datname.  pg_dumpall would
> > get "database ... does not exist".
> 
> Right, those catalogues have undefined encoding (the initial problem
> my CLUSTER ENCODING proposal started trying to fix) and could even be
> different for every row, and Windows wants all strings used in
> non-wide environ, argv, file APIs, etc to be valid in the ACP (because
> it converts them to UTF-16).  We would get away with it if UTF-8
> weren't so picky, but come to think of it, so is SJIS, so maybe this
> is not a new problem with $SUBJECT?
> 
> Wild guess: 文字化け (= mojibake) when encoded as UTF-8 and then passed in
> a command line to CreateProcess() with ACP=SJIS might show the problem
> (I just gave that string to iconv -f SJIS -t UTF-8 and it rejected it,
> I'm assuming that means it'd do the same sort of thing in that
> context).

I wasn't ready to believe it, but 010_dump_connstr indeed fails with
GetACP()==932.  We've had test coverage of this for 8+ years, so I gather few
or no runs of the TAP suite on GetACP()==932 systems have ever happened.  Wow.

Here's how your particular example traverses the CP932 command line:

CreateProcessA(0xe6 0x96 0x87 0xe5 0xad 0x97 0xe5 0x8c 0x96 0xe3 0x81 0x91)
argv[1] = e6 96 81 45 ad 97 e5 8c 96 e3 81
GetCommandLineA() = 61 20 e6 96 81 45 ad 97 e5 8c 96 e3 81
GetCommandLineW() = 61 20 8b41 30fb ff6d 601c 55a7 7e3a
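
The last lines of that trace are consistent with each other: encoding the
UTF-16 code units from GetCommandLineW() back to CP932 gives the argv[1]
bytes.  Here is a sketch that checks this with Python's cp932 codec, assuming
its table matches the Windows one for these six characters.  It covers only
the wide-to-ANSI step.  Python's decoder treats the invalid input bytes
differently from Windows, so the wide string is copied from the trace rather
than recomputed.

```python
def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


def argv_from_wide(wide: str) -> bytes:
    """Convert the wide command-line argument back to CP932 bytes."""
    return wide.encode("cp932")


# Bytes passed to CreateProcessA(): UTF-8 for the four-character example.
SENT = bytes.fromhex("e6 96 87 e5 ad 97 e5 8c 96 e3 81 91")
# Code units reported by GetCommandLineW() after the program name "a ".
WIDE = "\u8b41\u30fb\uff6d\u601c\u55a7\u7e3a"

if __name__ == "__main__":
    received = argv_from_wide(WIDE)
    print("sent    :", hex_dump(SENT))
    print("received:", hex_dump(received))
    print("altered :", received != SENT)
```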

> It's a shame the implicit conversion here doesn't fail with EILSEQ.  I
> can't imagine how anything good can ever have come from lossy,
> non-error-raising implicit conversions anywhere near argv[].  On the

It's a shame.

> other hand, on Unix we have other problems stemming from the
> undefinedness.  What does "copy ... to '/tmp/café.txt'" do inside a
> LATIN1 database?  macOS: EILSEQ, can't open that file, Linux: sure,
> now you have a file whose name is displayed as caf�.txt in your UTF-8
> terminal or other software (U+FFFD REPLACEMENT CHARACTER).

GNU ls provides nine options for rendering that name to a terminal:
https://www.gnu.org/software/coreutils/manual/html_node/Formatting-the-file-names.html
https://www.gnu.org/software/coreutils/quotes.html

Non-default option "ls --quoting=literal" does display the "replacement
character" way.  It may count as a shame that POSIX pathnames are [0x1,0xFF]
binary strings instead of Unicode character strings, but here we are.

> > 2. Just fail if the system option is enabled and we would appendShellString()
> >    a non-UTF8 value.
> 
> I guess the general version is just: fail if the string is not valid
> in the ACP (MB_ERR_INVALID_CHARS).

Roughly that.

> With the ACP-matching idea for CLUSTER ENCODING, I *think* it should
> become unreachable in the two recommended modes: either those strings
> would be pure ASCII, or they'd be in database encoding (same encoding
> for all databases enforced) and the ACP would match, so it would all
> be aligned without any new conversions being required.  It also has an
> UNDEFINED mode so a failed encoding validation there would still be
> reachable that way.  Still thinking about it all though.

I see.  Interesting.  Considering you need to be root to change the ACP, I'm
disinclined to bet big on requiring the ACP to match anything about encodings
used in PostgreSQL.  We might get away with it, but it sounds bad for the
Poker Tracker use case.



Re: Windows UTF8 system locale

From
Vladlen Popolitov
Date:
Noah Misch wrote on 2024-12-17 02:16:
> I wasn't ready to believe it, but 010_dump_connstr indeed fails with
> GetACP()==932.  We've had test coverage of this for 8+ years, so I gather few
> or no runs of the TAP suite on GetACP()==932 systems have ever happened.  Wow.
> 
> Here's how your particular example traverses the CP932 command line:
> 
> CreateProcessA(0xe6 0x96 0x87 0xe5 0xad 0x97 0xe5 0x8c 0x96 0xe3 0x81 0x91)
> argv[1] = e6 96 81 45 ad 97 e5 8c 96 e3 81
> GetCommandLineA() = 61 20 e6 96 81 45 ad 97 e5 8c 96 e3 81
> GetCommandLineW() = 61 20 8b41 30fb ff6d 601c 55a7 7e3a
> 
>> It's a shame the implicit conversion here doesn't fail with EILSEQ.  I
>> can't imagine how anything good can ever have come from lossy,
>> non-error-raising implicit conversions anywhere near argv[].  On the
> 
> It's a shame.

I also found that 010_dump_connstr fails if Windows has the UTF-8 option
enabled (in the language settings, where the option description marks it as a
beta feature).

It looks like this test does not take language settings into account.  It
creates a user with a long name, the language settings can double the length
of this name in the UTF-8 case, and Postgres truncates the name to 63 bytes
(NAMEDATALEN - 1), but the test continues to use the original long name and
fails.  It is not clear what exactly the test checks if it fails before the
dump check.
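
The growth and truncation described here can be sketched in a few lines.
Assumptions: invalid bytes are replaced by U+FFFD as shown at the start of the
thread, the server clips identifiers to NAMEDATALEN - 1 = 63 bytes at a
character boundary, and the sample role name is invented rather than taken
from the test.

```python
NAMEDATALEN = 64  # PostgreSQL default; identifiers hold at most 63 bytes


def replace_invalid_utf8(name: bytes) -> bytes:
    """Model the command-line conversion: invalid bytes become U+FFFD."""
    return name.decode("utf-8", errors="replace").encode("utf-8")


def clip_identifier(name: bytes, limit: int = NAMEDATALEN - 1) -> bytes:
    """Clip to at most `limit` bytes without splitting a UTF-8 character."""
    if len(name) <= limit:
        return name
    end = limit
    # Step back while the first dropped byte is a continuation byte (10xxxxxx).
    while end > 0 and (name[end] & 0xC0) == 0x80:
        end -= 1
    return name[:end]


if __name__ == "__main__":
    intended = b"regress_" + bytes(range(0x80, 0x9E))  # 38 bytes, hypothetical
    stored = clip_identifier(replace_invalid_utf8(intended))
    print("intended length:", len(intended))
    print("after replacement:", len(replace_invalid_utf8(intended)))
    print("stored length:", len(stored))
    print("names match:", stored == intended)
```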

-- 
Best regards,

Vladlen Popolitov.



Re: Windows UTF8 system locale

From
Vladlen Popolitov
Date:
Noah Misch wrote on 2024-12-17 02:16:
> On Tue, Dec 17, 2024 at 02:29:59AM +1300, Thomas Munro wrote:
>> On Sun, Dec 15, 2024 at 3:32 PM Noah Misch <noah@leadboat.com> wrote:
>> > For PostgreSQL, I expect the most obvious problems will arise for rolname and
>> > datname containing non-UTF8.  For example, pg_dumpall relies on
>> > appendShellString() to call pg_dump for arbitrary datname.  pg_dumpall would
>> > get "database ... does not exist".
>> 
>> Right, those catalogues have undefined encoding (the initial problem
>> my CLUSTER ENCODING proposal started trying to fix) and could even be
>> different for every row, and Windows wants all strings used in
>> non-wide environ, argv, file APIs, etc to be valid in the ACP (because
>> it converts them to UTF-16).  We would get away with it if UTF-8
>> weren't so picky, but come to think of it, so is SJIS, so maybe this
>> is not a new problem with $SUBJECT?
>> 
>> Wild guess: 文字化け (= mojibake) when encoded as UTF-8 and then passed in
>> a command line to CreateProcess() with ACP=SJIS might show the problem
>> (I just gave that string to iconv -f SJIS -t UTF-8 and it rejected it,
>> I'm assuming that means it'd do the same sort of thing in that
>> context).
> 
> I wasn't ready to believe it, but 010_dump_connstr indeed fails with
> GetACP()==932.  We've had test coverage of this for 8+ years, so I gather few
> or no runs of the TAP suite on GetACP()==932 systems have ever happened.  Wow.
> 
> Here's how your particular example traverses the CP932 command line:
> 
> CreateProcessA(0xe6 0x96 0x87 0xe5 0xad 0x97 0xe5 0x8c 0x96 0xe3 0x81 0x91)
> argv[1] = e6 96 81 45 ad 97 e5 8c 96 e3 81
> GetCommandLineA() = 61 20 e6 96 81 45 ad 97 e5 8c 96 e3 81
> GetCommandLineW() = 61 20 8b41 30fb ff6d 601c 55a7 7e3a
> 
>> It's a shame the implicit conversion here doesn't fail with EILSEQ.  I
>> can't imagine how anything good can ever have come from lossy,
>> non-error-raising implicit conversions anywhere near argv[].  On the
> 
> It's a shame.
> 
>> other hand, on Unix we have other problems stemming from the
>> undefinedness.  What does "copy ... to '/tmp/café.txt'" do inside a
>> LATIN1 database?  macOS: EILSEQ, can't open that file, Linux: sure,
>> now you have a file whose name is displayed as caf�.txt in your UTF-8
>> terminal or other software (U+FFFD REPLACEMENT CHARACTER).
> 
> GNU ls provides nine options for rendering that name to a terminal:
> https://www.gnu.org/software/coreutils/manual/html_node/Formatting-the-file-names.html
> https://www.gnu.org/software/coreutils/quotes.html
> 
> Non-default option "ls --quoting=literal" does display the "replacement
> character" way.  It may count as a shame that POSIX pathnames are [0x1,0xFF]
> binary strings instead of Unicode character strings, but here we are.
> 
>> > 2. Just fail if the system option is enabled and we would appendShellString()
>> >    a non-UTF8 value.
>> 
>> I guess the general version is just: fail if the string is not valid
>> in the ACP (MB_ERR_INVALID_CHARS).
> 
> Roughly that.
> 
>> With the ACP-matching idea for CLUSTER ENCODING, I *think* it should
>> become unreachable in the two recommended modes: either those strings
>> would be pure ASCII, or they'd be in database encoding (same encoding
>> for all databases enforced) and the ACP would match, so it would all
>> be aligned without any new conversions being required.  It also has an
>> UNDEFINED mode so a failed encoding validation there would still be
>> reachable that way.  Still thinking about it all though.
> 
> I see.  Interesting.  Considering you need to be root to change the ACP, I'm
> disinclined to bet big on requiring the ACP to match anything about encodings
> used in PostgreSQL.  We might get away with it, but it sounds bad for the
> Poker Tracker use case.

Hi Noah!

Your investigation in the previous emails on this topic is excellent.  This
UTF-8 feature leads to an annoying test failure (010_dump_connstr).

I read the articles linked above and concluded that this option is the user's
choice to push all programs to use UTF-8.  It forces UTF-8 encoding on the
screen and converts the command line to eliminate non-UTF-8 characters.  It is
not common in the Unix world to alter the user's command line (though we have
lived with LF to CR-LF conversion since DOS times), but it is already the
actual solution in Windows.  The drawback of this solution is that some
programs can stop working (like the pg_dumpall test).

It is really not such a bad situation.  Even before this, the test had to
exclude some characters that cannot be passed through a command line:
" >  <  | & .

How can it be improved?

This test was intended to check that pg_dumpall can handle all characters
from 1 to 255, with some exceptions.  It did its job and found this
configuration, in which not all characters can be used: the OS considers them
invalid in a command line even if PostgreSQL considers them valid.

Option 1.
Skip this test on Windows in UTF-8 mode.

Option 2.
Exclude all 8-bit characters on Windows in UTF-8 mode.  Currently only " is
excluded on Windows.

Option 3.
Test with a limited list of valid UTF-8 characters, just to check that they
also work.  It could be 64 two-byte UTF-8 characters.
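
For options 2 and 3, the test inputs could be generated along these lines.
This is only a Python sketch of the character sets.  The real test is written
in Perl and excludes more characters than shown here, and the choice of
U+00C0..U+00FF for the two-byte sample is my assumption.

```python
def ascii_test_bytes(excluded: bytes = b'"') -> bytes:
    """Option 2: bytes 1..127 only, minus characters that cannot be passed."""
    return bytes(b for b in range(1, 128) if b not in excluded)


def two_byte_utf8_sample(count: int = 64) -> str:
    """Option 3: `count` characters that each encode to two UTF-8 bytes."""
    return "".join(chr(cp) for cp in range(0x00C0, 0x00C0 + count))


if __name__ == "__main__":
    sample = two_byte_utf8_sample()
    print(len(ascii_test_bytes()), "ASCII bytes")
    print(len(sample), "characters,", len(sample.encode("utf-8")), "bytes")
```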

It would be interesting to hear other opinions.

-- 
Best regards,

Vladlen Popolitov.