On Sun, Dec 15, 2024 at 3:32 PM Noah Misch <noah@leadboat.com> wrote:
> For PostgreSQL, I expect the most obvious problems will arise for rolname and
> datname containing non-UTF8. For example, pg_dumpall relies on
> appendShellString() to call pg_dump for arbitrary datname. pg_dumpall would
> get "database ... does not exist".

Right, those catalogues have undefined encoding (the initial problem
my CLUSTER ENCODING proposal started trying to fix) and could even be
different for every row, and Windows wants all strings used in
non-wide environ, argv, file APIs, etc to be valid in the ACP (because
it converts them to UTF-16). We would get away with it if UTF-8
weren't so picky, but come to think of it, so is SJIS, so maybe this
is not a new problem with $SUBJECT?

Wild guess: 文字化け (= mojibake) when encoded as UTF-8 and then passed in
a command line to CreateProcess() with ACP=SJIS might show the problem
(I just gave that string to iconv -f SJIS -t UTF-8 and it rejected it,
I'm assuming that means it'd do the same sort of thing in that
context).
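
A minimal sketch of that guess, using Python's codecs as a stand-in for
the ACP conversion (an assumption: Windows' own conversion tables and
substitution rules may differ in detail). It only shows that the UTF-8
bytes are not valid SJIS, and that a lossy decode does not give back
the original name:

```python
# Illustration only: Python codecs stand in for the Windows ACP conversion.
name = "文字化け"
utf8_bytes = name.encode("utf-8")

# Strict interpretation of the UTF-8 bytes as SJIS (roughly what iconv did).
try:
    utf8_bytes.decode("shift_jis")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lossy interpretation: invalid sequences are replaced instead of rejected.
lossy = utf8_bytes.decode("shift_jis", errors="replace")
round_trips = (lossy == name)

print(strict_ok, round_trips)
```

If a lossy conversion like this happens on the way into the child
process, the name it receives is no longer the name that was passed,
which would be consistent with "database ... does not exist".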
It's a shame the implicit conversion here doesn't fail with EILSEQ. I
can't imagine how anything good can ever have come from lossy,
non-error-raising implicit conversions anywhere near argv[]. On the
other hand, on Unix we have other problems stemming from the
undefinedness. What does "copy ... to '/tmp/café.txt'" do inside a
LATIN1 database? macOS: EILSEQ, can't open that file. Linux: sure,
now you have a file whose name is displayed as caf�.txt in your UTF-8
terminal or other software (U+FFFD REPLACEMENT CHARACTER).
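
To make the Linux half of that concrete, here is a small sketch of the
byte-level picture (it does not touch the filesystem; the LATIN1 file
name bytes are simply built by hand, and the terminal is modelled as a
UTF-8 decode with replacement):

```python
# The file name as a LATIN1 database would hand it to open(2).
raw = "café.txt".encode("latin-1")        # b'caf\xe9.txt'

# A strict UTF-8 consumer rejects the name outright.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# A UTF-8 terminal that substitutes instead shows U+FFFD for the 0xE9 byte.
shown = raw.decode("utf-8", errors="replace")

print(valid_utf8, ascii(shown))           # False 'caf\ufffd.txt'
```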
> 2. Just fail if the system option is enabled and we would appendShellString()
> a non-UTF8 value.

I guess the general version is just: fail if the string is not valid
in the ACP (MB_ERR_INVALID_CHARS).
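
Roughly the shape I have in mind, sketched in Python rather than C.
check_shell_arg() is a hypothetical name, not an existing function, and
the strict decode only plays the role that MultiByteToWideChar() with
MB_ERR_INVALID_CHARS would play on Windows:

```python
def check_shell_arg(raw: bytes, acp: str) -> bytes:
    """Hypothetical guard: refuse a string that is not valid in the ACP.

    'acp' is a Python codec name standing in for the active code page.
    """
    try:
        raw.decode(acp, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"argument is not valid in {acp}: {raw!r}") from exc
    return raw


# Pure ASCII passes under any ASCII-compatible ACP.
check_shell_arg(b"postgres", "utf-8")

# A LATIN1-encoded name fails when the ACP is UTF-8.
try:
    check_shell_arg("café".encode("latin-1"), "utf-8")
except ValueError as err:
    print("rejected:", err)
```

The point is only that the failure becomes explicit at the call site
instead of turning into a silently different string further down.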
With the ACP-matching idea for CLUSTER ENCODING, I *think* it should
become unreachable in the two recommended modes: either those strings
would be pure ASCII, or they'd be in database encoding (same encoding
for all databases enforced) and the ACP would match, so it would all
be aligned without any new conversions being required. It also has an
UNDEFINED mode so a failed encoding validation there would still be
reachable that way. Still thinking about it all though.