On Sun, Dec 15, 2024 at 04:50:57PM +1300, Thomas Munro wrote:
> On Sun, Dec 15, 2024 at 3:27 PM Noah Misch <noah@leadboat.com> wrote:
> > yielding mojibake
>
> Thank you for this magnificent terminology.
Thank you for reviewing.
> > What we can do safely is exit if we could not convert the command line to the
> > Windows ANSI code page (losslessly). Patch attached.
>
> This seems to make sense generally, since (based only on googling, I
> am not a Windows person and I don't even play one on TV), it doesn't
> sound like there is a way for the system to tell you that the
> conversion was lossy.
>
> For the all-ASCII test, why not us pg_is_ascii()?
pg_is_ascii() takes a char[], but this has a wchar_t[]. Maybe I should wrap
that part in a pg_wcs_is_ascii()? (We can't use GetCommandLineA(), because
"best fit" may have converted non-ASCII to ASCII.)
> I wonder if there are any legitimate uses that would fail spuriously
> due to the minutiae of normalisation, surrogates, etc. Or perhaps
> that type of thing is part of the problem you're talking about.
I wouldn't rule it out. My original plan was just to reject any wchar_t above
0xFF. That breaks innocent command lines. If GetACP()==1252, then
CreateProcessA(... 0x80 ...) puts 0x20ac (UTF-16 EURO SIGN) in
GetCommandLineW(). The conversion to GetCommandLineA() arrives back at 0x80,
so there's no reason to reject it.
> > As I mention in a code comment, I decided not to do anything about the similar
> > hazard in GetEnvironmentStringsW(). A possible alternative would be to check
> > each var referenced in PostgreSQL code or each var where the name begins with
> > "PG". Should we use one of those or another alternative?
>
> I would vote for having a table of variables we know about and
> checking them too, just on principle (it's not the characters the user
> intended). Checking every variable might reject legitimate UTF-16
> communication between parent and sub-processes other than ours, and
> even assuming that we own everything beginning "PG" seems an
> unnecessary overreach, which seems to leave only the known list idea.
Works for me. It's a little weird for psql to complain about a PGDATA value,
but the condition should be rare and easy enough for the user to fix. If
nobody else dislikes that plan, I'll give it a try. Perhaps I'd add a
src/tools/RELEASE_CHANGES task to check for new vars. Alternatively, some
check-world test could verify names appearing in getenv calls or
PQconninfoOptions.envvar fields. I don't think we have a source-tree-scanning
test like that today, but I wouldn't object to starting. We've talked about a
pgindent test, so visiting source code from a test is plausible.
> > The Perl testing calls powershell, which is new for our source tree. The
> > Internet thinks Windows Server 2008 R1 is the newest Windows where powershell
> > was optional. The buildfarm's oldest Windows is Windows Server 2008 R2. If
> > powershell is bad, there's likely a hairy way to write the test in any of
> > cmd.exe, cscript, Perl Win32::API, or C. My first choice was Perl
> > Win32::Process or IPC::Run. This test needs CreateProcessW(), but those two
> > use CreateProcessA().
>
> Sounds OK from here if it's the only sane way, and you could always skip?
A C program would be another sane way, but I would indeed opt to skip on
16-year-old Windows before rewriting in C.
> > I didn't test a CP_UTF8 Windows system. The new test coverage has a condition
> > intended to avoid failure there. (My Windows development environments are all
> > too old, and I stopped short of building a new one for this.) CP_UTF8 Windows
> > raises broader issues. I'll start a separate thread about them.
>
> Is it really still in beta?
The word "beta" appears next to its checkbox. That's all I know!
> Yeah, I can't immediately think of a good
> reason not to treat it the same as everything else and create more
> code paths, even if we think it should always succeed (but I'll look
> out for your thread, I'm quite interested in this general topic of
> "can we make a lot of Windows encoding troubles go away that way").
> It's a few extra CPU cycles but I guess it doesn't matter much.
postgr.es/m/flat/20241215023221.4d.nmisch@google.com is that thread. So far,
it's only about new breakage.
> FWIW, coincidentally, I assume, I wrote about environ and argv
> encoding problems last week, including a test program showing wchar_t
> <-> ACP weirdness before main() among other interesting related
> effects[1].
Nice. I had made a note to look at that thread eventually, but I didn't know
Windows had come up there. My observations have been consistent with your
Windows findings in [1].
> (My focus has been on how we sort out ACP vs database
> encoding and block stuff that is b0rked in *our* code, ie below
> main(), because we don't understand the platform's encoding model, not
> so much the stuff that is already b0rked in all non-UTF8, non-wmain()
> Windows programs in general before even laying hands on the program
> counter. I had got as far as wondering out loud if forcing or
> strongly recommending UTF-8 ACP should be on the agenda to head off
> various related problems and generally make Windows not so different
My initial impression is not great, since it's painful with any database
having encoding!=UTF8. The nice thing about every other ACP is that you can
treat it as a POSIX-style, encoding-oblivious sequence of bytes. (Side note:
I don't understand how that is the case for code page 932 and other variable
length, non-UTF* code pages. It seems to work, but maybe I just haven't found
the byte sequences that break.) My hypothesis is that GetACP()==CP_UTF8 would
make PostgreSQL cooperate better with rest of the Windows world, but it would
make PostgreSQL cooperate worse with other POSIX transplants. Again, that's
just an initial impression.
> [1] https://www.postgresql.org/message-id/CA%2BhUKGL%3DF0pSLLf3nLpA_-sBwYsLg7s%3DFD6YFo_PDvS84FE_hw%40mail.gmail.com