Re: Reject Windows command lines not convertible to system locale - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: Reject Windows command lines not convertible to system locale |
Date | |
Msg-id | 20241215050553.45.nmisch@google.com Whole thread Raw |
In response to | Re: Reject Windows command lines not convertible to system locale (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: Reject Windows command lines not convertible to system locale
|
List | pgsql-hackers |
On Sun, Dec 15, 2024 at 04:50:57PM +1300, Thomas Munro wrote: > On Sun, Dec 15, 2024 at 3:27 PM Noah Misch <noah@leadboat.com> wrote: > > yielding mojibake > > Thank you for this magnificent terminology. Thank you for reviewing. > > What we can do safely is exit if we could not convert the command line to the > > Windows ANSI code page (losslessly). Patch attached. > > This seems to make sense generally, since (based only on googling, I > am not a Windows person and I don't even play one on TV), it doesn't > sound like there is a way for the system to tell you that the > conversion was lossy. > > For the all-ASCII test, why not us pg_is_ascii()? pg_is_ascii() takes a char[], but this has a wchar_t[]. Maybe I should wrap that part in a pg_wcs_is_ascii()? (We can't use GetCommandLineA(), because "best fit" may have converted non-ASCII to ASCII.) > I wonder if there are any legitimate uses that would fail spuriously > due to the minutiae of normalisation, surrogates, etc. Or perhaps > that type of thing is part of the problem you're talking about. I wouldn't rule it out. My original plan was just to reject any wchar_t above 0xFF. That breaks innocent command lines. If GetACP()==1252, then CreateProcessA(... 0x80 ...) puts 0x20ac (UTF-16 EURO SIGN) in GetCommandLineW(). The conversion to GetCommandLineA() arrives back at 0x80, so there's no reason to reject it. > > As I mention in a code comment, I decided not to do anything about the similar > > hazard in GetEnvironmentStringsW(). A possible alternative would be to check > > each var referenced in PostgreSQL code or each var where the name begins with > > "PG". Should we use one of those or another alternative? > > I would vote for having a table of variables we know about and > checking them too, just on principle (it's not the characters the user > intended). Checking every variable might reject legitimate UTF-16 > communication between parent and sub-processes other than ours, and > even assuming that we own everything beginning "PG" seems an > unnecessary overreach, which seems to leave only the known list idea. Works for me. It's a little weird for psql to complain about a PGDATA value, but the condition should be rare and easy enough for the user to fix. If nobody else dislikes that plan, I'll give it a try. Perhaps I'd add a src/tools/RELEASE_CHANGES task to check for new vars. Alternatively, some check-world test could verify names appearing in getenv calls or PQconninfoOptions.envvar fields. I don't think we have a source-tree-scanning test like that today, but I wouldn't object to starting. We've talked about a pgindent test, so visiting source code from a test is plausible. > > The Perl testing calls powershell, which is new for our source tree. The > > Internet thinks Windows Server 2008 R1 is the newest Windows where powershell > > was optional. The buildfarm's oldest Windows is Windows Server 2008 R2. If > > powershell is bad, there's likely a hairy way to write the test in any of > > cmd.exe, cscript, Perl Win32::API, or C. My first choice was Perl > > Win32::Process or IPC::Run. This test needs CreateProcessW(), but those two > > use CreateProcessA(). > > Sounds OK from here if it's the only sane way, and you could always skip? A C program would be another sane way, but I would indeed opt to skip on 16-year-old Windows before rewriting in C. > > I didn't test a CP_UTF8 Windows system. The new test coverage has a condition > > intended to avoid failure there. (My Windows development environments are all > > too old, and I stopped short of building a new one for this.) CP_UTF8 Windows > > raises broader issues. I'll start a separate thread about them. > > Is it really still in beta? The word "beta" appears next to its checkbox. That's all I know! > Yeah, I can't immediately think of a good > reason not to treat it the same as everything else and create more > code paths, even if we think it should always succeed (but I'll look > out for your thread, I'm quite interested in this general topic of > "can we make a lot of Windows encoding troubles go away that way"). > It's a few extra CPU cycles but I guess it doesn't matter much. postgr.es/m/flat/20241215023221.4d.nmisch@google.com is that thread. So far, it's only about new breakage. > FWIW, coincidentally, I assume, I wrote about environ and argv > encoding problems last week, including a test program showing wchar_t > <-> ACP weirdness before main() among other interesting related > effects[1]. Nice. I had made a note to look at that thread eventually, but I didn't know Windows had come up there. My observations have been consistent with your Windows findings in [1]. > (My focus has been on how we sort out ACP vs database > encoding and block stuff that is b0rked in *our* code, ie below > main(), because we don't understand the platform's encoding model, not > so much the stuff that is already b0rked in all non-UTF8, non-wmain() > Windows programs in general before even laying hands on the program > counter. I had got as far as wondering out loud if forcing or > strongly recommending UTF-8 ACP should be on the agenda to head off > various related problems and generally make Windows not so different My initial impression is not great, since it's painful with any database having encoding!=UTF8. The nice thing about every other ACP is that you can treat it as a POSIX-style, encoding-oblivious sequence of bytes. (Side note: I don't understand how that is the case for code page 932 and other variable length, non-UTF* code pages. It seems to work, but maybe I just haven't found the byte sequences that break.) My hypothesis is that GetACP()==CP_UTF8 would make PostgreSQL cooperate better with rest of the Windows world, but it would make PostgreSQL cooperate worse with other POSIX transplants. Again, that's just an initial impression. > [1] https://www.postgresql.org/message-id/CA%2BhUKGL%3DF0pSLLf3nLpA_-sBwYsLg7s%3DFD6YFo_PDvS84FE_hw%40mail.gmail.com
pgsql-hackers by date: