Re: Reject Windows command lines not convertible to system locale - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Reject Windows command lines not convertible to system locale
Date
Msg-id 20241215050553.45.nmisch@google.com
Whole thread Raw
In response to Re: Reject Windows command lines not convertible to system locale  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Reject Windows command lines not convertible to system locale
List pgsql-hackers
On Sun, Dec 15, 2024 at 04:50:57PM +1300, Thomas Munro wrote:
> On Sun, Dec 15, 2024 at 3:27 PM Noah Misch <noah@leadboat.com> wrote:
> > yielding mojibake
> 
> Thank you for this magnificent terminology.

Thank you for reviewing.

> > What we can do safely is exit if we could not convert the command line to the
> > Windows ANSI code page (losslessly).  Patch attached.
> 
> This seems to make sense generally, since (based only on googling, I
> am not a Windows person and I don't even play one on TV), it doesn't
> sound like there is a way for the system to tell you that the
> conversion was lossy.
> 
> For the all-ASCII test, why not us pg_is_ascii()?

pg_is_ascii() takes a char[], but this has a wchar_t[].  Maybe I should wrap
that part in a pg_wcs_is_ascii()?  (We can't use GetCommandLineA(), because
"best fit" may have converted non-ASCII to ASCII.)

> I wonder if there are any legitimate uses that would fail spuriously
> due to the minutiae of normalisation, surrogates, etc.  Or perhaps
> that type of thing is part of the problem you're talking about.

I wouldn't rule it out.  My original plan was just to reject any wchar_t above
0xFF.  That breaks innocent command lines.  If GetACP()==1252, then
CreateProcessA(... 0x80 ...) puts 0x20ac (UTF-16 EURO SIGN) in
GetCommandLineW().  The conversion to GetCommandLineA() arrives back at 0x80,
so there's no reason to reject it.

> > As I mention in a code comment, I decided not to do anything about the similar
> > hazard in GetEnvironmentStringsW().  A possible alternative would be to check
> > each var referenced in PostgreSQL code or each var where the name begins with
> > "PG".  Should we use one of those or another alternative?
> 
> I would vote for having a table of variables we know about and
> checking them too, just on principle (it's not the characters the user
> intended).  Checking every variable might reject legitimate UTF-16
> communication between parent and sub-processes other than ours, and
> even assuming that we own everything beginning "PG" seems an
> unnecessary overreach, which seems to leave only the known list idea.

Works for me.  It's a little weird for psql to complain about a PGDATA value,
but the condition should be rare and easy enough for the user to fix.  If
nobody else dislikes that plan, I'll give it a try.  Perhaps I'd add a
src/tools/RELEASE_CHANGES task to check for new vars.  Alternatively, some
check-world test could verify names appearing in getenv calls or
PQconninfoOptions.envvar fields.  I don't think we have a source-tree-scanning
test like that today, but I wouldn't object to starting.  We've talked about a
pgindent test, so visiting source code from a test is plausible.

> > The Perl testing calls powershell, which is new for our source tree.  The
> > Internet thinks Windows Server 2008 R1 is the newest Windows where powershell
> > was optional.  The buildfarm's oldest Windows is Windows Server 2008 R2.  If
> > powershell is bad, there's likely a hairy way to write the test in any of
> > cmd.exe, cscript, Perl Win32::API, or C.  My first choice was Perl
> > Win32::Process or IPC::Run.  This test needs CreateProcessW(), but those two
> > use CreateProcessA().
> 
> Sounds OK from here if it's the only sane way, and you could always skip?

A C program would be another sane way, but I would indeed opt to skip on
16-year-old Windows before rewriting in C.

> > I didn't test a CP_UTF8 Windows system.  The new test coverage has a condition
> > intended to avoid failure there.  (My Windows development environments are all
> > too old, and I stopped short of building a new one for this.)  CP_UTF8 Windows
> > raises broader issues.  I'll start a separate thread about them.
> 
> Is it really still in beta?

The word "beta" appears next to its checkbox.  That's all I know!

> Yeah, I can't immediately think of a good
> reason not to treat it the same as everything else and create more
> code paths, even if we think it should always succeed (but I'll look
> out for your thread, I'm quite interested in this general topic of
> "can we make a lot of Windows encoding troubles go away that way").
> It's a few extra CPU cycles but I guess it doesn't matter much.

postgr.es/m/flat/20241215023221.4d.nmisch@google.com is that thread.  So far,
it's only about new breakage.

> FWIW, coincidentally, I assume, I wrote about environ and argv
> encoding problems last week, including a test program showing wchar_t
> <-> ACP weirdness before main() among other interesting related
> effects[1].

Nice.  I had made a note to look at that thread eventually, but I didn't know
Windows had come up there.  My observations have been consistent with your
Windows findings in [1].

> (My focus has been on how we sort out ACP vs database
> encoding and block stuff that is b0rked in *our* code, ie below
> main(), because we don't understand the platform's encoding model, not
> so much the stuff that is already b0rked in all non-UTF8, non-wmain()
> Windows programs in general before even laying hands on the program
> counter.  I had got as far as wondering out loud if forcing or
> strongly recommending UTF-8 ACP should be on the agenda to head off
> various related problems and generally make Windows not so different

My initial impression is not great, since it's painful with any database
having encoding!=UTF8.  The nice thing about every other ACP is that you can
treat it as a POSIX-style, encoding-oblivious sequence of bytes.  (Side note:
I don't understand how that is the case for code page 932 and other variable
length, non-UTF* code pages.  It seems to work, but maybe I just haven't found
the byte sequences that break.)  My hypothesis is that GetACP()==CP_UTF8 would
make PostgreSQL cooperate better with rest of the Windows world, but it would
make PostgreSQL cooperate worse with other POSIX transplants.  Again, that's
just an initial impression.

> [1] https://www.postgresql.org/message-id/CA%2BhUKGL%3DF0pSLLf3nLpA_-sBwYsLg7s%3DFD6YFo_PDvS84FE_hw%40mail.gmail.com



pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Reject Windows command lines not convertible to system locale
Next
From: Dilip Kumar
Date:
Subject: Re: Skip collecting decoded changes of already-aborted transactions