Thread: Re: Reject Windows command lines not convertible to system locale

Re: Reject Windows command lines not convertible to system locale

From

Thomas Munro

Date:

15 December 2024, 06:50:57

On Sun, Dec 15, 2024 at 3:27 PM Noah Misch <noah@leadboat.com> wrote:
> yielding mojibake

Thank you for this magnificent terminology.

> What we can do safely is exit if we could not convert the command line to the
> Windows ANSI code page (losslessly).  Patch attached.

This seems to make sense generally, since (based only on googling, I
am not a Windows person and I don't even play one on TV), it doesn't
sound like there is a way for the system to tell you that the
conversion was lossy.

For the all-ASCII test, why not us pg_is_ascii()?

I wonder if there are any legitimate uses that would fail spuriously
due to the minutiae of normalisation, surrogates, etc.  Or perhaps
that type of thing is part of the problem you're talking about.

> As I mention in a code comment, I decided not to do anything about the similar
> hazard in GetEnvironmentStringsW().  A possible alternative would be to check
> each var referenced in PostgreSQL code or each var where the name begins with
> "PG".  Should we use one of those or another alternative?

I would vote for having a table of variables we know about and
checking them too, just on principle (it's not the characters the user
intended).  Checking every variable might reject legitimate UTF-16
communication between parent and sub-processes other than ours, and
even assuming that we own everything beginning "PG" seems an
unnecessary overreach, which seems to leave only the known list idea.

> The Perl testing calls powershell, which is new for our source tree.  The
> Internet thinks Windows Server 2008 R1 is the newest Windows where powershell
> was optional.  The buildfarm's oldest Windows is Windows Server 2008 R2.  If
> powershell is bad, there's likely a hairy way to write the test in any of
> cmd.exe, cscript, Perl Win32::API, or C.  My first choice was Perl
> Win32::Process or IPC::Run.  This test needs CreateProcessW(), but those two
> use CreateProcessA().

Sounds OK from here if it's the only sane way, and you could always skip?

> I didn't test a CP_UTF8 Windows system.  The new test coverage has a condition
> intended to avoid failure there.  (My Windows development environments are all
> too old, and I stopped short of building a new one for this.)  CP_UTF8 Windows
> raises broader issues.  I'll start a separate thread about them.

Is it really still in beta?  Yeah, I can't immediately think of a good
reason not to treat it the same as everything else and create more
code paths, even if we think it should always succeed (but I'll look
out for your thread, I'm quite interested in this general topic of
"can we make a lot of Windows encoding troubles go away that way").
It's a few extra CPU cycles but I guess it doesn't matter much.

> I wanted to put the check in some function that most main() instances already
> call, instead of adding ~40 new calls for this esoteric topic.  Options:
>
> 1. Call in pg_logging_init() and the backend.  Add pg_logging_init() calls to
>    pg_config, ecpg (the ecpg preprocessor), and pg_test_timing.  I picked
>    this, for two minor benefits.  First, it facilitates the backend
>    initializing LC_MESSAGES before this check.  Second, every frontend may
>    need pg_logging_init() eventually, so it's fair to add calls where missing.
>
> 2. Call in set_pglocale_pgservice() and the three programs that don't call
>    set_pglocale_pgservice(): oid2name, vacuumlo, pgbench.

+1

FWIW, coincidentally, I assume, I wrote about environ and argv
encoding problems last week, including a test program showing wchar_t
<-> ACP weirdness before main() among other interesting related
effects[1].  (My focus has been on how we sort out ACP vs database
encoding and block stuff that is b0rked in *our* code, ie below
main(), because we don't understand the platform's encoding model, not
so much the stuff that is already b0rked in all non-UTF8, non-wmain()
Windows programs in general before even laying hands on the program
counter.  I had got as far as wondering out loud if forcing or
strongly recommending UTF-8 ACP should be on the agenda to head off
various related problems and generally make Windows not so different
from Unix, and also supporting it as a native encoding properly on
that platform as experimented with in CF #3772, but much more study is
required there.)

[1] https://www.postgresql.org/message-id/CA%2BhUKGL%3DF0pSLLf3nLpA_-sBwYsLg7s%3DFD6YFo_PDvS84FE_hw%40mail.gmail.com

Re: Reject Windows command lines not convertible to system locale

From

Noah Misch

Date:

15 December 2024, 08:05:53

On Sun, Dec 15, 2024 at 04:50:57PM +1300, Thomas Munro wrote:
> On Sun, Dec 15, 2024 at 3:27 PM Noah Misch <noah@leadboat.com> wrote:
> > yielding mojibake
> 
> Thank you for this magnificent terminology.

Thank you for reviewing.

> > What we can do safely is exit if we could not convert the command line to the
> > Windows ANSI code page (losslessly).  Patch attached.
> 
> This seems to make sense generally, since (based only on googling, I
> am not a Windows person and I don't even play one on TV), it doesn't
> sound like there is a way for the system to tell you that the
> conversion was lossy.
> 
> For the all-ASCII test, why not us pg_is_ascii()?

pg_is_ascii() takes a char[], but this has a wchar_t[].  Maybe I should wrap
that part in a pg_wcs_is_ascii()?  (We can't use GetCommandLineA(), because
"best fit" may have converted non-ASCII to ASCII.)

> I wonder if there are any legitimate uses that would fail spuriously
> due to the minutiae of normalisation, surrogates, etc.  Or perhaps
> that type of thing is part of the problem you're talking about.

I wouldn't rule it out.  My original plan was just to reject any wchar_t above
0xFF.  That breaks innocent command lines.  If GetACP()==1252, then
CreateProcessA(... 0x80 ...) puts 0x20ac (UTF-16 EURO SIGN) in
GetCommandLineW().  The conversion to GetCommandLineA() arrives back at 0x80,
so there's no reason to reject it.

> > As I mention in a code comment, I decided not to do anything about the similar
> > hazard in GetEnvironmentStringsW().  A possible alternative would be to check
> > each var referenced in PostgreSQL code or each var where the name begins with
> > "PG".  Should we use one of those or another alternative?
> 
> I would vote for having a table of variables we know about and
> checking them too, just on principle (it's not the characters the user
> intended).  Checking every variable might reject legitimate UTF-16
> communication between parent and sub-processes other than ours, and
> even assuming that we own everything beginning "PG" seems an
> unnecessary overreach, which seems to leave only the known list idea.

Works for me.  It's a little weird for psql to complain about a PGDATA value,
but the condition should be rare and easy enough for the user to fix.  If
nobody else dislikes that plan, I'll give it a try.  Perhaps I'd add a
src/tools/RELEASE_CHANGES task to check for new vars.  Alternatively, some
check-world test could verify names appearing in getenv calls or
PQconninfoOptions.envvar fields.  I don't think we have a source-tree-scanning
test like that today, but I wouldn't object to starting.  We've talked about a
pgindent test, so visiting source code from a test is plausible.

> > The Perl testing calls powershell, which is new for our source tree.  The
> > Internet thinks Windows Server 2008 R1 is the newest Windows where powershell
> > was optional.  The buildfarm's oldest Windows is Windows Server 2008 R2.  If
> > powershell is bad, there's likely a hairy way to write the test in any of
> > cmd.exe, cscript, Perl Win32::API, or C.  My first choice was Perl
> > Win32::Process or IPC::Run.  This test needs CreateProcessW(), but those two
> > use CreateProcessA().
> 
> Sounds OK from here if it's the only sane way, and you could always skip?

A C program would be another sane way, but I would indeed opt to skip on
16-year-old Windows before rewriting in C.

> > I didn't test a CP_UTF8 Windows system.  The new test coverage has a condition
> > intended to avoid failure there.  (My Windows development environments are all
> > too old, and I stopped short of building a new one for this.)  CP_UTF8 Windows
> > raises broader issues.  I'll start a separate thread about them.
> 
> Is it really still in beta?

The word "beta" appears next to its checkbox.  That's all I know!

> Yeah, I can't immediately think of a good
> reason not to treat it the same as everything else and create more
> code paths, even if we think it should always succeed (but I'll look
> out for your thread, I'm quite interested in this general topic of
> "can we make a lot of Windows encoding troubles go away that way").
> It's a few extra CPU cycles but I guess it doesn't matter much.

postgr.es/m/flat/20241215023221.4d.nmisch@google.com is that thread.  So far,
it's only about new breakage.

> FWIW, coincidentally, I assume, I wrote about environ and argv
> encoding problems last week, including a test program showing wchar_t
> <-> ACP weirdness before main() among other interesting related
> effects[1].

Nice.  I had made a note to look at that thread eventually, but I didn't know
Windows had come up there.  My observations have been consistent with your
Windows findings in [1].

> (My focus has been on how we sort out ACP vs database
> encoding and block stuff that is b0rked in *our* code, ie below
> main(), because we don't understand the platform's encoding model, not
> so much the stuff that is already b0rked in all non-UTF8, non-wmain()
> Windows programs in general before even laying hands on the program
> counter.  I had got as far as wondering out loud if forcing or
> strongly recommending UTF-8 ACP should be on the agenda to head off
> various related problems and generally make Windows not so different

My initial impression is not great, since it's painful with any database
having encoding!=UTF8.  The nice thing about every other ACP is that you can
treat it as a POSIX-style, encoding-oblivious sequence of bytes.  (Side note:
I don't understand how that is the case for code page 932 and other variable
length, non-UTF* code pages.  It seems to work, but maybe I just haven't found
the byte sequences that break.)  My hypothesis is that GetACP()==CP_UTF8 would
make PostgreSQL cooperate better with rest of the Windows world, but it would
make PostgreSQL cooperate worse with other POSIX transplants.  Again, that's
just an initial impression.

> [1] https://www.postgresql.org/message-id/CA%2BhUKGL%3DF0pSLLf3nLpA_-sBwYsLg7s%3DFD6YFo_PDvS84FE_hw%40mail.gmail.com

Re: Reject Windows command lines not convertible to system locale

From

Thomas Munro

Date:

15 December 2024, 17:44:04

On Sun, Dec 15, 2024 at 6:05 PM Noah Misch <noah@leadboat.com> wrote:
> On Sun, Dec 15, 2024 at 04:50:57PM +1300, Thomas Munro wrote:
> > For the all-ASCII test, why not us pg_is_ascii()?
>
> pg_is_ascii() takes a char[], but this has a wchar_t[].  Maybe I should wrap
> that part in a pg_wcs_is_ascii()?  (We can't use GetCommandLineA(), because
> "best fit" may have converted non-ASCII to ASCII.)

Right, sorry, brain fade.  No issue with the open-coding here if you
want an early exit (I'd be tempted to just not have it for simplicity
but no strong opinion).