Re: BUG #18735: Specific multibyte character in psql file path command parameter for Windows - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #18735: Specific multibyte character in psql file path command parameter for Windows
Msg-id 2572359.1733424649@sss.pgh.pa.us
In response to BUG #18735: Specific multibyte character in psql file path command parameter for Windows  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
PG Bug reporting form <noreply@postgresql.org> writes:
> Analysis:
> * The latter byte value of the character in question is the same as '\'
>   (backslash). It looks like this byte value is handled as an escape
>   character. This happens with the SHIFT JIS client encoding.
> * The issue happens in \i, \ir and \copy but does not happen in \cd, \o and
> \! command.

I imagine what is happening here is that canonicalize_path() interprets
the backslash bytes as directory separators.

The only thing I can think of to improve that is to make
canonicalize_path() encoding-aware and have it skip over multibyte
characters.  Unfortunately, I fear that would introduce as many
misbehaviors as it would remove, because we don't always know the
relevant encoding.  We might be able to limit the hazard by
confining the encoding-awareness to the initial Windows-only
conversion of '\' to '/', but it'd still be pretty squishy.
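
To make the idea concrete, here is a minimal sketch of an encoding-aware
backslash-to-slash pass. It is not PostgreSQL code: the function names are
hypothetical, and it assumes the path is known to be Shift-JIS, which is
exactly the part we cannot always know. The lead-byte ranges are those of
Windows code page 932 (0x81 to 0x9F and 0xE0 to 0xFC), an assumption that
is slightly wider than the plain Shift-JIS ranges.

```c
#include <stddef.h>

/* Assumption: lead-byte ranges of Windows code page 932 (Shift-JIS). */
static int
sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/*
 * Hypothetical helper: convert '\' to '/' in place, but leave alone any
 * 0x5C byte that is the second byte of a double-byte character.
 */
void
sjis_backslash_to_slash(char *path)
{
    unsigned char *p = (unsigned char *) path;

    while (*p != '\0')
    {
        if (sjis_is_lead_byte(*p) && p[1] != '\0')
        {
            p += 2;             /* skip both bytes of the character */
            continue;
        }
        if (*p == '\\')
            *p = '/';
        p++;
    }
}
```

For a path containing the bytes 0x95 0x5C (one double-byte character)
followed by a real separator, only the separator is rewritten. A lead byte
at the very end of the string is treated as a single byte so the loop
never steps past the terminator.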

> * The similar issue may happen if the latter byte value of a multibyte
> character is same as '/' (directory delimiter).

I don't believe Shift-JIS uses '/' as part of multibyte characters,
so it should be sufficient to consider '\'.

BTW, according to Wikipedia[1], backslash is not even part of the
Shift-JIS character set:

    The single-byte characters 0x00 to 0x7F match the ASCII encoding,
    except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at
    0x7E in place of the ASCII character set's backslash and tilde
    respectively (these deviations from ASCII align with JIS X
    0201). The single-byte characters from 0xA1 to 0xDF map to the
    half-width katakana characters found in JIS X 0201.

    For double-byte characters, the first byte is always in the range
    0x81 to 0x9F or the range 0xE0 to 0xEF (these ranges are
    unassigned in JIS X 0201). If the first byte is odd, the second
    byte must be in the range 0x40 to 0x9E (but cannot be 0x7F); if
    the first byte is even, the second byte must be in the range 0x9F to
    0xFC.

This might mean that it'd be okay to just skip the backslash-to-slash
conversion loops altogether if we think the encoding is Shift-JIS.
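
The byte ranges quoted above can be checked mechanically. The sketch
below takes the union of the two second-byte ranges (0x40 to 0xFC,
excluding 0x7F) and shows that 0x5C can occur as a second byte while 0x2F
cannot. The function names are hypothetical, and the lead-byte range is
widened to 0xFC on the assumption that Windows code page 932 is the
encoding of interest.

```c
#include <stddef.h>

/* Assumption: lead bytes as in code page 932 (0x81-0x9F, 0xE0-0xFC). */
int
sjis_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Union of the quoted second-byte ranges: 0x40-0xFC except 0x7F. */
int
sjis_can_be_trail(unsigned char c)
{
    return c >= 0x40 && c <= 0xFC && c != 0x7F;
}

/* Count how often 'target' appears as the second byte of a character. */
int
sjis_count_trail(const char *s, unsigned char target)
{
    const unsigned char *p = (const unsigned char *) s;
    int         n = 0;

    while (*p != '\0')
    {
        if (sjis_lead(*p) && p[1] != '\0')
        {
            if (p[1] == target)
                n++;
            p += 2;             /* consume the whole character */
        }
        else
            p++;
    }
    return n;
}
```

Since '/' is 0x2F, below the 0x40 floor, sjis_can_be_trail('/') is false,
which is why only '\' needs attention.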

There's still the question of how we determine the relevant encoding.
I don't think client_encoding is what to use (and we won't have that
at hand anyway, in programs other than psql).  What we want to know
is what fopen and related system calls will do with the path: they
must have different behavior for Shift-JIS than other encodings,
else none of your examples could work at all.  I assume there's
a way to find out what they think the relevant encoding is.
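
One candidate, offered only as a sketch: on Windows, GetACP() returns the
process's ANSI code page, and 932 is the code page number for Shift-JIS.
Whether that is really what fopen consults is an assumption here; the C
runtime may instead follow the locale set via setlocale(), so this would
need checking. The wrapper names below are hypothetical, and the
non-Windows branch simply returns 0 to mean "unknown".

```c
#ifdef _WIN32
#include <windows.h>
#endif

#define CODEPAGE_SHIFT_JIS 932  /* Windows code page number for Shift-JIS */

/* Hypothetical wrapper: ANSI code page on Windows, 0 (unknown) elsewhere. */
unsigned int
current_ansi_codepage(void)
{
#ifdef _WIN32
    return GetACP();
#else
    return 0;
#endif
}

/* True if the given code page number is the Shift-JIS one. */
int
codepage_is_shift_jis(unsigned int cp)
{
    return cp == CODEPAGE_SHIFT_JIS;
}
```

A caller could then guard the conversion loops with
codepage_is_shift_jis(current_ansi_codepage()), keeping today's behavior
for every other code page.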

make_native_path() adds even more fun: when should we convert '/'
back to '\'?  From the comments, this function is concerned with
producing something that will be accepted as a command-line
argument by other programs, so I wonder if we can even know what
to do with any certainty.

(In case it's not clear, I'm not volunteering to write or test
any of this.)

            regards, tom lane

[1] https://en.wikipedia.org/wiki/Shift_JIS


