Thread: BUG #18735: Specific multibyte character in psql file path command parameter for Windows

The following bug has been logged on the website:

Bug reference:      18735
Logged by:          Koichi Suzuki
Email address:      koichi.dbms@gmail.com
PostgreSQL version: 17.2
Operating system:   Windows 10, Japanese language version
Description:

In psql for Windows 10, version 17.2, some multibyte character in the file
path of psql command causes error such as:
=======
Server [localhost]:
Database [postgres]:
Port [5432]:
Username [postgres]:
Client Encoding [SJIS]:
ユーザー postgres のパスワード:
psql (17.2)
"help"でヘルプを表示します。

postgres=# \cd 'c:/work/09_環境構築'
postgres=# \o 'c:/work/09_環境構築/work.out'
postgres=# select * from pg_database;
postgres=# \o
postgres=# \! type c:\work\09_環境構築\work.out
 oid |  datname  | datdba | encoding | datlocprovider | datistemplate |
datallowconn | dathasloginevt | datconnlimit | datfrozenxid | datminmxid |
dattablespace |     datcollate     |      datctype      | datlocale |
daticurules | datcollversion |               datacl

-----+-----------+--------+----------+----------------+---------------+--------------+----------------+--------------+--------------+------------+---------------+--------------------+--------------------+-----------+-------------+----------------+-------------------------------------
   5 | postgres  |     10 |        6 | c              | f             | t
        | f              |           -1 |          731 |          1 |
  1663 | Japanese_Japan.932 | Japanese_Japan.932 |           |             |
               |
   1 | template1 |     10 |        6 | c              | t             | t
        | f              |           -1 |          731 |          1 |
  1663 | Japanese_Japan.932 | Japanese_Japan.932 |           |             |
               | {=c/postgres,postgres=CTc/postgres}
   4 | template0 |     10 |        6 | c              | t             | f
        | f              |           -1 |          731 |          1 |
  1663 | Japanese_Japan.932 | Japanese_Japan.932 |           |             |
               | {=c/postgres,postgres=CTc/postgres}
(3 行)

postgres=#
postgres=# \i 'c:/work/09_環境構築/sqmple.sql'
c:/work/09_環境穀z/sqmple.sql: No such file or directory
postgres=# copy pg_database to 'c:/work/09_環境構築/database.out'
postgres-# \copy pg_database to 'c:/work/09_環境構築/database.csv' with (format
csv, header)
c:/work/09_環境・築/database.csv: No such file or directory
=====

Analysis:
* Latter byte valueof the character in question is same as '\' (backslash).
  It looks that this byte value is handled as escape characters.   This
happns SHIFT JIS client encoding.
* The issue happens in \i, \ir and \copy but does not happen in \cd, \o and
\! command.
* The similar issue may happen if the latter byte value of a multibyte
character is same as '/' (directory delimiter).


PG Bug reporting form <noreply@postgresql.org> writes:
> Analysis:
> * Latter byte valueof the character in question is same as '\' (backslash). 
>   It looks that this byte value is handled as escape characters.   This
> happns SHIFT JIS client encoding.
> * The issue happens in \i, \ir and \copy but does not happen in \cd, \o and
> \! command.

I imagine what is happening here is that canonicalize_path() interprets
the backslash bytes as directory separators.

The only thing I can think of to improve that is to make
canonicalize_path() encoding-aware and have it skip over multibyte
characters.  Unfortunately, I fear that would introduce as many
misbehaviors as it would remove, because we don't always know the
relevant encoding.  We might be able to limit the hazard by
confining the encoding-awareness to the initial Windows-only
conversion of '\' to '/', but it'd still be pretty squishy.

> * The similar issue may happen if the latter byte value of a multibyte
> character is same as '/' (directory delimiter).

I don't believe Shift-JIS uses '/' as part of multibyte characters,
so it should be sufficient to consider '\'.

BTW, according to wikipedia[1], backslash is not even part of the
Shift-JIS character set:

    The single-byte characters 0x00 to 0x7F match the ASCII encoding,
    except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at
    0x7E in place of the ASCII character set's backslash and tilde
    respectively (these deviations from ASCII align with JIS X
    0201). The single-byte characters from 0xA1 to 0xDF map to the
    half-width katakana characters found in JIS X 0201.

    For double-byte characters, the first byte is always in the range
    0x81 to 0x9F or the range 0xE0 to 0xEF (these ranges are
    unassigned in JIS X 0201). If the first byte is odd, the second
    byte must be in the range 0x40 to 0x9E (but cannot be 0x7F); if
    the first byte is even, the second byte must in the range 0x9F to
    0xFC.

This might mean that it'd be okay to just skip the backslash-to-slash
conversion loops altogether if we think the encoding is Shift-JIS.

There's still the question of how we determine the relevant encoding.
I don't think client_encoding is what to use (and we won't have that
at hand anyway, in programs other than psql).  What we want to know
is what fopen and related system calls will do with the path: they
must have different behavior for Shift-JIS than other encodings,
else none of your examples could work at all.  I assume there's
a way to find out what they think the relevant encoding is.

make_native_path() adds even more fun: when should we convert '/'
back to '\'?  From the comments, this function is concerned with
producing something that will be accepted as a command-line
argument by other programs, so I wonder if we can even know what
to do with any certainty.

(In case it's not clear, I'm not volunteering to write or test
any of this.)

            regards, tom lane

[1] https://en.wikipedia.org/wiki/Shift_JIS



Hello;

Very short response.

Lexical analysis of backshash commands in psql is handled by psqlscanslash.l and this module scans iput byte-by-byte, not character-by-character.   I'm afraid that the cause of the bug is in this part..   Is there any way to make this flex syntax local-dependent?

We need to analyze the behavior of this flex module to get practical idea for fix.

Regards;


2024年12月6日(金) 3:50 Tom Lane <tgl@sss.pgh.pa.us>:
PG Bug reporting form <noreply@postgresql.org> writes:
> Analysis:
> * Latter byte valueof the character in question is same as '\' (backslash).
>   It looks that this byte value is handled as escape characters.   This
> happns SHIFT JIS client encoding.
> * The issue happens in \i, \ir and \copy but does not happen in \cd, \o and
> \! command.

I imagine what is happening here is that canonicalize_path() interprets
the backslash bytes as directory separators.

The only thing I can think of to improve that is to make
canonicalize_path() encoding-aware and have it skip over multibyte
characters.  Unfortunately, I fear that would introduce as many
misbehaviors as it would remove, because we don't always know the
relevant encoding.  We might be able to limit the hazard by
confining the encoding-awareness to the initial Windows-only
conversion of '\' to '/', but it'd still be pretty squishy.

> * The similar issue may happen if the latter byte value of a multibyte
> character is same as '/' (directory delimiter).

I don't believe Shift-JIS uses '/' as part of multibyte characters,
so it should be sufficient to consider '\'.

BTW, according to wikipedia[1], backslash is not even part of the
Shift-JIS character set:

    The single-byte characters 0x00 to 0x7F match the ASCII encoding,
    except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at
    0x7E in place of the ASCII character set's backslash and tilde
    respectively (these deviations from ASCII align with JIS X
    0201). The single-byte characters from 0xA1 to 0xDF map to the
    half-width katakana characters found in JIS X 0201.

    For double-byte characters, the first byte is always in the range
    0x81 to 0x9F or the range 0xE0 to 0xEF (these ranges are
    unassigned in JIS X 0201). If the first byte is odd, the second
    byte must be in the range 0x40 to 0x9E (but cannot be 0x7F); if
    the first byte is even, the second byte must in the range 0x9F to
    0xFC.

This might mean that it'd be okay to just skip the backslash-to-slash
conversion loops altogether if we think the encoding is Shift-JIS.

There's still the question of how we determine the relevant encoding.
I don't think client_encoding is what to use (and we won't have that
at hand anyway, in programs other than psql).  What we want to know
is what fopen and related system calls will do with the path: they
must have different behavior for Shift-JIS than other encodings,
else none of your examples could work at all.  I assume there's
a way to find out what they think the relevant encoding is.

make_native_path() adds even more fun: when should we convert '/'
back to '\'?  From the comments, this function is concerned with
producing something that will be accepted as a command-line
argument by other programs, so I wonder if we can even know what
to do with any certainty.

(In case it's not clear, I'm not volunteering to write or test
any of this.)

                        regards, tom lane

[1] https://en.wikipedia.org/wiki/Shift_JIS
Koichi Suzuki <koichi.dbms@gmail.com> writes:
> Lexical analysis of backshash commands in psql is handled
> by psqlscanslash.l and this module scans iput byte-by-byte, not
> character-by-character.   I'm afraid that the cause of the bug is in
> this part..   Is there any way to make this flex syntax local-dependent?

I think you're mistaken: see the "safe_encoding" hackery in
psqlscan.l (which does also operate for the rules in psqlscanslash.l).
Now it could be that psqlscan.l has been misinformed about the
encoding that's in use, but that wouldn't be its fault.

            regards, tom lane



> I don't believe Shift-JIS uses '/' as part of multibyte characters,

Correct.

> so it should be sufficient to consider '\'.

Agreed.

> BTW, according to wikipedia[1], backslash is not even part of the
> Shift-JIS character set:
> 
>     The single-byte characters 0x00 to 0x7F match the ASCII encoding,
>     except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at
>     0x7E in place of the ASCII character set's backslash and tilde
>     respectively (these deviations from ASCII align with JIS X
>     0201). The single-byte characters from 0xA1 to 0xDF map to the
>     half-width katakana characters found in JIS X 0201.
> 
>     For double-byte characters, the first byte is always in the range
>     0x81 to 0x9F or the range 0xE0 to 0xEF (these ranges are
>     unassigned in JIS X 0201). If the first byte is odd, the second
>     byte must be in the range 0x40 to 0x9E (but cannot be 0x7F); if
>     the first byte is even, the second byte must in the range 0x9F to
>     0xFC.
> 
> This might mean that it'd be okay to just skip the backslash-to-slash
> conversion loops altogether if we think the encoding is Shift-JIS.

I suggest to not do so because majority of Shift-JIS users treat 0x5C
as a backslash. They understand that a 0x5C means a backslash in
Shift-JIS files if the files are for programming (source code) or for
the technical documentations and so on.

Best reagards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp







2024年12月6日(金) 14:21 Tatsuo Ishii <ishii@postgresql.org>:
> I don't believe Shift-JIS uses '/' as part of multibyte characters,

Correct.

> so it should be sufficient to consider '\'.

Agreed.

> BTW, according to wikipedia[1], backslash is not even part of the
> Shift-JIS character set:
>
>     The single-byte characters 0x00 to 0x7F match the ASCII encoding,
>     except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at
>     0x7E in place of the ASCII character set's backslash and tilde
>     respectively (these deviations from ASCII align with JIS X
>     0201). The single-byte characters from 0xA1 to 0xDF map to the
>     half-width katakana characters found in JIS X 0201.
>
>     For double-byte characters, the first byte is always in the range
>     0x81 to 0x9F or the range 0xE0 to 0xEF (these ranges are
>     unassigned in JIS X 0201). If the first byte is odd, the second
>     byte must be in the range 0x40 to 0x9E (but cannot be 0x7F); if
>     the first byte is even, the second byte must in the range 0x9F to
>     0xFC.
>
> This might mean that it'd be okay to just skip the backslash-to-slash
> conversion loops altogether if we think the encoding is Shift-JIS.

I suggest to not do so because majority of Shift-JIS users treat 0x5C
as a backslash. They understand that a 0x5C means a backslash in
Shift-JIS files if the files are for programming (source code) or for
the technical documentations and so on.

Better way is to treat 'backslash' byte value in the latter byte of SJIS-encoded character as is, not treat this byte as escape character.

I'm not sure if we can fix src/fe_utils/psqlscan.l and/or src/bin/psql/psqlscanslash.l for this.  Needs some more investigqation.


Best reagards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

All the best and thanks to all the kind inputs.
Tatsuo Ishii <ishii@postgresql.org> writes:
>> This might mean that it'd be okay to just skip the backslash-to-slash
>> conversion loops altogether if we think the encoding is Shift-JIS.

> I suggest to not do so because majority of Shift-JIS users treat 0x5C
> as a backslash. They understand that a 0x5C means a backslash in
> Shift-JIS files if the files are for programming (source code) or for
> the technical documentations and so on.

Sure, we can do it that way.  I think the hard part is figuring
out whether Windows thinks the file names are in Shift-JIS.
Do you have any idea about finding that out?

            regards, tom lane



Koichi Suzuki <koichi.dbms@gmail.com> writes:
> I'm not sure if we can fix src/fe_utils/psqlscan.l and/or
> src/bin/psql/psqlscanslash.l for this.  Needs some more investigqation.

I don't believe the theory that the fault lies there.  If the flex
rules were taking the backslash-embedded-in-a-shift-JIS-character
as a backslash, they would think it is the start of a new backslash
command, with the result being that the filename argument gets
truncated there.  That doesn't match the reported symptoms: we
see more of the filename than that echoed back in the error message.
So I think the filename is getting through that part just fine,
and then we're messing it up in canonicalize_path or adjacent
processing.

            regards, tom lane



In the Japanese version of Windows, file names are in Shift-JIS. 

For sure, we need to check client_encoding.

Regards;


2024年12月6日(金) 14:44 Tom Lane <tgl@sss.pgh.pa.us>:
Tatsuo Ishii <ishii@postgresql.org> writes:
>> This might mean that it'd be okay to just skip the backslash-to-slash
>> conversion loops altogether if we think the encoding is Shift-JIS.

> I suggest to not do so because majority of Shift-JIS users treat 0x5C
> as a backslash. They understand that a 0x5C means a backslash in
> Shift-JIS files if the files are for programming (source code) or for
> the technical documentations and so on.

Sure, we can do it that way.  I think the hard part is figuring
out whether Windows thinks the file names are in Shift-JIS.
Do you have any idea about finding that out?

                        regards, tom lane
Hello;

We need to investigate how backslash in the latter byte of Shift-JIS encoded string is handled in psql.


2024年12月6日(金) 15:11 Koichi Suzuki <koichi.dbms@gmail.com>:
In the Japanese version of Windows, file names are in Shift-JIS. 

For sure, we need to check client_encoding.

Regards;


2024年12月6日(金) 14:44 Tom Lane <tgl@sss.pgh.pa.us>:
Tatsuo Ishii <ishii@postgresql.org> writes:
>> This might mean that it'd be okay to just skip the backslash-to-slash
>> conversion loops altogether if we think the encoding is Shift-JIS.

> I suggest to not do so because majority of Shift-JIS users treat 0x5C
> as a backslash. They understand that a 0x5C means a backslash in
> Shift-JIS files if the files are for programming (source code) or for
> the technical documentations and so on.

Sure, we can do it that way.  I think the hard part is figuring
out whether Windows thinks the file names are in Shift-JIS.
Do you have any idea about finding that out?

                        regards, tom lane
> Tatsuo Ishii <ishii@postgresql.org> writes:
>>> This might mean that it'd be okay to just skip the backslash-to-slash
>>> conversion loops altogether if we think the encoding is Shift-JIS.
> 
>> I suggest to not do so because majority of Shift-JIS users treat 0x5C
>> as a backslash. They understand that a 0x5C means a backslash in
>> Shift-JIS files if the files are for programming (source code) or for
>> the technical documentations and so on.
> 
> Sure, we can do it that way.  I think the hard part is figuring
> out whether Windows thinks the file names are in Shift-JIS.
> Do you have any idea about finding that out?

I am not familiar with Windows and have no idea, but from the original
report:

postgres=# \cd 'c:/work/09_環境構築'

it seems the KANJI characters are shown properly even if a part of the
KANJI characters include a backslash. This could indicate Windows
thinks the file names are in Shift-JIS.

Best reagards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



> I don't believe the theory that the fault lies there.  If the flex
> rules were taking the backslash-embedded-in-a-shift-JIS-character
> as a backslash, they would think it is the start of a new backslash
> command, with the result being that the filename argument gets
> truncated there.  That doesn't match the reported symptoms: we
> see more of the filename than that echoed back in the error message.
> So I think the filename is getting through that part just fine,
> and then we're messing it up in canonicalize_path or adjacent
> processing.

I have looked into canonicalize_path() and found this:

#ifdef WIN32

    /*
     * The Windows command processor will accept suitably quoted paths with
     * forward slashes, but barfs badly with mixed forward and back slashes.
     */
    for (p = path; *p; p++)
    {
        if (*p == '\\')
            *p = '/';
    }

Here "path" is the filename encoded in Shift-JIS I think.  It seems
canonicalize_path() unconditionaly replaces a backslash with a slash.
For me this seems to break any Shift-JIS KANJI characters that a
backslash in the second byte.

Best reagards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Tatsuo Ishii <ishii@postgresql.org> writes:
> I have looked into canonicalize_path() and found this:

>         if (*p == '\\')
>             *p = '/';

Right, that's where the trouble is.  It'd be easy enough to make
that loop (and the similar one in cleanup_path) encoding-aware,
if we knew what encoding applies.  Deciding that is the sticky part.

After sleeping on it, I'm coming around to the opinion that
client_encoding (pset.encoding) is what to use in psql, for
two reasons:
* we already do our best to set that correctly, and the user
is able to change it if it's wrong;
* as previously noted, psqlscan.l will do the wrong things
if it's not set correctly, so you're probably already hosed
if working in a non-server-safe encoding with the wrong
setting of client_encoding.

However, there are a bunch of callers of canonicalize_path()
that are not in psql, and those arguments don't apply to them;
in fact places like initdb and pg_ctl don't really have a
concept of client encoding at all.  So what to do?

After looking through the callers I think we might not be in as bad
shape as this sounds, because all of the other callers are dealing
with Postgres installation paths or data directory-related paths that
are also dealt with by the server.  So it's not unreasonable to
require that those paths must be written in server-safe encodings.
If they're not, you're going to have trouble with stuff like
"show data_directory".

I wonder whether we ought to try to enforce that.  It'd be feasible
I think for initdb to verify that the selected paths are validly
encoded according to whatever encoding it's about to set the server
up with.  If we were feeling draconian we could insist that
the installation path and data directory path be all-ASCII, which
is the only way to be sure that you won't have issues if you later
create a database that uses some other encoding.  But I think we'd
likely get pushback from that.  (This ties into the nearby
discussion about encoding of shared-catalog names [1], which is
more or less the same problem --- maybe the path encoding checks
could vary depending on how we're setting that up?)

Anyway, what I'm now thinking is that we can have two variants
of canonicalize_path:
    extern void canonicalize_path(char *path);
    extern void canonicalize_path_enc(char *path, int encoding);
The first one assumes a server-safe encoding, the second doesn't,
and at least to start with only psql would bother with the second.

It looks like we don't need cleanup_path_enc, not yet anyway,
since that's only applied to installation paths.

I am also guessing that we don't need an encoding-aware variant
of make_native_path: since it only changes '/' it can't create
an incorrectly encoded path, assuming the input is OK.  However,
this is assuming that it's okay to use '\' as a Windows directory
separator even in shift-JIS, which I'm not too sure about.

            regards, tom lane

[1] https://www.postgresql.org/message-id/CA%2BhUKGKDC7tKMZ1v0JGH5D23F-%3DADf-3UfcriVepqoi7Q_SKgQ%40mail.gmail.com



I wrote:
> Anyway, what I'm now thinking is that we can have two variants
> of canonicalize_path:
>     extern void canonicalize_path(char *path);
>     extern void canonicalize_path_enc(char *path, int encoding);
> The first one assumes a server-safe encoding, the second doesn't,
> and at least to start with only psql would bother with the second.

I thought that part would be trivial, but there's a small annoying
problem.  The obvious way to write the encoding-aware version of
the de-backslashing loop is to use pg_encoding_mblen_bounded().
However, that function is in src/common/wchar.c while path.c
is in src/port/ --- and I believe we have a rule that libpgport
can't depend on libpgcommon.  (The dependencies go the other way,
instead.)

Now it's pretty dubious that path.c is in src/port/ at all, because
it does not meet the expectation that that directory is for
functions that replace missing system-library functionality.
(We've trodden pretty hard on that expectation over the years,
but whatever.)  So one reasonable fix could be to move path.c
to src/common, but I'm concerned that that would be unsafe to
back-patch.  Also, we'd really want to move the externs for
path.c out of port.h, which would cause additional code churn
for callers.

Another way, given that we only really need this to work for
SJIS, is to hard-wire the logic into path.c --- it's not like
pg_sjis_mblen is either long or likely to change.  That's
ugly but would be a lot less invasive and safer to back-patch.

I'm leaning a bit to the second way, mainly because of the
extern-relocation annoyance.

            regards, tom lane



> Tatsuo Ishii <ishii@postgresql.org> writes:
>> I have looked into canonicalize_path() and found this:
> 
>>         if (*p == '\\')
>>             *p = '/';
> 
> Right, that's where the trouble is.  It'd be easy enough to make
> that loop (and the similar one in cleanup_path) encoding-aware,
> if we knew what encoding applies.  Deciding that is the sticky part.
> 
> After sleeping on it, I'm coming around to the opinion that
> client_encoding (pset.encoding) is what to use in psql, for
> two reasons:
> * we already do our best to set that correctly, and the user
> is able to change it if it's wrong;
> * as previously noted, psqlscan.l will do the wrong things
> if it's not set correctly, so you're probably already hosed
> if working in a non-server-safe encoding with the wrong
> setting of client_encoding.

I think the encoding we need to supply to canonicalize_path() is not
necessarily the same as client_encoding. For example we could set
client_encoding to UTF-8 but use a file which has Shift-JIS encode
file name.  I think what we really need to supply to
canonicalize_path() is the "file system encoding", not
client_encoding.

Among the file system encodings, the only problematic one is
Shift-JIS. As far as I know, currently there's no OS except Windows
which uses Shift-JIS as the file system encoding. So probably we can
safely assume that if the OS is Windows for Japanese, we can assume
that the file system encoding is Shift-JIS. If we know how to
determine the OS is Windows for Japanese inside the
canonicalize_path(), we don't need to change the API of it.

Quick gooling found this page (sorry, in Japanese)
https://tarenagashi.hatenablog.jp/entry/2023/07/17/160149
and it says:

- In Windows "system locale" represents the language/country used.

- The code for system locale is called "LCID" and it's 1041 (decimal)
  for Japanese/Japan.

- There are some APIs to obtain LCID (GetSystemDefaultLocaleName etc.)

As I am not familiar with Windows and I cannot test these. Can someone
confirm?

Best reagards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Tatsuo Ishii <ishii@postgresql.org> writes:
> I think the encoding we need to supply to canonicalize_path() is not
> necessarily the same as client_encoding. For example we could set
> client_encoding to UTF-8 but use a file which has Shift-JIS encode
> file name.  I think what we really need to supply to
> canonicalize_path() is the "file system encoding", not
> client_encoding.

That was what I was thinking yesterday, but seems to me it could
not really work to have client_encoding set to UTF-8 while you're
trying to type an SJIS file name.  Even if your terminal program
doesn't mangle anything, it's likely that psqlscan.l will.

I suppose if we think that's the situation, we could try to translate
the file path names from UTF8 to SJIS.  But that's a chunk of
functionality that doesn't exist right now, plus I'm afraid it'd
often make things worse not better.  We don't have nearly as much
certainty as we could wish about which encoding incoming data is in.

            regards, tom lane



> That was what I was thinking yesterday, but seems to me it could
> not really work to have client_encoding set to UTF-8 while you're
> trying to type an SJIS file name.  Even if your terminal program
> doesn't mangle anything, it's likely that psqlscan.l will.

Ok. So if the client encoding is not a "safe" encoding like Shift-JIS,
we have to use the same encoding for both the client encoding and the
file system encoding. I don't think this is an unreasonable
limitation.

> I suppose if we think that's the situation, we could try to translate
> the file path names from UTF8 to SJIS.  But that's a chunk of
> functionality that doesn't exist right now, plus I'm afraid it'd
> often make things worse not better.  We don't have nearly as much
> certainty as we could wish about which encoding incoming data is in.

Yes. Also in PostgreSQL encoding conversion is extensible and the
feature is only available in backend.

Best reagards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



> Another way, given that we only really need this to work for
> SJIS, is to hard-wire the logic into path.c --- it's not like
> pg_sjis_mblen is either long or likely to change.  That's
> ugly but would be a lot less invasive and safer to back-patch.
> 
> I'm leaning a bit to the second way, mainly because of the
> extern-relocation annoyance.

+1. I believe the logic of detecting byte length in Shift-JIS will not
be changed.

Best reagards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Tatsuo Ishii <ishii@postgresql.org> writes:
>> Another way, given that we only really need this to work for
>> SJIS, is to hard-wire the logic into path.c --- it's not like
>> pg_sjis_mblen is either long or likely to change.  That's
>> ugly but would be a lot less invasive and safer to back-patch.
>> I'm leaning a bit to the second way, mainly because of the
>> extern-relocation annoyance.

> +1. I believe the logic of detecting byte length in Shift-JIS will not
> be changed.

OK.  I know I said I wasn't going to write this, but here's a draft
patch that assumes we need only touch psql and can rely on its idea
of client_encoding.

I'm not in a position to test this.

            regards, tom lane

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 1f3cbb11f7..bffe614f00 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1201,7 +1201,7 @@ exec_command_edit(PsqlScanState scan_state, bool active_branch,
                 expand_tilde(&fname);
                 if (fname)
                 {
-                    canonicalize_path(fname);
+                    canonicalize_path_enc(fname, pset.encoding);
                     /* Always clear buffer if the file isn't modified */
                     discard_on_quit = true;
                 }
@@ -2819,7 +2819,7 @@ exec_command_write(PsqlScanState scan_state, bool active_branch,
                 }
                 else
                 {
-                    canonicalize_path(fname);
+                    canonicalize_path_enc(fname, pset.encoding);
                     fd = fopen(fname, "w");
                 }
                 if (!fd)
@@ -4423,7 +4423,7 @@ process_file(char *filename, bool use_relative_path)
     }
     else if (strcmp(filename, "-") != 0)
     {
-        canonicalize_path(filename);
+        canonicalize_path_enc(filename, pset.encoding);

         /*
          * If we were asked to resolve the pathname relative to the location
@@ -4437,7 +4437,7 @@ process_file(char *filename, bool use_relative_path)
             strlcpy(relpath, pset.inputfile, sizeof(relpath));
             get_parent_directory(relpath);
             join_path_components(relpath, relpath, filename);
-            canonicalize_path(relpath);
+            canonicalize_path_enc(relpath, pset.encoding);

             filename = relpath;
         }
diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c
index e020e4d665..1cfe31f164 100644
--- a/src/bin/psql/copy.c
+++ b/src/bin/psql/copy.c
@@ -280,7 +280,7 @@ do_copy(const char *args)

     /* prepare to read or write the target file */
     if (options->file && !options->program)
-        canonicalize_path(options->file);
+        canonicalize_path_enc(options->file, pset.encoding);

     if (options->from)
     {
diff --git a/src/include/port.h b/src/include/port.h
index ba9ab0d34f..1665ac0258 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -53,6 +53,7 @@ extern char *first_path_var_separator(const char *pathlist);
 extern void join_path_components(char *ret_path,
                                  const char *head, const char *tail);
 extern void canonicalize_path(char *path);
+extern void canonicalize_path_enc(char *path, int encoding);
 extern void make_native_path(char *filename);
 extern void cleanup_path(char *path);
 extern bool path_contains_parent_reference(const char *path);
diff --git a/src/port/path.c b/src/port/path.c
index de4df6cd78..02ab3188db 100644
--- a/src/port/path.c
+++ b/src/port/path.c
@@ -36,6 +36,7 @@
 #include <unistd.h>
 #endif

+#include "mb/pg_wchar.h"
 #include "pg_config_paths.h"


@@ -45,6 +46,10 @@
 #define IS_PATH_VAR_SEP(ch) ((ch) == ';')
 #endif

+#ifdef WIN32
+static void debackslash_path(char *path, int encoding);
+static int    pg_sjis_mblen(const unsigned char *s);
+#endif
 static void make_relative_path(char *ret_path, const char *target_path,
                                const char *bin_path, const char *my_exec_path);
 static char *trim_directory(char *path);
@@ -149,10 +154,73 @@ last_dir_separator(const char *filename)
 }


+#ifdef WIN32
+
+/*
+ * Convert '\' to '/' within the given path, assuming the path
+ * is in the specified encoding.
+ */
+static void
+debackslash_path(char *path, int encoding)
+{
+    char       *p;
+
+    /*
+     * Of the supported encodings, only Shift-JIS has multibyte characters
+     * that can include a byte equal to '\' (0x5C).  So rather than implement
+     * a fully encoding-aware conversion, we special-case SJIS.  (Invoking the
+     * general encoding-aware logic in wchar.c is impractical here for
+     * assorted reasons.)
+     */
+    if (encoding == PG_SJIS)
+    {
+        for (p = path; *p; p += pg_sjis_mblen((const unsigned char *) p))
+        {
+            if (*p == '\\')
+                *p = '/';
+        }
+    }
+    else
+    {
+        for (p = path; *p; p++)
+        {
+            if (*p == '\\')
+                *p = '/';
+        }
+    }
+}
+
 /*
- *    make_native_path - on WIN32, change / to \ in the path
+ * SJIS character length
  *
- *    This effectively undoes canonicalize_path.
+ * This must match the behavior of
+ *        pg_encoding_mblen_bounded(PG_SJIS, s)
+ * In particular, unlike the version of pg_sjis_mblen in src/common/wchar.c,
+ * do not allow caller to accidentally step past end-of-string.
+ */
+static int
+pg_sjis_mblen(const unsigned char *s)
+{
+    int            len;
+
+    if (*s >= 0xa1 && *s <= 0xdf)
+        len = 1;                /* 1 byte kana? */
+    else if (IS_HIGHBIT_SET(*s) && s[1] != '\0')
+        len = 2;                /* kanji? */
+    else
+        len = 1;                /* should be ASCII */
+    return len;
+}
+
+#endif                            /* WIN32 */
+
+
+/*
+ *    make_native_path - on WIN32, change '/' to '\' in the path
+ *
+ *    This reverses the '\'-to-'/' transformation of debackslash_path.
+ *    We need not worry about encodings here, since '/' does not appear
+ *    as a byte of a multibyte character in any supported encoding.
  *
  *    This is required because WIN32 COPY is an internal CMD.EXE
  *    command and doesn't process forward slashes in the same way
@@ -182,13 +250,14 @@ make_native_path(char *filename)
  * on Windows. We need them to use filenames without spaces, for which a
  * short filename is the safest equivalent, eg:
  *        C:/Progra~1/
+ *
+ * Presently, this is only used on paths that we can assume are in a
+ * backend-safe encoding, so there's no need for an encoding-aware variant.
  */
 void
 cleanup_path(char *path)
 {
 #ifdef WIN32
-    char       *ptr;
-
     /*
      * GetShortPathName() will fail if the path does not exist, or short names
      * are disabled on this file system.  In both cases, we just return the
@@ -198,11 +267,8 @@ cleanup_path(char *path)
     GetShortPathName(path, path, MAXPGPATH - 1);

     /* Replace '\' with '/' */
-    for (ptr = path; *ptr; ptr++)
-    {
-        if (*ptr == '\\')
-            *ptr = '/';
-    }
+    /* All server-safe encodings are alike here, so just use PG_SQL_ASCII */
+    debackslash_path(path, PG_SQL_ASCII);
 #endif
 }

@@ -253,6 +319,8 @@ typedef enum
 } canonicalize_state;

 /*
+ * canonicalize_path()
+ *
  *    Clean up path by:
  *        o  make Win32 path use Unix slashes
  *        o  remove trailing quote on Win32
@@ -260,9 +328,20 @@ typedef enum
  *        o  remove duplicate (adjacent) separators
  *        o  remove '.' (unless path reduces to only '.')
  *        o  process '..' ourselves, removing it if possible
+ *    Modifies path in-place.
+ *
+ * This comes in two variants: encoding-aware and not.  The non-aware version
+ * is only safe to use on strings that are in a server-safe encoding.
  */
 void
 canonicalize_path(char *path)
+{
+    /* All server-safe encodings are alike here, so just use PG_SQL_ASCII */
+    canonicalize_path_enc(path, PG_SQL_ASCII);
+}
+
+void
+canonicalize_path_enc(char *path, int encoding)
 {
     char       *p,
                *to_p;
@@ -278,17 +357,15 @@ canonicalize_path(char *path)
     /*
      * The Windows command processor will accept suitably quoted paths with
      * forward slashes, but barfs badly with mixed forward and back slashes.
+     * Hence, start by converting all back slashes to forward slashes.
      */
-    for (p = path; *p; p++)
-    {
-        if (*p == '\\')
-            *p = '/';
-    }
+    debackslash_path(path, encoding);

     /*
      * In Win32, if you do: prog.exe "a b" "\c\d\" the system will pass \c\d"
      * as argv[2], so trim off trailing quote.
      */
+    p = path + strlen(path);
     if (p > path && *(p - 1) == '"')
         *(p - 1) = '/';
 #endif