Re: Illegal SJIS mapping - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Illegal SJIS mapping
Msg-id 9c544547-7214-aebe-9b04-57624aedde96@iki.fi
In response to Illegal SJIS mapping  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: Illegal SJIS mapping  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Re: Illegal SJIS mapping  (Tatsuo Ishii <ishii@sraoss.co.jp>)
List pgsql-hackers
On 09/07/2016 09:50 AM, Kyotaro HORIGUCHI wrote:
> Hi,
>
> I found a useless entry in utf8_to_sjis.map
>
>>  {0xc19c, 0x815f},
>
> which is apparently illegal as UTF-8 which postgresql
> deliberately refuses. So it should be removed and the attached
> patch does that. 0x815f(SJIS) is also mapped from 0xefbcbc(U+FF3C
> FULLWIDTH REVERSE SOLIDUS) and it is a right mapping.

Yes, I think you're right. Committed, thanks!
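
To illustrate why 0xc19c is rejected, here is a small Python sketch. It uses Python's strict UTF-8 decoder as a stand-in for PostgreSQL's own validation, which is an assumption made only for illustration; the helper names are hypothetical. 0xC1 can only start an overlong two-byte form of an ASCII character, and in this case the payload bits spell out U+005C, the plain backslash.

```python
# Illustration only: Python's decoder stands in for PostgreSQL's UTF-8 check.
bogus = bytes([0xC1, 0x9C])        # key of the removed map entry
valid = bytes([0xEF, 0xBC, 0xBC])  # U+FF3C FULLWIDTH REVERSE SOLIDUS


def naive_two_byte_decode(b):
    """Decode a 2-byte sequence WITHOUT rejecting overlong forms."""
    return ((b[0] & 0x1F) << 6) | (b[1] & 0x3F)


def is_valid_utf8(b):
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


overlong_cp = naive_two_byte_decode(bogus)
print(hex(overlong_cp))                  # 0x5c, i.e. an overlong backslash
print(is_valid_utf8(bogus))              # False
print(hex(ord(valid.decode("utf-8"))))   # 0xff3c
```

So the removed entry could never match any input that passes UTF-8 validation, while the 0xefbcbc entry still covers 0x815f.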

> By the way, the file comment at the beginning of UCS_to_SJIS.pl
> is the following.
>
> # Generate UTF-8 <--> SJIS code conversion tables from
> # map files provided by Unicode organization.
> # Unfortunately it is prohibited by the organization
> # to distribute the map files. So if you try to use this script,
> # you have to obtain SHIFTJIS.TXT from
> # the organization's ftp site.
>
> The file was found at the following place thanks to google.
>
> ftp://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/
>
> As the URL is showing, or as written in the file
> Public/MAPPINGS/EASTASIA/ReadMe.txt, it is already obsolete and
> the *live* definition *may* be found in Unicode Character
> Database. But I haven't found SJIS-related information there.
>
> If I'm not missing anything, the only available authority would
> be JIS X 0208/0213 but what should be implemented seems to be
> maybe-modified MS932 for which I don't know the authority.
>
> Anyway I ran UCS_to_SJIS.pl with the SHIFTJIS.TXT above and I got
> a quite different mapping files from the current ones.
>
> So, I wonder how the mappings related to SJIS (and/or EUC-JP) are
> maintained. If no authoritative information is available, the
> generating script is no longer usable. If any other authority is
> chosen, it is to be modified according to whatever the new
> source format is.

The script is clearly intended to read CP932.TXT, rather than 
SHIFTJIS.TXT, despite the comments in it. CP932.TXT can be found at

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

However, running the script with that doesn't produce exactly what we
have in utf8_to_sjis.map, either. It's otherwise the same, but we have some
extra mappings:

-  {0xc2a5, 0x5c},
-  {0xc2ac, 0x81ca},
-  {0xe28096, 0x8161},
-  {0xe280be, 0x7e},
-  {0xe28892, 0x817c},
-  {0xe3809c, 0x8160},

Those mappings were added in commit
a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, back in 2002. The bogus
mapping for the invalid 0xc19c UTF-8 byte sequence was also added by
that commit, as well as a few valid mappings that UCS_to_SJIS.pl also produces.
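
For reference, the six extra keys listed above can be decoded back to code points with a short Python sketch. The helper name utf8_key_to_char is made up for this example; the keys and SJIS values are copied from the list.

```python
import unicodedata

# (UTF-8 bytes packed into an integer, SJIS code), as in utf8_to_sjis.map
EXTRA = [
    (0xC2A5, 0x5C),
    (0xC2AC, 0x81CA),
    (0xE28096, 0x8161),
    (0xE280BE, 0x7E),
    (0xE28892, 0x817C),
    (0xE3809C, 0x8160),
]


def utf8_key_to_char(key):
    """Turn a packed UTF-8 key such as 0xe28096 back into a character."""
    nbytes = (key.bit_length() + 7) // 8
    return key.to_bytes(nbytes, "big").decode("utf-8")


decoded = [(utf8_key_to_char(key), sjis) for key, sjis in EXTRA]
for ch, sjis in decoded:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> SJIS 0x{sjis:x}")
```

This only names the characters involved (U+00A5, U+00AC, U+2016, U+203E, U+2212, U+301C); it does not settle whether the mappings are the right ones.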

I can't judge if those mappings make sense. If we can't find an
authoritative source for them, I suggest that we leave them as they are,
but also hard-code them in UCS_to_SJIS.pl, so that running that script
produces those mappings in utf8_to_sjis.map, even though they are not
present in the CP932.TXT source file.
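
A rough sketch of that idea, written in Python rather than as a patch to the Perl script: parse the mapping file, then merge in a hard-coded table of extras before emitting the map. The function names, the two sample lines standing in for CP932.TXT, and the output formatting are all assumptions for illustration, not the behaviour of UCS_to_SJIS.pl.

```python
# Hypothetical stand-in for a few lines of CP932.TXT:
# SJIS code, Unicode code point, then a '#' comment.
SAMPLE = """\
# sample header comment
0x815F\t0xFF3C\t# FULLWIDTH REVERSE SOLIDUS
0x8160\t0xFF5E\t# FULLWIDTH TILDE
"""

# Extra mappings that are not in the source file, keyed by packed UTF-8 bytes.
EXTRA_UTF8_TO_SJIS = {
    0xC2A5: 0x5C,
    0xC2AC: 0x81CA,
    0xE28096: 0x8161,
    0xE280BE: 0x7E,
    0xE28892: 0x817C,
    0xE3809C: 0x8160,
}


def utf8_key(codepoint):
    """Pack the UTF-8 encoding of a code point into one integer."""
    return int.from_bytes(chr(codepoint).encode("utf-8"), "big")


def build_map(lines, extras):
    """Build a UTF-8 -> SJIS table from mapping lines plus hard-coded extras."""
    table = {}
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # comment, blank line, or undefined entry
        sjis, ucs = int(fields[0], 16), int(fields[1], 16)
        table[utf8_key(ucs)] = sjis
    for key, sjis in extras.items():
        table.setdefault(key, sjis)  # entries from the source file win
    return table


def format_map(table):
    """Render entries in the style used in the .map file, sorted by key."""
    return [f"  {{0x{key:x}, 0x{table[key]:x}}}," for key in sorted(table)]


table = build_map(SAMPLE.splitlines(), EXTRA_UTF8_TO_SJIS)
print("\n".join(format_map(table)))
```

Keeping the extras in one table inside the script means the generated file stays reproducible, and the table is the single place to revisit if an authoritative source turns up.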

- Heikki



