> However, running the script with that doesn't produce exactly what we
> have in utf8_to_sjis.map, either. It's otherwise same, but we have
> some extra mappings:
>
> - {0xc2a5, 0x5c},
0xc2a5 is U+00a5. The glyph is "YEN SIGN", which corresponds to
0x5c in SJIS. So this is a valid mapping.
Meanwhile, Microsoft maps U+005c to 0x5c in CP932. The
glyph of U+005c is "REVERSE SOLIDUS" (backslash), so in effect MS
decided that the glyph of U+005c is "YEN SIGN" in CP932!
In summary, we need to keep both mappings:
U+00a5 (UTF-8 0xc2a5) -> 0x5c and U+005c -> 0x5c.
Obviously this breaks round-trip conversion between UTF8 and SJIS
for this character, though.
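
To make the collision concrete, here is a minimal Python sketch. The two
dictionaries are hypothetical stand-ins holding only the two entries
discussed above; they are not the real contents of utf8_to_sjis.map or of
the reverse map, and the choice of U+005c as the reverse target is an
assumption made for illustration.

```python
# Hypothetical two-entry excerpt: Unicode code point -> SJIS byte value.
utf8_to_sjis = {
    0x00A5: 0x5C,  # YEN SIGN (UTF-8 0xc2a5), the Shift_JIS reading
    0x005C: 0x5C,  # REVERSE SOLIDUS, the CP932 reading
}

# A reverse map can hold only one code point per SJIS byte.
# Assumption for illustration: 0x5c is mapped back to U+005C.
sjis_to_utf8 = {0x5C: 0x005C}


def round_trips(cp):
    """True if cp survives UTF8 -> SJIS -> UTF8 unchanged."""
    return sjis_to_utf8[utf8_to_sjis[cp]] == cp


survivors = [cp for cp in utf8_to_sjis if round_trips(cp)]
lost = [cp for cp in utf8_to_sjis if not round_trips(cp)]

print("round-trips:", [f"U+{cp:04X}" for cp in survivors])
print("lost:       ", [f"U+{cp:04X}" for cp in lost])
```

Whichever code point the reverse map picks for 0x5c, the other one cannot
be recovered after a conversion to SJIS and back.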
> - {0xc2ac, 0x81ca},
U+00ac (NOT SIGN). Exists in SJIS.
> - {0xe28096, 0x8161},
U+2016 (DOUBLE VERTICAL LINE). Exists in SJIS.
> - {0xe280be, 0x7e},
U+203e (OVERLINE). Mapped to ASCII 0x7e, which is "half width tilde".
> - {0xe28892, 0x817c},
U+2212 (MINUS SIGN). Mapped to "double width minus sign" in SJIS.
> - {0xe3809c, 0x8160},
U+301c (WAVE DASH). Mapped to "double width wave dash" in SJIS.
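
The code points for the quoted map keys can be double-checked by decoding
each key as a UTF-8 byte sequence. A small Python sketch using only the
standard library; the helper name utf8_key_to_codepoint is made up for
this example and does not exist in the PostgreSQL tree.

```python
import unicodedata


def utf8_key_to_codepoint(key):
    """Read an integer map key such as 0xe280be as UTF-8 bytes
    and return the single code point it encodes."""
    nbytes = (key.bit_length() + 7) // 8
    return ord(key.to_bytes(nbytes, "big").decode("utf-8"))


# Keys taken from the quoted map entries.
keys = [0xC2A5, 0xC2AC, 0xE28096, 0xE280BE, 0xE28892, 0xE3809C]

for key in keys:
    cp = utf8_key_to_codepoint(key)
    print(f"0x{key:x} -> U+{cp:04X} {unicodedata.name(chr(cp))}")
```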
> Those mappings were added in commit
> a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, back in 2002. The bogus
> mapping for the invalid 0xc19c UTF-8 byte sequence was also added by
> that commit, as well a few valid mappings that UCS_to_SJIS.pl also
> produces.
>
> I can't judge if those mappings make sense. If we can't find an
> authoritative source for them, I suggest that we leave them as they
> are, but also hard-code them to UCS_to_SJIS.pl, so that running that
> script produces those mappings in utf8_to_sjis.map, even though they
> are not present in the CP932.TXT source file.
Sounds acceptable.
In summary, the current PostgreSQL UTF8 <--> SJIS mapping is something of
a mixture of SJIS (Shift_JIS) and MS932. There's no cleaner way to
escape this situation. I think we need to live with it.
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp