Re: Illegal SJIS mapping - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: Illegal SJIS mapping
Date
Msg-id 20161018.131042.13229590.horiguchi.kyotaro@lab.ntt.co.jp
In response to Re: Illegal SJIS mapping  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
Hello,

At Fri, 7 Oct 2016 23:58:45 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<9c544547-7214-aebe-9b04-57624aedde96@iki.fi>
> > So, I wonder how the mappings related to SJIS (and/or EUC-JP) are
> > maintained. If no authoritative information is available, the
> > generating script is no longer usable. If any other authority is
> > chosen, it is to be modified according to whatever the new
> > source format is.
> 
> The script is clearly intended to read CP932.TXT, rather than
> SHIFTJIS.TXT, despite the comments in it. CP932.TXT can be found at
> 
> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
> 
> However, running the script with that doesn't produce exactly what we
> have in utf8_to_sjis.map, either. It's otherwise the same, but we have
> some extra mappings:
> 
> -  {0xc2a5, 0x5c},
> -  {0xc2ac, 0x81ca},
> -  {0xe28096, 0x8161},
> -  {0xe280be, 0x7e},
> -  {0xe28892, 0x817c},
> -  {0xe3809c, 0x8160},
> 
> Those mappings were added in commit
> a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, back in 2002. The bogus
> mapping for the invalid 0xc19c UTF-8 byte sequence was also added by
> that commit, as well as a few valid mappings that UCS_to_SJIS.pl also
> produces.
> 
> I can't judge if those mappings make sense. If we can't find an
> authoritative source for them, I suggest that we leave them as they

These mappings exist for historical reasons: the Unicode
definition, the Oracle and Microsoft implementations, and the
evolving Unicode specification have disagreed with one another.
As a result, several SJIS (and EUC-JP) characters have two or
more mappings to Unicode, and there are also several variations
of the opposite mapping. None of them is authoritative, and which
one to adopt depends on system requirements. The only requirement
that PostgreSQL should keep seems to be round-trip consistency
starting from SJIS input.
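
To make the round-trip requirement concrete, here is a minimal
Python sketch (not PostgreSQL code). The tables hold only a handful
of entries: the SJIS-to-Unicode side follows what I understand
CP932.TXT to assign for these codes, and the Unicode-to-SJIS side
adds the six extra mappings quoted above, written as code points
rather than UTF-8 bytes. The names SJIS_TO_UCS, UCS_TO_SJIS and
round_trip_failures are made up for illustration.

```python
# Illustrative tables only: a few entries, not the real map files.
# SJIS -> Unicode, assumed to match CP932.TXT for these codes.
SJIS_TO_UCS = {
    0x5C: 0x005C,    # REVERSE SOLIDUS
    0x7E: 0x007E,    # TILDE
    0x8160: 0xFF5E,  # FULLWIDTH TILDE
    0x8161: 0x2225,  # PARALLEL TO
    0x817C: 0xFF0D,  # FULLWIDTH HYPHEN-MINUS
    0x81CA: 0xFFE2,  # FULLWIDTH NOT SIGN
}

# The six extra Unicode -> SJIS mappings from this thread, as code points.
EXTRA_UCS_TO_SJIS = {
    0x00A5: 0x5C,    # YEN SIGN
    0x00AC: 0x81CA,  # NOT SIGN
    0x2016: 0x8161,  # DOUBLE VERTICAL LINE
    0x203E: 0x7E,    # OVERLINE
    0x2212: 0x817C,  # MINUS SIGN
    0x301C: 0x8160,  # WAVE DASH
}

# Unicode -> SJIS: the inverse of SJIS_TO_UCS plus the extra entries.
UCS_TO_SJIS = {ucs: sjis for sjis, ucs in SJIS_TO_UCS.items()}
UCS_TO_SJIS.update(EXTRA_UCS_TO_SJIS)


def round_trip_failures(sjis_to_ucs, ucs_to_sjis):
    """Return the SJIS codes that do not survive SJIS -> Unicode -> SJIS."""
    return [
        sjis
        for sjis, ucs in sjis_to_ucs.items()
        if ucs_to_sjis.get(ucs) != sjis
    ]


if __name__ == "__main__":
    print(round_trip_failures(SJIS_TO_UCS, UCS_TO_SJIS))
```

With these tables every SJIS code survives SJIS -> Unicode -> SJIS.
U+301C does not survive the opposite direction (it comes back as
U+FF5E), which is acceptable under the requirement as stated.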

> are, but also hard-code them to UCS_to_SJIS.pl, so that running that
> script produces those mappings in utf8_to_sjis.map, even though they
> are not present in the CP932.TXT source file.

Agreed. I will do that, at least for the Japanese charsets.
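
As a rough illustration of the hard-coding idea, here is a Python
sketch. The real script, UCS_to_SJIS.pl, is Perl, and this is not
its code; the function names are hypothetical. It merges the six
extra entries into a generated mapping, rejects keys that are not
valid UTF-8 (0xc19c would be refused), and prints lines in the same
shape as the map entries quoted above.

```python
# Extra UTF-8 -> SJIS entries, with keys as packed UTF-8 byte sequences.
EXTRA_MAPPINGS = [
    (0xc2a5, 0x5c),
    (0xc2ac, 0x81ca),
    (0xe28096, 0x8161),
    (0xe280be, 0x7e),
    (0xe28892, 0x817c),
    (0xe3809c, 0x8160),
]


def packed_utf8_to_codepoint(packed):
    """Decode a UTF-8 sequence packed into an int, e.g. 0xe3809c."""
    raw = packed.to_bytes((packed.bit_length() + 7) // 8, "big")
    text = raw.decode("utf-8")  # raises UnicodeDecodeError if invalid
    if len(text) != 1:
        raise ValueError("expected exactly one character: 0x%x" % packed)
    return ord(text)


def merge_mappings(generated, extra):
    """Return generated plus extra, refusing invalid or conflicting keys."""
    merged = dict(generated)
    for utf8, sjis in extra:
        packed_utf8_to_codepoint(utf8)  # validation only
        if utf8 in merged and merged[utf8] != sjis:
            raise ValueError("conflicting mapping for 0x%x" % utf8)
        merged[utf8] = sjis
    return merged


def format_map(mapping):
    """Format entries like '  {0xc2a5, 0x5c},', sorted by UTF-8 key."""
    return ["  {0x%x, 0x%x}," % item for item in sorted(mapping.items())]


if __name__ == "__main__":
    # Stand-in for what the source file would yield: U+FF5E -> 0x8160.
    generated = {0xefbd9e: 0x8160}
    print("\n".join(format_map(merge_mappings(generated, EXTRA_MAPPINGS))))
```

Keeping the extra entries in one list next to the generator means a
regenerated map stays identical to the current one, and a conflict
with the source file shows up as an error instead of a silent change.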

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
