Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8 - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
Msg-id 20201030.163301.1642530409261404167.horikyota.ntt@gmail.com
In response to Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8  (Amit Langote <amitlangote09@gmail.com>)
Responses Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
At Fri, 30 Oct 2020 14:38:30 +0900, Amit Langote <amitlangote09@gmail.com> wrote in 
> On Fri, Oct 30, 2020 at 12:20 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > So ping-pong between Unicode and SJIS behaves like this:
> >
> > U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...
> 
> Is it the following piece of code in UCS_TO_SJIS.pl that manually adds
> the mapping?

Yes.

> # Add these UTF8->SJIS pairs to the table.
> push @$mapping,
> ...
>     {
>         direction => FROM_UNICODE,
>         ucs       => 0x2212,
>         code      => 0x817c,
>         comment   => '# MINUS SIGN',
>         f         => $this_script,
>         l         => __LINE__
>     },
> 
> Given that U+2212 is encoded by e28892 in utf8, I assume that's how
> utf8_to_sjis.map ends up with the following mapping into sjis for that
> byte sequence:
> 
>   /*** Three byte table, leaf: e288xx - offset 0x004ee ***/
> 
>   /* 80 */  0x81cd, 0x0000, 0x81dd, 0x81ce, 0x0000, 0x0000, 0x0000, 0x81de,
>   /* 88 */  0x81b8, 0x0000, 0x0000, 0x81b9, 0x0000, 0x0000, 0x0000, 0x0000,
>   /* 90 */  0x0000, 0x8794, "0x817c", ...
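
The byte arithmetic and the ping-pong described above can be replayed with a minimal Python sketch. It does not read the real generated .map files: `TO_SJIS` and `FROM_SJIS` are hypothetical miniature tables that hold only the entries mentioned in this thread, and `utf8_hex` and `ping_pong` are illustrative helper names.

```python
# Hypothetical miniature tables; only the entries discussed in this thread.
TO_SJIS = {0x2212: 0x817C, 0xFF0D: 0x817C}   # Unicode -> SJIS (two sources, one target)
FROM_SJIS = {0x817C: 0xFF0D}                 # SJIS -> Unicode


def utf8_hex(ucs):
    """Return the UTF-8 encoding of a code point as a hex string."""
    return chr(ucs).encode("utf-8").hex()


def ping_pong(ucs, rounds=2):
    """Convert Unicode -> SJIS -> Unicode repeatedly and record every step."""
    steps = [ucs]
    for _ in range(rounds):
        code = TO_SJIS[steps[-1]]
        steps.append(code)
        steps.append(FROM_SJIS[code])
    return steps


# The last UTF-8 byte is the index into the e288xx leaf table quoted above.
print(utf8_hex(0x2212))
print([hex(v) for v in ping_pong(0x2212)])
```

With these toy tables, U+2212 never comes back from the round trip: after the first conversion the value alternates between 0x817c and U+ff0d.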

I'm not sure how we should construct our own mapping, but if we simply
moved to a JIS0208.TXT-based (*1) mapping as Ishii-san suggested, the
differences in the mapping would be as follows.

1. The following codes (regions) are not defined in JIS0208.

     8ea1 - 8edf      (up to 64 characters (I didn't actually count them.))
     ada1 - adfc      (up to 92 characters (ditto))
     8ff3f3 - 8ff4a8  (up to 182 characters (ditto))

     a1c0  ff3c: (ff3c: FULLWIDTH REVERSE SOLIDUS)
   8ff4aa  ff07: (ff07: FULLWIDTH APOSTROPHE)
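
The "up to" figures above are upper bounds. A rough way to count the available slots in each region is sketched below, assuming that every byte after the lead byte of an EUC-JP code lies in 0xa1..0xfe. `count_euc_codes` is a hypothetical helper name, and the result counts code slots, not characters that are actually assigned.

```python
def count_euc_codes(first, last):
    """Count codes in [first, last] whose trailing bytes all lie in 0xA1..0xFE."""
    count = 0
    for code in range(first, last + 1):
        nbytes = (code.bit_length() + 7) // 8
        raw = code.to_bytes(nbytes, "big")
        # The lead byte is accepted as-is; only the trailing bytes are checked.
        if all(0xA1 <= b <= 0xFE for b in raw[1:]):
            count += 1
    return count


for first, last in [(0x8EA1, 0x8EDF), (0xADA1, 0xADFC), (0x8FF3F3, 0x8FF4A8)]:
    print(f"{first:x} - {last:x}: {count_euc_codes(first, last)} slots")
```

Under that assumption the three-byte region is much smaller than the plain numeric difference suggests, because most values between 8ff3f3 and 8ff4a8 are not valid byte sequences.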

2. Some individual differences

   EUC  0208  932
   a1c1 301c ff5e: (301c: WAVE DASH) : (ff5e: FULLWIDTH TILDE)
   a1c2 2016 2225: (2016: DOUBLE VERTICAL LINE) : (2225: PARALLEL TO)
*  a1dd 2212 ff0d: (2212: MINUS SIGN) : (ff0d: FULLWIDTH HYPHEN-MINUS)
   d1f1   a2 ffe0: (00a2: CENT SIGN) : (ffe0: FULLWIDTH CENT SIGN)
   d1f2   a3 ffe1: (00a3: POUND SIGN) : (ffe1: FULLWIDTH POUND SIGN)
   a2cc   ac ffe2: (00ac: NOT SIGN) : (ffe2: FULLWIDTH NOT SIGN)
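
The character names in the table above can be cross-checked against Python's bundled Unicode database. This is only a sketch: `DIFFS` is copied by hand from the table (EUC code, 0208 column, 932 column) and `describe` is a hypothetical helper; nothing here consults PostgreSQL's conversion tables.

```python
import unicodedata

# (EUC-JP code, code point in the 0208 column, code point in the 932 column),
# copied verbatim from the table above.
DIFFS = [
    (0xA1C1, 0x301C, 0xFF5E),
    (0xA1C2, 0x2016, 0x2225),
    (0xA1DD, 0x2212, 0xFF0D),
    (0xD1F1, 0x00A2, 0xFFE0),
    (0xD1F2, 0x00A3, 0xFFE1),
    (0xA2CC, 0x00AC, 0xFFE2),
]


def describe(cp):
    """Format a code point as 'U+XXXX NAME' using the Unicode database."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"


for euc, jis0208, cp932 in DIFFS:
    print(f"{euc:04x}: {describe(jis0208)}  vs  {describe(cp932)}")
```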


*1: https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT

> > > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > > MINUS SIGN in SJIS and that is what we expect. Isn't it?
> >
> > I think we don't change authoritative mappings, but maybe can add some
> > one-way conversions for the convenience.
> 
> Maybe UCS_TO_EUC_JP.pl could do something like the above.
> 
> Are there other cases that were fixed like this in the past, either
> for euc_jp or sjis?

Honestly, I don't know how the mapping was decided in 2002, but
removing the regions in 1 would cause confusion.  So what we can do in
this area would be changing some of the entries in 2 to the 0208
mapping.  But an arbitrary mixture of different mappings would cause
new problems.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
