Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8 - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
Msg-id 20201030.165638.1664587537743852598.horikyota.ntt@gmail.com
In response to Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
At Fri, 30 Oct 2020 16:33:01 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Fri, 30 Oct 2020 14:38:30 +0900, Amit Langote <amitlangote09@gmail.com> wrote in 
> I'm not sure how we should construct our won mapping, but the
> difference made by we simply moved to JIS0208.TXT based as Ishii-san
> suggested the differences in the mapping would be as the follows.

Mmm..

I'm not sure how we should construct our own mapping, but if we
simply moved to a JIS0208.TXT-based mapping as Ishii-san suggested,
the following differences would be seen in the mappings.

> 1. The following codes (regions) are not defined in JIS0208.
> 
>      8ea1 - 8edf      (up to 64 characters (I didn't actually counted them.))
>      ada1 - adfc      (up to 92 characters (ditto))
>      8ff3f3 - 8ff4a8  (up to 182 characters (ditto))

  8ea1 - 8edf      (63 chars, U+ff61 - U+ff9f) (hankaku-kana)
  ada1 - adfc      (83 chars, U+2460 - U+33a1) (circled numbers)
  8ff3f3 - 8ff4a8  (20 chars, U+2160 - U+2179) (roman numerals)

>      a1c0  ff3c: (ff3c: FULLWIDTH REVERSE SOLIDUS)
>    8ff4aa  ff07: (ff07: FULLWIDTH APOSTROPHE)
> 
> 2. some individual differences
> 
>    EUC  0208  932
>    a1c1 301c ff5e: (301c:WAVE DASH)
>    a1c2 2016 2225: (2016:DOUBLE_VERTICAL LINE) : (2225:PARALLEL TO)
> *  a1dd 2212 ff0d: (2212: MINUS_SIGN) : (ff0d: FULLWIDTH HYPHEN-MINUS)
>    d1f1   a2 ffe0: (00a2: CENT SIGN) :  (ffe0: FULLWIDTH CENT SIGN)
>    d1f2   a3 ffe1: (00a3: PUND SIGN) :  (ffe1: FULLWIDTH POUND SIGN)
>    a2cc   ac ffe2: (00ac: NOT SIGN)  :  (ffe2: FULLWIDTH NOT SIGN)
> 
> 
> *1: https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
> 
> > > > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > > > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > > > MINUS SIGN in SJIS and that is what we expect. Isn't it?
> > >
> > > I think we don't change authoritative mappings, but maybe can add some
> > > one-way conversions for the convenience.
> > 
> > Maybe UCS_TO_EUC_JP.pl could do something like the above.
> > 
> > Are there other cases that were fixed like this in the past, either
> > for euc_jp or sjis?
> 
> Honestly, I don't know how the mapping was decided in 2002, but
> removing the regions in 1 would cause confusion.  So what we can do in
> this area would be chaning some of 2 to 0208 mapping.  But arbitrary
> mixture of different mapings would cause new problem..

 I forgot about adding one-way mappings.  I think we can add several
 such mappings, for example:

 U+301c->:   EUC:a1c1 <-> U+ff5e
 U+2016->:   EUC:a1c2 <-> U+2225
 U+2212->:   EUC:a1dd <-> U+ff0d
 U+00a2->:   EUC:d1f1 <-> U+ffe0
 U+00a3->:   EUC:d1f2 <-> U+ffe1
 U+00ac->:   EUC:a2cc <-> U+ffe2
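
To make the idea concrete, here is a rough sketch of how one-way entries
could sit beside the existing round-trip pairs.  It is not PostgreSQL code
and not what UCS_TO_EUC_JP.pl does today; the table and function names are
hypothetical.  It covers only a subset of the pairs, with the one-way
source code points taken from the 0208 column of the table of individual
differences (U+301c WAVE DASH for a1c1).

```python
# Sketch only: names are hypothetical, not taken from PostgreSQL.

# Existing round-trip pairs (EUC-JP code -> UCS code point).
ROUND_TRIP = {
    0xA1C1: 0xFF5E,  # FULLWIDTH TILDE
    0xA1C2: 0x2225,  # PARALLEL TO
    0xA1DD: 0xFF0D,  # FULLWIDTH HYPHEN-MINUS
    0xA2CC: 0xFFE2,  # FULLWIDTH NOT SIGN
}

# Proposed one-way additions (UCS code point -> EUC-JP code only).
ONE_WAY_FROM_UCS = {
    0x301C: 0xA1C1,  # WAVE DASH
    0x2016: 0xA1C2,  # DOUBLE VERTICAL LINE
    0x2212: 0xA1DD,  # MINUS SIGN
    0x00AC: 0xA2CC,  # NOT SIGN
}

# Decoding uses only the round-trip pairs, so it is unchanged.
EUC_TO_UCS = dict(ROUND_TRIP)

# Encoding accepts the round-trip pairs plus the one-way additions.
UCS_TO_EUC = {ucs: euc for euc, ucs in ROUND_TRIP.items()}
UCS_TO_EUC.update(ONE_WAY_FROM_UCS)


def ucs_to_euc(code_point: int) -> int:
    """Return the EUC-JP code for a UCS code point (KeyError if unmapped)."""
    return UCS_TO_EUC[code_point]


def euc_to_ucs(euc_code: int) -> int:
    """Return the UCS code point for an EUC-JP code (KeyError if unmapped)."""
    return EUC_TO_UCS[euc_code]


if __name__ == "__main__":
    # MINUS SIGN is accepted on input, but a1dd still decodes to U+ff0d.
    print(f"U+2212 -> EUC:{ucs_to_euc(0x2212):04x}")
    print(f"EUC:a1dd -> U+{euc_to_ucs(0xA1DD):04x}")
```

With tables like these, U+2212 is accepted when converting to EUC-JP and
stored as a1dd, while a1dd still reads back as U+ff0d, so the existing
round-trip behaviour is left as it is.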

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center