Thread: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
From
Ashutosh Sharma
Date:
Hi All,

Today while working on some other task related to database encoding, I noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in UTF-8. See below:

postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
 convert
----------
 \xefbc8d
(1 row)

Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH HYPHEN-MINUS?

When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is converted to EUC-JP, the convert function fails with an error saying: "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no equivalent in encoding EUC_JP". See below:

postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
ERROR: character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8" has no equivalent in encoding "EUC_JP"

However, when the same MINUS SIGN in UTF-8 is converted to SJIS encoding, the convert function returns the correct result. See below:

postgres=# select convert('\xe28892', 'utf-8', 'sjis');
 convert
---------
 \x817c
(1 row)

Please note that the byte sequence (81-7c) in SJIS represents MINUS SIGN in SJIS, which means the MINUS SIGN in UTF8 got converted to the MINUS SIGN in SJIS, and that is what we expect. Isn't it?

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
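For readers who want to cross-check the UTF-8 byte sequences quoted in this report, here is a minimal Python sketch. It is not part of the original message, and the helper name `utf8_hex` is made up for illustration. It uses only Python's built-in UTF-8 encoder and says nothing about PostgreSQL's EUC-JP or SJIS maps.

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of a single code point as a hex string."""
    return chr(code_point).encode("utf-8").hex()


# The two code points discussed in the report.
MINUS_SIGN = 0x2212
FULLWIDTH_HYPHEN_MINUS = 0xFF0D

if __name__ == "__main__":
    for name, cp in [("MINUS SIGN", MINUS_SIGN),
                     ("FULLWIDTH HYPHEN-MINUS", FULLWIDTH_HYPHEN_MINUS)]:
        print(f"U+{cp:04X} {name}: {utf8_hex(cp)}")
```

The two results correspond to the byte sequences e2-88-92 and ef-bc-8d mentioned above.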
Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
From
Amit Langote
Date:
On Fri, Oct 30, 2020 at 9:44 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi All,
>
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
>
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
>  convert
> ----------
>  \xefbc8d
> (1 row)
>
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS?
>
> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert function fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR: character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"
>
> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
>  convert
> ---------
>  \x817c
> (1 row)
>
> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?

So we have a1dd in euc_jp, 817c in sjis, and efbc8d in utf-8, which convert between each other just fine.

But when it comes to e28892 in utf-8, it currently only converts to sjis, and that too just one way:

select convert('\xe28892', 'utf-8', 'sjis');
 convert
---------
 \x817c
(1 row)

select convert('\x817c', 'sjis', 'utf-8');
 convert
----------
 \xefbc8d
(1 row)

I noticed that the commit a8bd7e1c6e02 from ages ago removed conversions from and to utf-8's e28892, in favor of efbc8d, and that change has stuck. (Note though that these maps looked pretty different back then.)

--- a/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
+++ b/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
- {0xa1dd, 0xe28892},
+ {0xa1dd, 0xefbc8d},

--- a/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
+++ b/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
- {0xe28892, 0xa1dd},
+ {0xefbc8d, 0xa1dd},

Can't tell what reason there was to do that, but there must have been some. Maybe the Japanese character sets prefer full-width hyphen minus (unicode U+FF0D) over mathematical minus sign (U+2212)?

--
Amit Langote
EDB: http://www.enterprisedb.com
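The codes a1dd (euc_jp) and 817c (sjis) named above are two encodings of the same JIS X 0208 cell, which is why they always travel together in this thread. The following Python sketch shows the textbook arithmetic between the two byte forms. It is an illustration only: the function name `sjis_to_euc_jp` is hypothetical, it covers only the two-byte JIS X 0208 range, and it is not how PostgreSQL performs the conversion (PostgreSQL uses generated map files).

```python
def sjis_to_euc_jp(code: int) -> int:
    """Convert a two-byte Shift_JIS code (JIS X 0208 range) to EUC-JP.

    Textbook arithmetic only; vendor extensions are not handled.
    """
    lead, trail = code >> 8, code & 0xFF
    if not (0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xEF):
        raise ValueError(f"lead byte 0x{lead:02x} is outside JIS X 0208")
    if not 0x40 <= trail <= 0xFC or trail == 0x7F:
        raise ValueError(f"invalid trail byte 0x{trail:02x}")

    # Each Shift_JIS lead byte covers two consecutive JIS rows.
    row_offset = 0x70 if lead <= 0x9F else 0xB0
    jis_row = (lead - row_offset) * 2
    if trail < 0x9F:
        # First (odd) row of the pair; 0x7F is skipped in Shift_JIS.
        jis_row -= 1
        jis_cell = trail - (0x1F if trail < 0x7F else 0x20)
    else:
        # Second (even) row of the pair.
        jis_cell = trail - 0x7E

    # EUC-JP is the JIS X 0208 code with the high bit set on both bytes.
    return ((jis_row | 0x80) << 8) | (jis_cell | 0x80)


if __name__ == "__main__":
    print(f"0x817c@sjis -> 0x{sjis_to_euc_jp(0x817C):04x}@euc-jp")
```

Because this step is purely arithmetic, the choice between U+2212 and U+FF0D only arises where either encoding meets Unicode.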
Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
From
Kyotaro Horiguchi
Date:
Hello.

At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in
> Hi All,
>
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
>
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
>  convert
> ----------
>  \xefbc8d
> (1 row)
>
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS?

No, it's not a bug, but a well-known "design" :(

The mapping is generated from CP932.TXT and JIS0212.TXT by UCS_to_EUC_JP.pl.

The CP932.TXT used here is at:

https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

CP932.TXT maps 0x817C(SJIS) = 0xA1DD(EUC-JP) as follows.

0x817C 0xFF0D #FULLWIDTH HYPHEN-MINUS

> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert function fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR: character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"

U+FF0D(ef bc 8d) is mapped to 0xa1dd@euc-jp
U+2212(e2 88 92) doesn't have a mapping to euc-jp.

> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
>  convert
> ---------
>  \x817c
> (1 row)

It is manually added by UCS_to_SJIS.pl. I'm not sure about the reason, but maybe because it was widely used.

So ping-pong between Unicode and SJIS behaves like this:

U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...

> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?

I think we shouldn't change the authoritative mappings, but maybe we can add some one-way conversions for convenience.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
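The ping-pong described above can be modelled with two small lookup tables. This is a sketch under stated assumptions, not PostgreSQL code: the dictionaries `UNICODE_TO_SJIS` and `SJIS_TO_UNICODE` and the function `ping_pong` are hypothetical, hand-written stand-ins that contain only the entries mentioned in this message (the CP932.TXT pair plus the one-way addition for U+2212).

```python
# Hypothetical tables holding only the characters discussed in this thread.
UNICODE_TO_SJIS = {
    0xFF0D: 0x817C,  # FULLWIDTH HYPHEN-MINUS, two-way pair from CP932.TXT
    0x2212: 0x817C,  # MINUS SIGN, one-way addition (Unicode -> SJIS only)
}
SJIS_TO_UNICODE = {
    0x817C: 0xFF0D,  # the reverse direction knows only the CP932.TXT pair
}


def ping_pong(code_point: int, round_trips: int) -> list[str]:
    """Convert Unicode -> SJIS -> Unicode repeatedly and record each hop."""
    hops = [f"U+{code_point:04x}"]
    current = code_point
    for _ in range(round_trips):
        sjis = UNICODE_TO_SJIS[current]
        hops.append(f"0x{sjis:04x}@sjis")
        current = SJIS_TO_UNICODE[sjis]
        hops.append(f"U+{current:04x}")
    return hops


if __name__ == "__main__":
    print(" => ".join(ping_pong(0x2212, 2)))
```

After the first round trip the value settles on U+ff0d, so the one-way entry accepts U+2212 on input but never produces it on output.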
Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> On Fri, Oct 30, 2020 at 9:44 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> Today while working on some other task related to database encoding, I
>> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
>> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
>> UTF-8. See below:
>> ...
>> Isn't this a bug?

> Can't tell what reason there was to do that, but there must have been
> some. Maybe the Japanese character sets prefer full-width hyphen
> minus (unicode U+FF0D) over mathematical minus sign (U+2212)?

The way it's been explained to me in the past is that the conversion between Unicode and the various Japanese encodings is not as well defined as one could wish, because there are multiple quasi-standard versions of the Japanese encodings. So we shouldn't move too hastily on changing this. Maybe it's really a bug, but maybe there are good reasons.

regards, tom lane
Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
From
Kyotaro Horiguchi
Date:
At Fri, 30 Oct 2020 12:08:51 +0900, Amit Langote <amitlangote09@gmail.com> wrote in
> I noticed that the commit a8bd7e1c6e02 from ages ago removed
> conversions from and to utf-8's e28892, in favor of efbc8d, and that
> change has stuck. (Note though that these maps looked pretty
> different back then.)
>
> --- a/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
> +++ b/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
> - {0xa1dd, 0xe28892},
> + {0xa1dd, 0xefbc8d},
>
> --- a/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
> +++ b/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
> - {0xe28892, 0xa1dd},
> + {0xefbc8d, 0xa1dd},
>
> Can't tell what reason there was to do that, but there must have been
> some. Maybe the Japanese character sets prefer full-width hyphen
> minus (unicode U+FF0D) over mathematical minus sign (U+2212)?

It's a decision made by Microsoft. Several other characters have similar issues. I remember many people complained, but in the end that wasn't "fixed", and it led to the well-known mess of Japanese character conversion involving Unicode in Java.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
From
Tatsuo Ishii
Date:
> Hi All,
>
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
>
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
>  convert
> ----------
>  \xefbc8d
> (1 row)
>
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS?

Yeah. Originally EUC_JP 0xa1dd was converted to UTF8 0xe28892. At some point, someone changed the mapping, and now you see the result.

> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert function fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR: character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"

Again, originally UTF8 0xe28892 was converted to EUC_JP 0xa1dd. At some point, someone changed the mapping.

> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
>  convert
> ---------
>  \x817c
> (1 row)
>
> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?

Agreed.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
From
Tatsuo Ishii
Date:
> The mapping is generated from CP932.TXT and JIS0212.TXT by
> UCS_to_EUC_JP.pl.

I still don't understand why this change was made. Originally the conversion was based on JIS0208.txt, JIS0212.txt and JIS0201.txt, which is the exact definition of EUC-JP. CP932.txt is defined by Microsoft for their products.

Probably we should call our "EUC-JP" something like "EUC-JP-MS" or whatever to differentiate it from true EUC-JP.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
From
Ashutosh Sharma
Date:
On Fri, Oct 30, 2020 at 8:49 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> Hello.
>
> At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in
> > Hi All,
> >
> > Today while working on some other task related to database encoding, I
> > noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> > mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> > UTF-8. See below:
> >
> > postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
> >  convert
> > ----------
> >  \xefbc8d
> > (1 row)
> >
> > Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> > (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> > HYPHEN-MINUS?
>
> No it's not a bug, but a well-known "design":(
>
> The mapping is generated from CP932.TXT and JIS0212.TXT by
> UCS_to_EUC_JP.pl.
>
> CP932.TXT used here is here.
>
> https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
>
> CP932.TXT maps 0x817C(SJIS) = 0xA1DD(EUC-JP) as follows.
>
> 0x817C 0xFF0D #FULLWIDTH HYPHEN-MINUS
>

We do have MINUS SIGN (U+2212) defined in both UTF-8 and EUC-JP encoding. So, I'm not sure why converting MINUS SIGN from UTF-8 to EUC-JP should throw an error saying: "... in encoding UTF8 has *no* equivalent in EUC_JP". I mean this information looks misleading, and that's the reason I feel it's a bug.

> > When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> > converted to EUC-JP, the convert function fails with an error saying:
> > "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> > equivalent in encoding EUC_JP". See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> > ERROR: character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> > has no equivalent in encoding "EUC_JP"
>
> U+FF0D(ef bc 8d) is mapped to 0xa1dd@euc-jp
> U+2212(e2 88 92) doesn't have a mapping between euc-jp.
>
> > However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> > encoding, the convert function returns the correct result. See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'sjis');
> >  convert
> > ---------
> >  \x817c
> > (1 row)
>
> It is manually added by UCS_to_SJIS.pl. I'm not sure about the reason
> but maybe because it was used widely.
>
> So ping-pong between Unicode and SJIS behaves like this:
>
> U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...
>
> > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > MINUS SIGN in SJIS and that is what we expect. Isn't it?
>
> I think we don't change authoritative mappings, but maybe can add some
> one-way conversions for the convenience.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
From
Kyotaro Horiguchi
Date:
At Fri, 30 Oct 2020 13:17:08 +0900 (JST), Tatsuo Ishii <ishii@sraoss.co.jp> wrote in
> > The mapping is generated from CP932.TXT and JIS0212.TXT by
> > UCS_to_EUC_JP.pl.
>
> I still don't understand why this change has been made. Originally the
> conversion was based on JIS0208.txt, JIS0212.txt and JIS0201.txt,
> which is the exact definition of EUC-JP. CP932.txt is defined by
> Microsoft for their products.
>
> Probably we should call our "EUC-JP" something like "EUC-JP-MS" or
> whatever to differentiate from true EUC-JP.

Seems valid. Things were already this way when aeed17d was introduced (I believe it didn't make any difference in conversions), and the change was made by a8bd7e1c6e in 2002. I'm not sure what the point of the change was, though.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
From
Amit Langote
Date:
On Fri, Oct 30, 2020 at 12:20 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in
> > However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> > encoding, the convert function returns the correct result. See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'sjis');
> >  convert
> > ---------
> >  \x817c
> > (1 row)
>
> It is manually added by UCS_to_SJIS.pl. I'm not sure about the reason
> but maybe because it was used widely.
>
> So ping-pong between Unicode and SJIS behaves like this:
>
> U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...

Is it the following piece of code in UCS_to_SJIS.pl that manually adds the mapping?

# Add these UTF8->SJIS pairs to the table.
push @$mapping,
...
  {
    direction => FROM_UNICODE,
    ucs => 0x2212,
    code => 0x817c,
    comment => '# MINUS SIGN',
    f => $this_script,
    l => __LINE__
  },

Given that U+2212 is encoded by e28892 in utf8, I assume that's how utf8_to_sjis.map ends up with the following mapping into sjis for that byte sequence:

/*** Three byte table, leaf: e288xx - offset 0x004ee ***/

/* 80 */ 0x81cd, 0x0000, 0x81dd, 0x81ce, 0x0000, 0x0000, 0x0000, 0x81de,
/* 88 */ 0x81b8, 0x0000, 0x0000, 0x81b9, 0x0000, 0x0000, 0x0000, 0x0000,
/* 90 */ 0x0000, 0x8794, "0x817c", ...

> > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > MINUS SIGN in SJIS and that is what we expect. Isn't it?
>
> I think we don't change authoritative mappings, but maybe can add some
> one-way conversions for the convenience.

Maybe UCS_to_EUC_JP.pl could do something like the above.

Are there other cases that were fixed like this in the past, either for euc_jp or sjis?

--
Amit Langote
EDB: http://www.enterprisedb.com
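To make the table excerpt above easier to read, here is a small Python sketch of how the third byte of e28892 selects the 0x817c entry. It is heavily simplified and rests on assumptions: it indexes only the leaf rows quoted in this message, treats 0x0000 as "no mapping", and ignores the upper levels and offsets of the real radix tree. The names `LEAF_E288_EXCERPT` and `lookup_e288` are invented for illustration.

```python
# Leaf entries for UTF-8 sequences e2 88 xx, copied from the excerpt above.
# The excerpt is truncated after the entry for 0x92, so nothing beyond it
# is included here.
LEAF_E288_EXCERPT = [
    0x81CD, 0x0000, 0x81DD, 0x81CE, 0x0000, 0x0000, 0x0000, 0x81DE,  # 80
    0x81B8, 0x0000, 0x0000, 0x81B9, 0x0000, 0x0000, 0x0000, 0x0000,  # 88
    0x0000, 0x8794, 0x817C,                                          # 90
]


def lookup_e288(last_byte: int):
    """Return the SJIS code for UTF-8 bytes e2 88 <last_byte>, or None.

    Assumption: a stored 0x0000 means the sequence has no SJIS equivalent.
    """
    index = last_byte - 0x80  # continuation bytes start at 0x80
    if not 0 <= index < len(LEAF_E288_EXCERPT):
        raise KeyError(f"byte 0x{last_byte:02x} is outside the quoted excerpt")
    value = LEAF_E288_EXCERPT[index]
    return value or None


if __name__ == "__main__":
    sjis = lookup_e288(0x92)  # U+2212 MINUS SIGN is e2 88 92 in UTF-8
    print(f"e28892 -> 0x{sjis:04x}")
```

Under these assumptions, the entry at position 0x92 is the one produced by the hand-added FROM_UNICODE pair.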
Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
From
Kyotaro Horiguchi
Date:
At Fri, 30 Oct 2020 14:38:30 +0900, Amit Langote <amitlangote09@gmail.com> wrote in
> On Fri, Oct 30, 2020 at 12:20 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > So ping-pong between Unicode and SJIS behaves like this:
> >
> > U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...
>
> Is it the following piece of code in UCS_to_SJIS.pl that manually adds
> the mapping?

Yes.

> # Add these UTF8->SJIS pairs to the table.
> push @$mapping,
> ...
>   {
>     direction => FROM_UNICODE,
>     ucs => 0x2212,
>     code => 0x817c,
>     comment => '# MINUS SIGN',
>     f => $this_script,
>     l => __LINE__
>   },
>
> Given that U+2212 is encoded by e28892 in utf8, I assume that's how
> utf8_to_sjis.map ends up with the following mapping into sjis for that
> byte sequence:
>
> /*** Three byte table, leaf: e288xx - offset 0x004ee ***/
>
> /* 80 */ 0x81cd, 0x0000, 0x81dd, 0x81ce, 0x0000, 0x0000, 0x0000, 0x81de,
> /* 88 */ 0x81b8, 0x0000, 0x0000, 0x81b9, 0x0000, 0x0000, 0x0000, 0x0000,
> /* 90 */ 0x0000, 0x8794, "0x817c", ...

I'm not sure how we should construct our own mapping, but if we simply moved to a JIS0208.TXT-based mapping as Ishii-san suggested, the differences in the mapping would be as follows.

1. The following codes (regions) are not defined in JIS0208.

8ea1 - 8edf (up to 64 characters (I didn't actually count them.))
ada1 - adfc (up to 92 characters (ditto))
8ff3f3 - 8ff4a8 (up to 182 characters (ditto))

a1c0 ff3c: (ff3c: FULLWIDTH REVERSE SOLIDUS)
8ff4aa ff07: (ff07: FULLWIDTH APOSTROPHE)

2. some individual differences

  EUC   0208  932
  a1c1  301c  ff5e: (301c: WAVE DASH)
  a1c2  2016  2225: (2016: DOUBLE VERTICAL LINE) : (2225: PARALLEL TO)
* a1dd  2212  ff0d: (2212: MINUS SIGN) : (ff0d: FULLWIDTH HYPHEN-MINUS)
  a1f1  a2    ffe0: (00a2: CENT SIGN) : (ffe0: FULLWIDTH CENT SIGN)
  a1f2  a3    ffe1: (00a3: POUND SIGN) : (ffe1: FULLWIDTH POUND SIGN)
  a2cc  ac    ffe2: (00ac: NOT SIGN) : (ffe2: FULLWIDTH NOT SIGN)

*1: https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT

> > > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > > MINUS SIGN in SJIS and that is what we expect. Isn't it?
> >
> > I think we don't change authoritative mappings, but maybe can add some
> > one-way conversions for the convenience.
>
> Maybe UCS_to_EUC_JP.pl could do something like the above.
>
> Are there other cases that were fixed like this in the past, either
> for euc_jp or sjis?

Honestly, I don't know how the mapping was decided in 2002, but removing the regions in 1 would cause confusion. So what we can do in this area would be changing some of 2 to the 0208 mapping. But an arbitrary mixture of different mappings would cause new problems.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
From
Kyotaro Horiguchi
Date:
At Fri, 30 Oct 2020 16:33:01 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Fri, 30 Oct 2020 14:38:30 +0900, Amit Langote <amitlangote09@gmail.com> wrote in
> I'm not sure how we should construct our own mapping, but the
> difference made by we simply moved to JIS0208.TXT based as Ishii-san
> suggested the differences in the mapping would be as the follows.

Mmm.. I'm not sure how we should construct our own mapping, but if we simply move to a JIS0208.TXT-based mapping as Ishii-san suggested, the following differences would be seen in the mappings.

> 1. The following codes (regions) are not defined in JIS0208.
>
> 8ea1 - 8edf (up to 64 characters (I didn't actually count them.))
> ada1 - adfc (up to 92 characters (ditto))
> 8ff3f3 - 8ff4a8 (up to 182 characters (ditto))

8ea1 - 8edf (64 chars, U+ff61 - U+ff9f) (hankaku-kana)
ada1 - adfc (83 chars, U+2460 - U+33a1) (numbers with circle)
8ff3f3 - 8ff4a8 (20 chars, U+2160 - U+2179) (roman numerals)

> a1c0 ff3c: (ff3c: FULLWIDTH REVERSE SOLIDUS)
> 8ff4aa ff07: (ff07: FULLWIDTH APOSTROPHE)
>
> 2. some individual differences
>
>   EUC   0208  932
>   a1c1  301c  ff5e: (301c: WAVE DASH)
>   a1c2  2016  2225: (2016: DOUBLE VERTICAL LINE) : (2225: PARALLEL TO)
> * a1dd  2212  ff0d: (2212: MINUS SIGN) : (ff0d: FULLWIDTH HYPHEN-MINUS)
>   a1f1  a2    ffe0: (00a2: CENT SIGN) : (ffe0: FULLWIDTH CENT SIGN)
>   a1f2  a3    ffe1: (00a3: POUND SIGN) : (ffe1: FULLWIDTH POUND SIGN)
>   a2cc  ac    ffe2: (00ac: NOT SIGN) : (ffe2: FULLWIDTH NOT SIGN)
>
>
> *1: https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
>
> > > > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > > > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > > > MINUS SIGN in SJIS and that is what we expect. Isn't it?
> > >
> > > I think we don't change authoritative mappings, but maybe can add some
> > > one-way conversions for the convenience.
> >
> > Maybe UCS_to_EUC_JP.pl could do something like the above.
> >
> > Are there other cases that were fixed like this in the past, either
> > for euc_jp or sjis?
>
> Honestly, I don't know how the mapping was decided in 2002, but
> removing the regions in 1 would cause confusion. So what we can do in
> this area would be changing some of 2 to 0208 mapping. But arbitrary
> mixture of different mappings would cause new problem..

I forgot about adding one-way mappings. I think we can add several such mappings, say:

U+301c->: EUC:a1c1 <-> U+ff5e
U+2016->: EUC:a1c2 <-> U+2225
U+2212->: EUC:a1dd <-> U+ff0d
U+00a2->: EUC:a1f1 <-> U+ffe0
U+00a3->: EUC:a1f2 <-> U+ffe1
U+00ac->: EUC:a2cc <-> U+ffe2

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
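As a rough illustration of the one-way idea proposed above, the Python sketch below builds a Unicode-to-EUC-JP table that accepts both code points of each pair and an EUC-JP-to-Unicode table that keeps only the existing CP932-style target. This is a hypothetical model, not a patch: `PROPOSED` and `build_maps` are invented names. WAVE DASH is written as U+301C, and the codes for CENT SIGN and POUND SIGN are written as a1f1 and a1f2 (JIS X 0208 positions 0x2171 and 0x2172 with the high bits set).

```python
# (two-way Unicode code point, EUC-JP code, additional one-way source)
PROPOSED = [
    (0xFF5E, 0xA1C1, 0x301C),  # WAVE DASH           -> a1c1
    (0x2225, 0xA1C2, 0x2016),  # DOUBLE VERTICAL LINE -> a1c2
    (0xFF0D, 0xA1DD, 0x2212),  # MINUS SIGN          -> a1dd
    (0xFFE0, 0xA1F1, 0x00A2),  # CENT SIGN           -> a1f1
    (0xFFE1, 0xA1F2, 0x00A3),  # POUND SIGN          -> a1f2
    (0xFFE2, 0xA2CC, 0x00AC),  # NOT SIGN            -> a2cc
]


def build_maps(entries):
    """Return (unicode_to_euc, euc_to_unicode) dictionaries."""
    unicode_to_euc = {}
    euc_to_unicode = {}
    for two_way, euc, one_way in entries:
        euc_to_unicode[euc] = two_way   # authoritative direction is unchanged
        unicode_to_euc[two_way] = euc   # existing two-way pair
        unicode_to_euc[one_way] = euc   # extra entry, Unicode -> EUC-JP only
    return unicode_to_euc, euc_to_unicode


if __name__ == "__main__":
    to_euc, from_euc = build_maps(PROPOSED)
    euc = to_euc[0x2212]
    print(f"U+2212 -> 0x{euc:04x}@euc-jp -> U+{from_euc[euc]:04x}")
```

With tables like these, converting U+2212 to EUC-JP would no longer raise an error, while a round trip would still come back as U+ff0d, which matches the ping-pong behaviour already seen for SJIS.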