Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8 - Mailing list pgsql-hackers

Hello.

At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in 
> Hi All,
> 
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
> 
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
>  convert
> ----------
>  \xefbc8d
> (1 row)
> 
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS?

No, it's not a bug but a well-known "design" :(

The mapping is generated from CP932.TXT and JIS0212.TXT by
UCS_to_EUC_JP.pl.

The CP932.TXT used is this one:

https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

CP932.TXT maps 0x817C (SJIS), which corresponds to 0xA1DD (EUC-JP), as follows:

0x817C    0xFF0D    #FULLWIDTH HYPHEN-MINUS
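
To make the 0x817C (SJIS) = 0xA1DD (EUC-JP) correspondence concrete, here is a
rough Python sketch. It is not the Perl script itself: the function names are
made up for illustration, and it only applies the usual Shift_JIS to JIS X 0208
arithmetic to one line in the CP932.TXT format.

```python
# Sketch only: the real generator is a Perl script; these names are hypothetical.

def parse_mapping_line(line):
    """Parse one '0xSJIS  0xUCS  #comment' line; return (sjis, ucs) or None."""
    body = line.split("#", 1)[0].split()
    if len(body) < 2:
        return None  # comment-only line or a code with no Unicode mapping
    return int(body[0], 16), int(body[1], 16)


def sjis_to_euc_jp(sjis):
    """Convert a two-byte Shift_JIS code for a JIS X 0208 character to EUC-JP."""
    s1, s2 = sjis >> 8, sjis & 0xFF
    # Each lead byte (0x81-0x9F, 0xE0-0xEF) covers two consecutive JIS rows.
    j1 = ((s1 - 0x81) if s1 <= 0x9F else (s1 - 0xC1)) * 2 + 0x21
    if s2 >= 0x9F:
        j1 += 1              # trail bytes 0x9F-0xFC belong to the second row
        j2 = s2 - 0x7E
    else:
        j2 = s2 - 0x1F if s2 <= 0x7E else s2 - 0x20  # 0x7F is skipped in SJIS
    # EUC-JP is the JIS code with the high bit set on both bytes.
    return ((j1 | 0x80) << 8) | (j2 | 0x80)


sjis, ucs = parse_mapping_line("0x817C    0xFF0D    #FULLWIDTH HYPHEN-MINUS")
print(f"SJIS 0x{sjis:04X} -> EUC-JP 0x{sjis_to_euc_jp(sjis):04X} -> U+{ucs:04X}")
```

So the EUC-JP entry for 0xA1DD simply inherits whatever Unicode code point
CP932.TXT assigns to 0x817C.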

> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert function fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"

U+FF0D (ef bc 8d) is mapped to 0xa1dd@euc-jp.
U+2212 (e2 88 92) has no mapping to euc-jp.
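
As a toy model of that lookup (a hand-written one-entry table, not the
generated map, with an invented function name and a simplified error text),
something like this shows why one byte sequence converts and the other is
rejected:

```python
# Illustrative only: a tiny, hand-written slice of a UTF8 -> EUC_JP table.
UTF8_TO_EUC_JP = {
    "\uff0d".encode("utf-8"): bytes.fromhex("a1dd"),  # FULLWIDTH HYPHEN-MINUS
    # No entry for U+2212 MINUS SIGN, matching the situation described above.
}


def utf8_to_euc_jp(seq):
    """Look up one UTF-8 byte sequence; raise if the table has no entry."""
    try:
        return UTF8_TO_EUC_JP[seq]
    except KeyError:
        hexed = " ".join(f"0x{b:02x}" for b in seq)
        raise ValueError(f"byte sequence {hexed} has no equivalent") from None


print("\uff0d".encode("utf-8").hex())  # efbc8d
print("\u2212".encode("utf-8").hex())  # e28892
print(utf8_to_euc_jp(bytes.fromhex("efbc8d")).hex())  # a1dd

try:
    utf8_to_euc_jp(bytes.fromhex("e28892"))
except ValueError as err:
    print(err)
```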

> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
> 
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
>  convert
> ---------
>  \x817c
> (1 row)

That mapping is added manually by UCS_to_SJIS.pl. I'm not sure of the
reason, but maybe it is because the character was widely used.

So ping-pong between Unicode and SJIS behaves like this:

U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...
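
The same chain can be reproduced with a two-character model of the tables. This
is only a sketch: the dictionaries are written by hand for these two code
points, and ping_pong is a hypothetical helper, not anything in the tree.

```python
# Hand-written two-character model of the tables; not the generated maps.
UCS_TO_SJIS = {
    0xFF0D: 0x817C,  # from CP932.TXT, round-trips
    0x2212: 0x817C,  # extra one-way entry of the kind added by hand
}
SJIS_TO_UCS = {
    0x817C: 0xFF0D,  # from CP932.TXT
}


def ping_pong(ucs, hops):
    """Alternate UCS -> SJIS -> UCS and return the sequence of codes seen."""
    seen = [f"U+{ucs:04X}"]
    for _ in range(hops):
        sjis = UCS_TO_SJIS[ucs]
        ucs = SJIS_TO_UCS[sjis]
        seen += [f"0x{sjis:04x}@sjis", f"U+{ucs:04X}"]
    return seen


print(" => ".join(ping_pong(0x2212, 2)))
```

After the first hop the value never returns to U+2212, because the reverse
table has a single entry for 0x817c.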

> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?

I think we don't change authoritative mappings, but maybe we can add
some one-way conversions for convenience.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


