BUG #17611: SJIS conversion rule about duplicated characters differ from Windows - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
Msg-id 17611-472d27cf395135b7@postgresql.org
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      17611
Logged by:          yusuke egashira
Email address:      egashira.yusuke@fujitsu.com
PostgreSQL version: 12.11
Operating system:   RHEL7(Server) and Windows10(Client)
Description:

SJIS (Windows-31J) defines several characters that share the same 
glyph but have different code points. The SJIS conversion rules used 
by PostgreSQL's client_encoding seem to differ slightly from the 
rules used by Windows.

In some cases, this causes problems for Windows users. 
For example, some text editors cannot display these characters, and 
.NET applications raise exceptions when converting SJIS byte 
sequences to UTF-16 (the String type). This can happen when using Npgsql[1].

.NET code:
----
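using System.Text;

// sjis_byte_sequence is a byte[] holding SJIS-encoded text.
// The exception fallbacks make unmappable bytes throw instead of being replaced.
Encoding e = Encoding.GetEncoding("shift_jis",
    EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);
var utfString = e.GetString(sjis_byte_sequence);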
----

Exception:
----
Exception thrown: 'System.Text.DecoderFallbackException' in mscorlib.dll
An unhandled exception of type 'System.Text.DecoderFallbackException'
occurred in mscorlib.dll
Unable to translate bytes [FA][4A] at index 1632 from specified code page to
Unicode.
----

My customers have difficulty handling SJIS text in Windows 
applications because of this difference in conversion rules. 
They are migrating from Oracle, and many of their applications are 
written for an SJIS environment.



On Windows, the rules for converting a Unicode character that has 
duplicate code points in SJIS seem to be as follows[2]: 

1. If the character is in both JIS X 0208 and NEC special characters, 
   use the code point of JIS X 0208.
2. If the character is in both NEC special characters and IBM selected 
   characters, use the code point of NEC special characters.
3. If the character is in both IBM selected characters and 
   NEC selected-IBM extended characters, use the code point of 
   IBM selected characters.
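
The three rules amount to a fixed priority order among the CP932 regions.
A minimal Python sketch of that ordering follows. The lead-byte ranges used
to classify a code point and the names `region` and `pick_code_point` are
assumptions made for illustration; they are not taken from Windows or from
PostgreSQL.

```python
# Sketch only: the lead-byte ranges are assumed from the usual CP932 layout,
# and the function names are hypothetical.

# Lower number = preferred when one Unicode character has several SJIS codes.
PRIORITY = {
    "JIS X 0208": 0,
    "NEC special": 1,
    "IBM selected": 2,
    "NEC selected-IBM extended": 3,
}


def region(code: int) -> str:
    """Classify a two-byte SJIS code point by its lead byte."""
    lead = code >> 8
    if lead == 0x87:
        return "NEC special"
    if lead in (0xED, 0xEE):
        return "NEC selected-IBM extended"
    if 0xFA <= lead <= 0xFC:
        return "IBM selected"
    if 0x81 <= lead <= 0x84 or 0x88 <= lead <= 0x9F or 0xE0 <= lead <= 0xEA:
        return "JIS X 0208"
    raise ValueError(f"unexpected SJIS code point: {code:#06x}")


def pick_code_point(candidates: list[int]) -> int:
    """Return the candidate that the three rules prefer."""
    return min(candidates, key=lambda code: PRIORITY[region(code)])


# Rule 2, using the first duplicated pair listed in this report.
print(hex(pick_code_point([0xFA4A, 0x8754])))  # 0x8754
```

The same function covers rules 1 and 3: a JIS X 0208 code point wins over a
NEC special one, and an IBM selected code point wins over a NEC selected-IBM
extended one.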

However, the rules PostgreSQL uses for converting from Unicode to SJIS 
seem to differ from the second rule above.
The SJIS code points affected by the second rule are listed below:
- "NEC special characters" : 0x8754 - 0x875D, 0x8782, 0x8784, 0x878A
- "IBM selected characters": 0xFA4A - 0xFA53, 0xFA59, 0xFA5A, 0xFA58
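
The pairing between the two lists can be checked outside PostgreSQL. The
sketch below uses Python's built-in `cp932` codec as a stand-in for the
Windows table, which is an assumption. It decodes each NEC/IBM pair,
confirms that both decode to the same character, and reports which code
point the codec chooses when encoding that character back.

```python
# Sketch only: Python's "cp932" codec is assumed to approximate the Windows
# conversion table; it is not the Windows API itself.

# Code points from the report, paired position by position.
NEC_SPECIAL = [*range(0x8754, 0x875E), 0x8782, 0x8784, 0x878A]
IBM_SELECTED = [*range(0xFA4A, 0xFA54), 0xFA59, 0xFA5A, 0xFA58]


def check_pairs():
    """Return (nec, ibm, re-encoded) for every duplicated pair."""
    rows = []
    for nec, ibm in zip(NEC_SPECIAL, IBM_SELECTED):
        from_nec = nec.to_bytes(2, "big").decode("cp932")
        from_ibm = ibm.to_bytes(2, "big").decode("cp932")
        if from_nec != from_ibm:
            raise ValueError(f"{nec:#06x} and {ibm:#06x} are not duplicates")
        # Encode the character decoded from the IBM code point back to SJIS.
        back = int.from_bytes(from_ibm.encode("cp932"), "big")
        rows.append((nec, ibm, back))
    return rows


for nec, ibm, back in check_pairs():
    print(f"NEC {nec:#06x}  IBM {ibm:#06x}  re-encoded as {back:#06x}")
```

If the codec follows the second rule, the last column shows the NEC code
point for every pair.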

In src/backend/utils/mb/Unicode/UCS_to_SJIS.pl, the @reject_sjis array 
defines the code points that are not used when converting Unicode to SJIS.
According to the second rule above, @reject_sjis should contain the 
"IBM selected characters", but it currently contains the "NEC special
characters".
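
To make the effect of such a reject list concrete, here is a simplified
Python model. It is not the logic of UCS_to_SJIS.pl itself: the function
name `build_unicode_to_sjis` and the two-entry table are hypothetical, and
the only mapping used is the first duplicated pair above, both of which
decode to U+2160 (ROMAN NUMERAL ONE).

```python
# Simplified model of a reject list, not the actual Perl script.
# (SJIS code point, Unicode code point) pairs; both decode to U+2160.
TABLE = [(0x8754, 0x2160), (0xFA4A, 0x2160)]


def build_unicode_to_sjis(table, reject_sjis):
    """Build the Unicode -> SJIS map, skipping rejected SJIS code points."""
    mapping = {}
    for sjis, ucs in table:
        if sjis in reject_sjis:
            continue  # skipped for the encoding direction only
        if ucs in mapping:
            raise ValueError(f"U+{ucs:04X} still has two SJIS candidates")
        mapping[ucs] = sjis
    return mapping


# Rejecting the NEC code point (current behaviour as described in the report).
current = build_unicode_to_sjis(TABLE, reject_sjis={0x8754})
# Rejecting the IBM code point (what the second rule calls for).
proposed = build_unicode_to_sjis(TABLE, reject_sjis={0xFA4A})

print(hex(current[0x2160]), hex(proposed[0x2160]))  # 0xfa4a 0x8754
```

In this model, changing which of the two lists is rejected is all it takes
to change the encoded result for the affected characters.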

The current PostgreSQL rule for converting characters with duplicate 
definitions seems to have been introduced by commit 
5735c4cf3d059914e2b9d294203aa06fb2c4ac75, back in 2001, but I could not 
find the reason for it in the mailing list archives. 
I think this conversion difference is a bug, 
or is there a clear reason for the rule?


[1] https://www.npgsql.org/
[2] https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-cp932.html

