BUG #17611: SJIS conversion rule about duplicated characters differ from Windows - Mailing list pgsql-bugs

From	PG Bug reporting form
Subject	BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
Date	September 8, 2022 14:33:17
Msg-id	17611-472d27cf395135b7@postgresql.org
Responses	Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
List	pgsql-bugs

The following bug has been logged on the website:

Bug reference:      17611
Logged by:          yusuke egashira
Email address:      egashira.yusuke@fujitsu.com
PostgreSQL version: 12.11
Operating system:   RHEL7(Server) and Windows10(Client)
Description:

SJIS (Windows-31J) defines several characters that have the 
same glyph but different code points. The SJIS conversion 
rules used by PostgreSQL's client_encoding seem to differ slightly 
from the rules in the Windows OS.

In some cases, this causes problems for Windows users. 
For example, some text editors can't display these characters, and 
.NET applications raise exceptions when converting SJIS byte 
sequences to UTF-16 (the String type). This can happen when using Npgsql[1].

.NET code:
----
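using System.Text;

// Strict decoder: throw instead of substituting a replacement character
// for bytes that cannot be decoded.
Encoding e = Encoding.GetEncoding("shift_jis",
    EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);
var utfString = e.GetString(sjis_byte_sequence);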
----

Exception:
----
Exception thrown: 'System.Text.DecoderFallbackException' in mscorlib.dll
An unhandled exception of type 'System.Text.DecoderFallbackException'
occurred in mscorlib.dll
Unable to translate bytes [FA][4A] at index 1632 from specified code page to
Unicode.
----

My customers have difficulty handling SJIS data in Windows 
applications because of this difference in conversion rules. 
They are migrating from Oracle, and many of their applications are 
written for an SJIS environment.



On Windows, the rules for converting a Unicode character that has 
duplicate definitions in SJIS seem to be as follows[2]:

1. If the character is in both JIS X 0208 and NEC special characters, 
   use the code point of JIS X 0208.
2. If the character is in both NEC special characters and IBM selected 
   characters, use the code point of NEC special characters.
3. If the character is in both IBM selected characters and 
   NEC selected-IBM extended characters, use the code point of 
   IBM selected characters.
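
The three rules amount to a fixed priority among four regions of the code page. Below is a minimal Python sketch of that priority. It assumes a region can be identified from the lead byte alone (0x87 for NEC special characters, 0xED-0xEE for NEC selected-IBM extended characters, 0xFA-0xFC for IBM selected characters, and 0x81-0x84, 0x88-0x9F, 0xE0-0xEA for JIS X 0208). The names `region` and `preferred` are hypothetical; this is not code from PostgreSQL or Windows.

```python
# Priority among duplicate encodings, highest first (rules 1-3 above).
PRIORITY = [
    "JIS X 0208",
    "NEC special",
    "IBM selected",
    "NEC selected-IBM extended",
]


def region(code: int) -> str:
    """Classify a two-byte SJIS code by its lead byte (assumed layout)."""
    lead = code >> 8
    if 0x81 <= lead <= 0x84 or 0x88 <= lead <= 0x9F or 0xE0 <= lead <= 0xEA:
        return "JIS X 0208"
    if lead == 0x87:
        return "NEC special"
    if 0xED <= lead <= 0xEE:
        return "NEC selected-IBM extended"
    if 0xFA <= lead <= 0xFC:
        return "IBM selected"
    raise ValueError(f"0x{code:04X} is outside the regions considered here")


def preferred(candidates):
    """Pick the code point the rules above would select among duplicates."""
    return min(candidates, key=lambda c: PRIORITY.index(region(c)))
```

For example, `preferred([0xFA4A, 0x8754])` returns 0x8754, the NEC special character code point, which is the choice described by rule 2.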

However, PostgreSQL's rules for converting from Unicode to SJIS 
seem to differ from the second rule above.
The SJIS code points affected by the second rule are listed below:
- "NEC special characters" : 0x8754 - 0x875D, 0x8782, 0x8784, 0x878A
- "IBM selected characters": 0xFA4A - 0xFA53, 0xFA59, 0xFA5A, 0xFA58

In src/backend/utils/mb/Unicode/UCS_to_SJIS.pl, the @reject_sjis array 
defines the code points that are not used when converting Unicode to SJIS.
According to the second rule above, the @reject_sjis array should contain 
the "IBM selected characters", but it currently contains the "NEC special
characters".
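
To make the affected code points concrete, here is a small Python sketch that expands the two lists above and pairs them position by position. The pairing is an assumption based on the order in which the lists are written, and the names are hypothetical. It only illustrates which set rule 2 would leave unused; it is not the contents of UCS_to_SJIS.pl.

```python
def expand(first: int, last: int) -> list:
    """Return every code point in the inclusive range first..last."""
    return list(range(first, last + 1))


# Code points copied from the two lists above.
NEC_SPECIAL = expand(0x8754, 0x875D) + [0x8782, 0x8784, 0x878A]
IBM_SELECTED = expand(0xFA4A, 0xFA53) + [0xFA59, 0xFA5A, 0xFA58]

# Assumed correspondence: the i-th NEC entry duplicates the i-th IBM entry.
PAIRS = list(zip(NEC_SPECIAL, IBM_SELECTED))

# Under rule 2 the NEC special code point is the conversion target,
# so the IBM selected code point is the one that goes unused.
UNUSED_UNDER_RULE_2 = [ibm for _nec, ibm in PAIRS]

if __name__ == "__main__":
    for nec, ibm in PAIRS:
        print(f"keep 0x{nec:04X}, leave 0x{ibm:04X} unused")
```

The two lists contain 13 code points each, so the sketch yields 13 pairs, starting with 0x8754 and 0xFA4A.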

The current PostgreSQL rules for converting characters with duplicate
definitions seem to have been introduced by commit
5735c4cf3d059914e2b9d294203aa06fb2c4ac75, back in 2001, but I could not
find the reason for them in the past mailing list logs.
I think this conversion difference is a bug, 
but is there some clear reason for the rule?


[1] https://www.npgsql.org/
[2] https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-cp932.html
