Thread: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows

BUG #17611: SJIS conversion rule about duplicated characters differ from Windows

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17611
Logged by:          yusuke egashira
Email address:      egashira.yusuke@fujitsu.com
PostgreSQL version: 12.11
Operating system:   RHEL7(Server) and Windows10(Client)
Description:

SJIS(Windows-31J) has several defined characters that has the 
same glyph but a different code point for it. The SJIS conversion 
rules in PostgreSQL's client_encoding seem to be slightly different 
from the rules in the Windows OS.

In some cases, it causes a bad thing for Windows users. 
For example, some text editors can't display these characters, and 
.NET applications raise exceptions when converting SJIS byte 
sequences to UTF16 (String type). This can happen when using Npgsql[1].

.NET code:
----
Encoding e = Encoding.GetEncoding("shift_jis",
    EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);
var utfString = e.GetString(sjis_byte_sequence);
----

Exception:
----
Exception thrown: 'System.Text.DecoderFallbackException' in mscorlib.dll
An unhandled exception of type 'System.Text.DecoderFallbackException'
occurred in mscorlib.dll
Unable to translate bytes [FA][4A] at index 1632 from specified code page to
Unicode.
----

My customers have difficulty dealing with SJIS code in Windows 
applications because of this difference in conversion rules. 
They are migrating from Oracle and many of the applications are 
written for the SJIS environment.



The rules for converting from Unicode to characters that are 
duplicated in SJIS seem to be as follows in Windows[2]: 

1. If the character is in both JIS X 0208 and NEC special characters, 
   use the code point of JIS X 0208.
2. If the character is in both NEC special characters and IBM selected 
   characters, use the code point of NEC special characters.
3. If the character is in both IBM selected characters and 
   NEC selected-IBM extended characters, use the code point of 
   IBM selected characters.

However, the rules for converting from Unicode to SJIS in PostgreSQL 
seem to differ from the above second rule.
SJIS codepoints corresponding to the second rule are listed below:
- "NEC special characters" : 0x8754 - 0x875D, 0x8782, 0x8784, 0x878A
- "IBM selected characters": 0xFA4A - 0xFA53, 0xFA59, 0xFA5A, 0xFA58

In src/backend/utils/mb/Unicode/UCS_to_SJIS.pl, @reject_sjis array 
defines the not used code points when converting Unicode to SJIS.
According to the second rule above, the @reject_sjis array must contain 
"IBM selected characters", but it currently contains "NEC special
characters".

The current PostgreSQL rules for converting duplicate definition characters

seems to be introduced by 5735c4cf3d059914e2b9d294203aa06fb2c4ac75, 
back in 2001, but I could not be found reason for it in past mailing list
logs. 
I think this conversion difference is a bug, 
but is it a rule with some clear reason?


[1] https://www.npgsql.org/
[2] https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-cp932.html


Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows

From
Kyotaro Horiguchi
Date:
This is not a bug, but the designed behavior. But we could change that
conversion table if a plausible reasoning is raised.

At Thu, 08 Sep 2022 11:33:17 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in 
> SJIS(Windows-31J) has several defined characters that has the 
> same glyph but a different code point for it. The SJIS conversion 
> rules in PostgreSQL's client_encoding seem to be slightly different 
> from the rules in the Windows OS.

PostgreSQL follows CP932. And no rule on the precedence between
duplicate characters is published as a public standard. According to
[2], it is published as Microsoft's recommended convention.

> In some cases, it causes a bad thing for Windows users. 
> For example, some text editors can't display these characters, and
> .NET applications raise exceptions when converting SJIS byte 
> sequences to UTF16 (String type). This can happen when using Npgsql[1].
> 
> .NET code:
> ----
> Encoding e = Encoding.GetEncoding("shift_jis",

AFAIK generally Shift_jis and CP932 have different character sets.  I
don't know about .Net but doesn't CP932 work in that case?
Specifically, "Encoding.GetEncoding(932)".  There must a way to deal
with that characters since they are in CP932.

> My customers have difficulty dealing with SJIS code in Windows 
> applications because of this difference in conversion rules. 
> They are migrating from Oracle and many of the applications are 
> written for the SJIS environment.
> 
> The rules for converting from Unicode to characters that are 
> duplicated in SJIS seem to be as follows in Windows[2]: 
> 
> 1. If the character is in both JIS X 0208 and NEC special characters, 
>    use the code point of JIS X 0208.
> 2. If the character is in both NEC special characters and IBM selected 
>    characters, use the code point of NEC special characters.
> 3. If the character is in both IBM selected characters and 
>    NEC selected-IBM extended characters, use the code point of 
>    IBM selected characters.

Mmm. I don't reach the original document by Microsoft pointed from
[2]. Could you tell me an alternative URL?  (Goole didn't offer usable
info by kb170559 or somethig like)

> However, the rules for converting from Unicode to SJIS in PostgreSQL 
> seem to differ from the above second rule.
> SJIS codepoints corresponding to the second rule are listed below:
> - "NEC special characters" : 0x8754 - 0x875D, 0x8782, 0x8784, 0x878A
> - "IBM selected characters": 0xFA4A - 0xFA53, 0xFA59, 0xFA5A, 0xFA58
>
> In src/backend/utils/mb/Unicode/UCS_to_SJIS.pl, @reject_sjis array 
> defines the not used code points when converting Unicode to SJIS.
> According to the second rule above, the @reject_sjis array must contain 
> "IBM selected characters", but it currently contains "NEC special
> characters".

Anyway it is not in the public standard and at most that "rule" is a
recommendation. So it's not the case we "must" change the conversion
table following the "rule".

FYI, the following range of SJIS character codes are *excluded* while
unicode->sjis conversion. They are not only NEC/IBM extension
characters.

ed40 - eefc : so-called "NEC extension"
              uses  fa40 - fc40 (IBM extension) instead.
8754 - 875d : numbers with circle, and upper roman numbers
              uses fa4a - fa53 instead.
878a, 8782, 8784, fa5b, fa54: some japanese combined characters "No." "(株)"...
              uses fa58, fa59, fa5a, 81e6, 879a, 81ca
8790 - 8792 : math symbols, uses 81e0, 81df, 81e7
8795 - 8797 : ditto, 81e3, 81db, 81da
879a - 879c : ditto, 879a, 81bf, 81be

> The current PostgreSQL rules for converting duplicate definition characters
> 
> seems to be introduced by 5735c4cf3d059914e2b9d294203aa06fb2c4ac75, 
> back in 2001, but I could not be found reason for it in past mailing list
> logs. 
> I think this conversion difference is a bug, 
> but is it a rule with some clear reason?

I don't know about a clear rason for the current conversion, but it is
a reason for *not* changing the conversion table that we had no
complaint about the conversion for more than ten years. Because
changing that tables could cause problems elsewhere.

> [1] https://www.npgsql.org/
> [2] https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-cp932.html

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> This is not a bug, but the designed behavior. But we could change that
> conversion table if a plausible reasoning is raised.

Given how long our current behavior has stood, I think it'd have to be
a pretty convincing argument.  As you say, there'd be some serious
compatibility costs to changing that behavior.

IIUC, SJIS<->Unicode conversions have always been a squishy thing
because of inconsistencies between the various versions of "SJIS".
I'm not seeing a good reason we should regard Windows' behavior as
authoritative here.

I'm not saying I can't be convinced, but "Microsoft does it that
way" isn't enough to convince me.

            regards, tom lane



Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows

From
Kyotaro Horiguchi
Date:
At Thu, 08 Sep 2022 22:58:31 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> > This is not a bug, but the designed behavior. But we could change that
> > conversion table if a plausible reasoning is raised.
> 
> Given how long our current behavior has stood, I think it'd have to be
> a pretty convincing argument.  As you say, there'd be some serious
> compatibility costs to changing that behavior.
> 
> IIUC, SJIS<->Unicode conversions have always been a squishy thing
> because of inconsistencies between the various versions of "SJIS".
> I'm not seeing a good reason we should regard Windows' behavior as
> authoritative here.
> 
> I'm not saying I can't be convinced, but "Microsoft does it that
> way" isn't enough to convince me.

Yeah, it is more or less I meant.  And I suspect that the problem that
his customers are complaining is not caused by our specific conversion
table.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows

From
"egashira.yusuke@fujitsu.com"
Date:
Hi,

Thank you replying.

I understand that the difference between these conversion rules is not bug.
Unfortunately, given the many variations of SJIS, we also realized that
matching a translation table to Microsoft's "recommended" rule would
likely cause other problem to happen.
I have been looking for original documents such as kb170559 that Microsoft
changed the URL or stopped publishing, but I also couldn't find them...
Therefore, I agree with you that PostgreSQL should not suddenly change
its conversion rules.

However, I still think it is problem that PostgreSQL returns some characters
which not able to be used in some Windows environment.
Would it be a reasonable solution to this problem to have the user create
a map file with the conversion rules changed and add the conversion by
CREATE CONVERSION ?


> AFAIK generally Shift_jis and CP932 have different character sets.  I
> don't know about .Net but doesn't CP932 work in that case?
> Specifically, "Encoding.GetEncoding(932)".  There must a way to deal
> with that characters since they are in CP932.

Unfortunately, "shift_jis" is the name of "CP932" in .NET[1], so the same
exception occurs for "Encoding.GetEncoding(932)".



> FYI, the following range of SJIS character codes are *excluded* while
> unicode->sjis conversion. They are not only NEC/IBM extension
> characters.
>
> ed40 - eefc : so-called "NEC extension"
>               uses  fa40 - fc40 (IBM extension) instead.
> 8754 - 875d : numbers with circle, and upper roman numbers
>               uses fa4a - fa53 instead.
> 878a, 8782, 8784, fa5b, fa54: some japanese combined characters "No." "(株)"...
>               uses fa58, fa59, fa5a, 81e6, 879a, 81ca
> 8790 - 8792 : math symbols, uses 81e0, 81df, 81e7
> 8795 - 8797 : ditto, 81e3, 81db, 81da
> 879a - 879c : ditto, 879a, 81bf, 81be

Yes, I understand this exclude rules describes the conversion rule for SJIS
duplicated characters in PostgreSQL. In my understanding, characters
related to duplicate characters included in SJIS are as follows[2].
- NEC special characters(Row 13)                    : 8740 - 879c
- NEC selected-IBM extended characters(Row 89 - 92) : ed40 - eefc
- IBM selected characters(Row 115 - 119)            : fa40 - fc4b

The excluding rule of PostgreSQL seems to be match the Microsoft's
recommended rule except for "NEC special characters (Row 13) and
IBM selected characters(Row 115 - 119)" rule.


> At Thu, 08 Sep 2022 22:58:31 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> > Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> > > This is not a bug, but the designed behavior. But we could change that
> > > conversion table if a plausible reasoning is raised.
> >
> > Given how long our current behavior has stood, I think it'd have to be
> > a pretty convincing argument.  As you say, there'd be some serious
> > compatibility costs to changing that behavior.
> >
> > IIUC, SJIS<->Unicode conversions have always been a squishy thing
> > because of inconsistencies between the various versions of "SJIS".
> > I'm not seeing a good reason we should regard Windows' behavior as
> > authoritative here.
> >
> > I'm not saying I can't be convinced, but "Microsoft does it that
> > way" isn't enough to convince me.
>
> Yeah, it is more or less I meant.  And I suspect that the problem that
> his customers are complaining is not caused by our specific conversion
> table.

I think the behavior of Windows obscures the problem of different
conversion rules. I attached the text file extracted 2-byte characters from
CP932.TXT[3]. When this is displayed using notepad.exe or
type command on cmd.exe, all characters are displayed in readable form.
However, when we save the duplicate definition characters displayed
on notepad to a file, they are implicitly converted to
"Microsoft Recommended Code Points". So, my problem is probably
a corner case.

My customer used a third-party text editor instead of notepad and claimed
that some duplicate definition characters could not be displayed.
Npgsql works with the client_encoding=utf8 setting by default, however,
there was a customer who wanted to use client_encoding=sjis, and
the encoding problem came to light.
Of course, both cases can be treat as third-party editors or .NET issues.
However, I thought that this might be a bug because those problems would
not have occurred if PostgreSQL convert the characters via the Microsoft's
recommended conversion rules, and the reason of the current PostgreSQL
conversion rules was not clear.
At least if the reason of the PostgreSQL's current conversion rule is clear,
it will help us to explain to the users.


[1] https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding?view=net-6.0#list-of-encodings
[2] https://en.wikipedia.org/wiki/Code_page_932_(Microsoft_Windows)#Double-byte_character_differences
[3] http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT


Regards.

Yusuke, Egashira

Attachment

Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows

From
Kyotaro Horiguchi
Date:
At Fri, 9 Sep 2022 12:22:33 +0000, "egashira.yusuke@fujitsu.com" <egashira.yusuke@fujitsu.com> wrote in 
> However, I still think it is problem that PostgreSQL returns some characters 
> which not able to be used in some Windows environment.
> Would it be a reasonable solution to this problem to have the user create 
> a map file with the conversion rules changed and add the conversion by 
> CREATE CONVERSION ?

The best way nowadays would be to move the entire system to unicode.
Or doesn't it work to let the .Net application to convert UTF-8 into
SJIS locally?

> > AFAIK generally Shift_jis and CP932 have different character sets.  I
> > don't know about .Net but doesn't CP932 work in that case?
> > Specifically, "Encoding.GetEncoding(932)".  There must a way to deal
> > with that characters since they are in CP932.
> 
> Unfortunately, "shift_jis" is the name of "CP932" in .NET[1], so the same 
> exception occurs for "Encoding.GetEncoding(932)".

Wow.. MS uses shift_jis as mere an alias of its variant of CP932 [1]
(MS932?).. It's not Shift_JIS nor even CP932 (at least in the decoding
direction)..

> I think the behavior of Windows obscures the problem of different 
> conversion rules. I attached the text file extracted 2-byte characters from 
> CP932.TXT[3]. When this is displayed using notepad.exe or 
> type command on cmd.exe, all characters are displayed in readable form.
> However, when we save the duplicate definition characters displayed 
> on notepad to a file, they are implicitly converted to 
> "Microsoft Recommended Code Points". So, my problem is probably 
> a corner case.

.Net seems less robust than notepad.exe.  I didn't find a way to
create a custom encoding on .Net framework.  (But I don't think that
is the way to go.)

> My customer used a third-party text editor instead of notepad and claimed 
> that some duplicate definition characters could not be displayed.
> Npgsql works with the client_encoding=utf8 setting by default, however, 
> there was a customer who wanted to use client_encoding=sjis, and 
> the encoding problem came to light.

Ah.. It's nowadays seldom seen, especially about use of level-3 or
more rarely-used characters.  Anyway I don't see a reason for utf8 not
being usable as wire-encoding. If the combination of .Net unicode
decoder and sjis encoder works, wouldn't that problem be gone?  I
believe .Net sjis encoder must yield the desired result.

Goole showed me some complaints about .Npgsql and SJIS, but many of
them came from the default encodig being not utf8 at that time and
some of them are about broken error messages in an unexpected
encoding..

> Of course, both cases can be treat as third-party editors or .NET issues.
> However, I thought that this might be a bug because those problems would 
> not have occurred if PostgreSQL convert the characters via the Microsoft's 
> recommended conversion rules, and the reason of the current PostgreSQL 
> conversion rules was not clear.
> At least if the reason of the PostgreSQL's current conversion rule is clear, 
> it will help us to explain to the users.

The cause of the trouble is that the .Net's specific implement of
CP932 decoder is not actually following CP932; it doesn't accept some
valid characters.  The editor does the same, too. So the correct
measure for this situation seems like to convert the texts (in SJIS or
UTF-8, as mentioned above) into the special encoding following the
MS's recommendation no longer available for some reason.

As mentioned before, one of the reasons for the current PostgreSQL's
SJIS mapping is that the precedence between duplicate characters is
not defined in the standard, in other words, it is implementation
dependent. Thus it is valid to arbitrarily define the mapping as far
as it covers all characters.  It was more than a decade ago so I don't
know the principle for the mapping, though. But it seems like putting
precedence to characters in the IBM extension area.

The reason we don't change it is it's now sufficiently legacy and the
lack of complaint until it became legacy despite of (I believe) a
certain amount of use cases. It's being leagcy suggests there may be
use cases where that conversion is expected.

[1] (Japanese doc) https://docs.microsoft.com/ja-jp/dotnet/api/system.text.encoding?view=net-6.0

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center