Re: Errors in our encoding conversion tables - Mailing list pgsql-hackers

From Tatsuo Ishii
Subject Re: Errors in our encoding conversion tables
Date
Msg-id 20151127.110027.1989081859519291674.t-ishii@sraoss.co.jp
In response to Errors in our encoding conversion tables  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Errors in our encoding conversion tables  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
> There's a discussion over at
> http://www.postgresql.org/message-id/flat/2sa.Dhu5.1hk1yrpTNFy.1MLOlb@seznam.cz
> of an apparent error in our WIN1250 -> LATIN2 conversion.  I looked into this
> and found that indeed, the code will happily translate certain characters
> for which there seems to be no justification.  I made up a quick script
> that would recompute the conversion tables in latin2_and_win1250.c from
> the Unicode mapping files in src/backend/utils/mb/Unicode, and what it
> computes is shown in the attached diff.  (Zeroes in the tables indicate
> codes with no translation, for which an error should be thrown.)
> 
> Having done that, I thought it would be a good idea to see if we had any
> other conversion tables that weren't directly based on the Unicode data.
> The only ones I could find were in cyrillic_and_mic.c, and those seem to
> be absolutely filled with errors, to the point where I wonder if they were
> made from the claimed encodings or some other ones.  The attached patch
> recomputes those from the Unicode data, too.
> 
> None of this data seems to have been touched since Tatsuo-san's original
> commit 969e0246, so it looks like we simply didn't vet that submission
> closely enough.
> 
> I have not attempted to reverify the files in utils/mb/Unicode against the
> original Unicode Consortium data, but maybe we ought to do that before
> taking any further steps here.
> 
> Anyway, what are we going to do about this?  I'm concerned that simply
> shoving in corrections may cause problems for users.  Almost certainly,
> we should not back-patch this kind of change.

I have started looking into it. I wonder how you created this part
of your patch:

*** 154,163 ****
  win12502mic(const unsigned char *l, unsigned char *p, int len)
  {
      static const unsigned char win1250_2_iso88592[]= {
!         0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87,
!         0x88, 0x89, 0xA9, 0x8B, 0xA6, 0xAB, 0xAE, 0xAC,
!         0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97,
!         0x98, 0x99, 0xB9, 0x9B, 0xB6, 0xBB, 0xBE, 0xBC,
          0xA0, 0xB7, 0xA2, 0xA3, 0xA4, 0xA1, 0x00, 0xA7,
          0xA8, 0x00, 0xAA, 0x00, 0x00, 0xAD, 0x00, 0xAF,
          0xB0, 0x00, 0xB2, 0xB3, 0xB4, 0x00, 0x00, 0x00,
--- 154,163 ----
  win12502mic(const unsigned char *l, unsigned char *p, int len)
  {
      static const unsigned char win1250_2_iso88592[]= {
!         0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!         0x00, 0x00, 0xA9, 0x00, 0xA6, 0xAB, 0xAE, 0xAC,
!         0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!         0x00, 0x00, 0xB9, 0x00, 0xB6, 0xBB, 0xBE, 0xBC,
          0xA0, 0xB7, 0xA2, 0xA3, 0xA4, 0xA1, 0x00, 0xA7,
          0xA8, 0x00, 0xAA, 0x00, 0x00, 0xAD, 0x00, 0xAF,
          0xB0, 0x00, 0xB2, 0xB3, 0xB4, 0x00, 0x00, 0x00,

In the above, you seem to disable the conversion of win1250 0x96
to ISO-8859-2 on the basis of the Unicode mapping files in
src/backend/utils/mb/Unicode. But the corresponding mapping file
(iso8859_2_to_utf8.map) does include the following entry:
 {0x0096, 0xc296},

How do you know 0x96 should be removed from the conversion?
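
To make the question concrete, here is a minimal sketch of how a
WIN1250 -> LATIN2 table entry could be derived by composing two
Unicode mappings. It is not the script used for the patch. The struct,
the function name derive_win1250_to_latin2 and the two-entry map
excerpts are all hypothetical; the WIN1250 values are taken from the
standard Windows-1250 code page (0x8A = U+0160, 0x96 = U+2013), not
from this message. Under that assumption, the presence of
{0x0096, 0xc296} in the ISO-8859-2 map would not help, because WIN1250
0x96 would not map to the same Unicode character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One mapping entry: single-byte code -> UTF-8 bytes packed into an
 * integer, in the same style as the {0x0096, 0xc296} entry above. */
typedef struct
{
    uint32_t code;
    uint32_t utf8;
} map_entry;

/* Excerpts only, assumed for illustration; not the full map files. */
static const map_entry win1250_to_utf8[] = {
    {0x8A, 0xc5a0},     /* U+0160 LATIN CAPITAL LETTER S WITH CARON */
    {0x96, 0xe28093},   /* U+2013 EN DASH */
};

static const map_entry iso8859_2_to_utf8[] = {
    {0x96, 0xc296},     /* U+0096, a C1 control character */
    {0xA9, 0xc5a0},     /* U+0160 LATIN CAPITAL LETTER S WITH CARON */
};

#define MAP_LEN(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Return the LATIN2 byte whose UTF-8 form equals that of the given
 * WIN1250 byte, or 0x00 when either lookup fails (meaning "no
 * translation", as in the patched tables).
 */
static unsigned char
derive_win1250_to_latin2(unsigned char src)
{
    size_t i, j;

    assert(src >= 0x80);    /* the table only covers the high half */

    for (i = 0; i < MAP_LEN(win1250_to_utf8); i++)
    {
        if (win1250_to_utf8[i].code != src)
            continue;
        for (j = 0; j < MAP_LEN(iso8859_2_to_utf8); j++)
        {
            if (iso8859_2_to_utf8[j].utf8 == win1250_to_utf8[i].utf8)
                return (unsigned char) iso8859_2_to_utf8[j].code;
        }
    }
    return 0x00;
}
```

With these excerpts, derive_win1250_to_latin2(0x8A) yields 0xA9, which
matches the third entry of the second row in the diff, while
derive_win1250_to_latin2(0x96) yields 0x00.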

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


