Re: Errors in our encoding conversion tables - Mailing list pgsql-hackers

From Albe Laurenz
Subject Re: Errors in our encoding conversion tables
Date
Msg-id A737B7A37273E048B164557ADEF4A58B50FECB63@ntex2010i.host.magwien.gv.at
In response to Errors in our encoding conversion tables  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Errors in our encoding conversion tables  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane wrote:
> There's a discussion over at
> http://www.postgresql.org/message-id/flat/2sa.Dhu5.1hk1yrpTNFy.1MLOlb@seznam.cz
> of an apparent error in our WIN1250 -> LATIN2 conversion.  I looked into this
> and found that indeed, the code will happily translate certain characters
> for which there seems to be no justification.  I made up a quick script
> that would recompute the conversion tables in latin2_and_win1250.c from
> the Unicode mapping files in src/backend/utils/mb/Unicode, and what it
> computes is shown in the attached diff.  (Zeroes in the tables indicate
> codes with no translation, for which an error should be thrown.)
> 
> Having done that, I thought it would be a good idea to see if we had any
> other conversion tables that weren't directly based on the Unicode data.
> The only ones I could find were in cyrillic_and_mic.c, and those seem to
> be absolutely filled with errors, to the point where I wonder if they were
> made from the claimed encodings or some other ones.  The attached patch
> recomputes those from the Unicode data, too.
> 
> None of this data seems to have been touched since Tatsuo-san's original
> commit 969e0246, so it looks like we simply didn't vet that submission
> closely enough.
> 
> I have not attempted to reverify the files in utils/mb/Unicode against the
> original Unicode Consortium data, but maybe we ought to do that before
> taking any further steps here.
> 
> Anyway, what are we going to do about this?  I'm concerned that simply
> shoving in corrections may cause problems for users.  Almost certainly,
> we should not back-patch this kind of change.

Thanks for picking this up.

I agree with your proposed fix; the only thing that makes me uncomfortable
is that you get error messages like:

ERROR:  character with byte sequence 0x96 in encoding "WIN1250" has no equivalent in encoding "MULE_INTERNAL"

which is a bit misleading.
But the main thing is that no corrupt data can be entered.
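The mapping gap behind that error can be illustrated with Python's standard codecs, which are built from the same Unicode Consortium mapping files (a sketch, not the PostgreSQL code path itself): byte 0x96 in WIN1250 is U+2013 (EN DASH), a character that simply has no slot in LATIN2.

```python
# Byte 0x96 in WIN1250 (cp1250) maps to U+2013 EN DASH ...
ch = bytes([0x96]).decode("cp1250")
assert ch == "\u2013"

# ... but LATIN2 (ISO-8859-2) has no en dash, so a faithful
# conversion must fail rather than substitute a wrong character.
try:
    ch.encode("iso-8859-2")
    has_equivalent = True
except UnicodeEncodeError:
    has_equivalent = False

print(has_equivalent)  # False: the conversion has no valid target
```

The old table silently translated such bytes anyway, which is exactly the corruption the patch prevents.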

I can understand the reluctance to back-patch; nobody likes his
application to suddenly fail after a minor database upgrade.

However, the people whose applications would break if this were back-patched
are exactly the people who will certainly run into trouble when they
a) upgrade to a release where this is fixed, or
b) try to convert their database to, say, UTF8.

The least we should do is put a fat warning into the release notes
of the first version where this is fixed, along with some guidance on what
to do (though I am afraid there is not much more helpful advice than
"If your database encoding is X and data have been entered with client_encoding Y,
fix your data in the old system").

But I think that this fix should be applied to 9.6.
PostgreSQL has a strong reputation for being strict about correct encoding
(not saying that everybody appreciates that), and I think we shouldn't mar
that reputation.

Yours,
Laurenz Albe
