Thread: Unicode mapping scripts cleanup

Unicode mapping scripts cleanup

From
Peter Eisentraut
Date:
Here is a series of patches to clean up the Unicode mapping script
business in src/backend/utils/mb/Unicode/.  It overlaps with the
perlcritic work that I recently wrote about, except that these pieces
are not strictly related to Perl; they fix wrong comments, missing
makefile pieces, and such.

I discovered that some of the source files that one is supposed to
download don't exist anymore or are labeled obsolete.  Also, running the
scripts produces slight differences in the output.  So apparently, the
CJK to Unicode mappings are still evolving and should be updated
occasionally.  Next steps would be to commit some or all of these
differences after additional verification, and then update the scripts
to use whatever the non-obsolete mapping sources are supposed to be.

Attachment

Re: Unicode mapping scripts cleanup

From
Greg Stark
Date:
On Tue, Sep 1, 2015 at 5:13 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>   So apparently, the
> CJK to Unicode mappings are still evolving and should be updated
> occasionally.  Next steps would be to commit some or all of these
> differences after additional verification, and then update the scripts
> to use whatever the non-obsolete mapping sources are supposed to be.

Would that pose a problem for databases which have data in them
already using the old mappings?

-- 
greg



Re: Unicode mapping scripts cleanup

From
Tatsuo Ishii
Date:
> On Tue, Sep 1, 2015 at 5:13 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>>   So apparently, the
>> CJK to Unicode mappings are still evolving and should be updated
>> occasionally.  Next steps would be to commit some or all of these
>> differences after additional verification, and then update the scripts
>> to use whatever the non-obsolete mapping sources are supposed to be.
> 
> Would that pose a problem for databases which have data in them
> already using the old mappings?

I think so. We must be very careful updating the maps. Adding new
mapping data would cause fewer problems, but replacing existing mappings
will definitely be a big problem for users.
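
To make the distinction between adding and replacing mappings concrete, here is a minimal sketch of how regenerated script output could be compared against a committed map. The function name and the table contents are hypothetical, chosen only for illustration; they are not taken from the actual map files or from the scripts in src/backend/utils/mb/Unicode/.

```python
def classify_map_changes(old_map, new_map):
    """Split the differences between two mapping tables into three groups."""
    # Codes that only the new table knows about (the lower-risk case).
    added = {k: new_map[k] for k in new_map.keys() - old_map.keys()}
    # Codes that the new table no longer maps at all.
    removed = {k: old_map[k] for k in old_map.keys() - new_map.keys()}
    # Codes present in both tables but mapped to a different target.
    replaced = {
        k: (old_map[k], new_map[k])
        for k in old_map.keys() & new_map.keys()
        if old_map[k] != new_map[k]
    }
    return added, removed, replaced


# Hypothetical toy tables: local-encoding code -> Unicode code point.
old_map = {0x8160: 0x301C, 0x8161: 0x2016}
new_map = {0x8160: 0xFF5E, 0x8161: 0x2016, 0x8790: 0x2252}

added, removed, replaced = classify_map_changes(old_map, new_map)
for code, (old, new) in sorted(replaced.items()):
    print(f"0x{code:04X}: U+{old:04X} -> U+{new:04X}")
```

Entries in `replaced` are the ones that would change what existing applications see, so they are the ones that would need individual review.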

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Unicode mapping scripts cleanup

From
Tatsuo Ishii
Date:
> I discovered that some of the source files that one is supposed to
> download don't exist anymore or are labeled obsolete.  Also, running the
> scripts produces slight differences in the output.  So apparently, the
> CJK to Unicode mappings are still evolving and should be updated
> occasionally.  Next steps would be to commit some or all of these
> differences after additional verification, and then update the scripts
> to use whatever the non-obsolete mapping sources are supposed to be.

Some of the maps were "hand tweaked" from the output of the script, for
example utf8_to_sjis.map. See git log for more details. This is because
part of the source file was incomplete or inappropriate. Also, we
needed to compromise while creating a mapping between some local
encodings (for example SJIS) and Unicode, because in the source
mapping file round-trip conversion is not guaranteed.
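
As a rough illustration of what "round-trip conversion is not guaranteed" means, the sketch below checks a pair of tables for local codes that do not come back unchanged after local -> Unicode -> local. The function name and the table contents are assumptions made up for this example; in particular, the case of two local codes sharing one Unicode code point is only one possible way a round trip can fail.

```python
def round_trip_failures(to_unicode, from_unicode):
    """Return local codes that do not survive local -> Unicode -> local."""
    failures = {}
    for code, ucs in to_unicode.items():
        back = from_unicode.get(ucs)  # None if there is no reverse mapping
        if back != code:
            failures[code] = (ucs, back)
    return failures


# Hypothetical toy tables: two local codes map to the same Unicode
# code point, so the reverse table can only pick one of them.
to_unicode = {0x81CA: 0x00AC, 0xEEF9: 0x00AC, 0x8140: 0x3000}
from_unicode = {0x00AC: 0x81CA, 0x3000: 0x8140}

failures = round_trip_failures(to_unicode, from_unicode)
for code, (ucs, back) in failures.items():
    print(f"0x{code:04X} -> U+{ucs:04X} -> 0x{back:04X}")
```

A check like this only reports where a compromise is needed; deciding which direction wins is the kind of hand tweak described above.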

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Unicode mapping scripts cleanup

From
Peter Eisentraut
Date:
On 9/1/15 7:27 PM, Tatsuo Ishii wrote:
>> On Tue, Sep 1, 2015 at 5:13 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>>>   So apparently, the
>>> CJK to Unicode mappings are still evolving and should be updated
>>> occasionally.  Next steps would be to commit some or all of these
>>> differences after additional verification, and then update the scripts
>>> to use whatever the non-obsolete mapping sources are supposed to be.
>>
>> Would that pose a problem for databases which have data in them
>> already using the old mappings?
> 
> I think so. We must be very careful updating the maps. Adding new
> mapping data would cause fewer problems, but replacing existing mappings
> will definitely be a big problem for users.

Note that I'm not actually proposing to change the mappings, I just want
to get the scripts into working order, to put us into a position to
consider changes if necessary.

That said, I'm not sure what the problem with changes would be.  The
data in the databases doesn't change.  You just see different data
coming out.  It is in the nature of encoding conversion that you don't
get the original data, but an approximation.  Then again, I don't have
any knowledge about how to handle such changes.  But the fact that the
standards bodies are still making changes indicates that such changes
are to be expected and should be handled.  I think this is similar to
time zone changes, and also similar in different ways to collation changes.
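
A small sketch of the point that the stored data stays the same while the converted output differs, assuming a simple table-driven conversion. The `convert` helper and both tables are hypothetical and much simpler than the real conversion code.

```python
def convert(stored_codes, table):
    """Convert stored local codes to a Unicode string using a mapping table."""
    return "".join(chr(table[code]) for code in stored_codes)


# Data at rest: never rewritten when the mapping table is updated.
stored = [0x8160]

# Hypothetical old and new tables that disagree on one code.
old_table = {0x8160: 0x301C}
new_table = {0x8160: 0xFF5E}

before = convert(stored, old_table)
after = convert(stored, new_table)
print(before == after)
```

An application that cached or compared the earlier result would see a mismatch after the update, even though nothing on disk changed.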




Re: Unicode mapping scripts cleanup

From
Tatsuo Ishii
Date:
>> I think so. We must be very careful updating the maps. Adding new
>> mapping data would cause fewer problems, but replacing existing mappings
>> will definitely be a big problem for users.
> 
> Note that I'm not actually proposing to change the mappings, I just want
> to get the scripts into working order, to put us into a position to
> consider changes if necessary.
> 
> That said, I'm not sure what the problem with changes would be.  The
> data in the databases doesn't change.  You just see different data
> coming out.  It is in the nature of encoding conversion that you don't
> get the original data, but an approximation.

I don't buy the argument that "users should accept the behavior change
because the data inside PostgreSQL does not change". I think we should
care about the user's application as a whole.

>  Then again, I don't have
> any knowledge about how to handle such changes.  But the fact that the
> standards bodies are still making changes indicates that such changes
> are to be expected and should be handled.  I think this is similar to
> time zone changes, and also similar in different ways to collation changes.

The question here is this: as far as I know, the encoding mappings are
*not* part of the Unicode standard, nor of any other standard, so why
do we need to strictly follow the mapping data at the cost of
application compatibility?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Unicode mapping scripts cleanup

From
Robert Haas
Date:
On Tue, Sep 15, 2015 at 9:00 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
>>  Then again, I don't have
>> any knowledge about how to handle such changes.  But the fact that the
>> standards bodies are still making changes indicates that such changes
>> are to be expected and should be handled.  I think this is similar to
>> time zone changes, and also similar in different ways to collation changes.
>
> The question here is this: as far as I know, the encoding mappings are
> *not* part of the Unicode standard, nor of any other standard, so why
> do we need to strictly follow the mapping data at the cost of
> application compatibility?

What if we discovered that one of our mappings was wrong?  Suppose
that there is some encoding where the Unicode mapping for "a" is
inadvertently mapped to the letter "b" in some other character set,
and "b" is mapped to "a".  I imagine that anyone using that encoding
would want this fixed; it's a bug.

Other cases might be less clear.  The cost of changing the mappings
should always be compared against the benefit.  But it might still be
the right thing to do in some cases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Unicode mapping scripts cleanup

From
Tatsuo Ishii
Date:
> What if we discovered that one of our mappings was wrong?  Suppose
> that there is some encoding where the Unicode mapping for "a" is
> inadvertently mapped to the letter "b" in some other character set,
> and "b" is mapped to "a".  I imagine that anyone using that encoding
> would want this fixed; it's a bug.

I am not against fixing the mapping if it *clearly* includes a
bug. However, we must be very careful when deciding whether it's
really a bug or not.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Unicode mapping scripts cleanup

From
Andres Freund
Date:
Hi,

On 2015-09-01 00:13:07 -0400, Peter Eisentraut wrote:
> Here is a series of patches to clean up the Unicode mapping script
> business in src/backend/utils/mb/Unicode/.  It overlaps with the
> perlcritic work that I recently wrote about, except that these pieces
> are not strictly related to Perl; they fix wrong comments, missing
> makefile pieces, and such.

I looked through the patches, and afaics they're generally a good
idea. And they're all, IIUC, independent of us applying or not applying
updates. So why don't we go ahead with these changes?

I've marked this as returned-with-feedback for now, since there hasn't
been much progress lately.

Greetings,

Andres Freund



Re: Unicode mapping scripts cleanup

From
Peter Eisentraut
Date:
On 9/1/15 12:13 AM, Peter Eisentraut wrote:
> Here is a series of patches to clean up the Unicode mapping script
> business in src/backend/utils/mb/Unicode/.

I never committed the last of these patches, which have the download
locations of the files.  I have updated this a bit now and propose it
here again.

I have also added download locations for the source files we do have in
git.  I wonder why we ship these and none of the other ones:

845974 gb-18030-2000.xml
324237 euc-jis-2004-std.txt
319198 sjis-0213-2004-std.txt

I recall it might have been license issues with the other files.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment