Thread: Unicode mapping scripts cleanup
Here is a series of patches to clean up the Unicode mapping script business in src/backend/utils/mb/Unicode/. It overlaps with the perlcritic work that I recently wrote about, except that these pieces are not strictly related to Perl, but wrong comments, missing makefile pieces, and such. I discovered that some of the source files that one is supposed to download don't exist anymore or are labeled obsolete. Also, running the scripts produces slight differences in the output. So apparently, the CJK to Unicode mappings are still evolving and should be updated occasionally. Next steps would be to commit some or all of these differences after additional verification, and then update the scripts to use whatever the non-obsolete mapping sources are supposed to be.
Attachment
- 0001-UCS_to_most.pl-Make-executable-for-consistency-with-.patch
- 0002-Fix-comments.patch
- 0003-Remove-manually-added-header-comments-from-generated.patch
- 0004-Add-Unicode-map-generation-scripts-as-rule-prerequis.patch
- 0005-Add-missing-rules-related-to-EUC_JIS_2004-and-SHIFT_.patch
- 0006-Make-some-adjustments-in-variable-assignments.patch
- 0007-Add-prerequisite-for-KOI8-U.TXT.patch
- 0008-Add-rules-to-download-raw-mapping-files.patch
- 0009-Make-spacing-and-punctuation-consistent.patch
- 0010-Turn-off-test-mode-by-default.patch
On Tue, Sep 1, 2015 at 5:13 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> So apparently, the CJK to Unicode mappings are still evolving and should be updated occasionally. Next steps would be to commit some or all of these differences after additional verification, and then update the scripts to use whatever the non-obsolete mapping sources are supposed to be.

Would that pose a problem for databases which have data in them already using the old mappings?

-- greg
> On Tue, Sep 1, 2015 at 5:13 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> So apparently, the CJK to Unicode mappings are still evolving and should be updated occasionally. Next steps would be to commit some or all of these differences after additional verification, and then update the scripts to use whatever the non-obsolete mapping sources are supposed to be.
>
> Would that pose a problem for databases which have data in them already using the old mappings?

I think so. We must be very careful updating the maps. Adding new mapping data would cause fewer problems, but replacing existing mappings would definitely be a big problem for users.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
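To make the distinction between added and replaced mappings concrete, here is a minimal Python sketch that sorts the differences between an old and a new map into three groups. It assumes each map has already been loaded into a dict from source code to target code; the function name and the toy values are hypothetical and are not taken from any real .map file or from the scripts in src/backend/utils/mb/Unicode/.

```python
def classify_map_changes(old_map, new_map):
    """Split the differences between two code maps into added, removed, changed."""
    # Codes present only in the new map: existing conversions are unaffected.
    added = {k: new_map[k] for k in new_map.keys() - old_map.keys()}
    # Codes present only in the old map: conversions that would stop working.
    removed = {k: old_map[k] for k in old_map.keys() - new_map.keys()}
    # Codes present in both maps but with a different target: these alter
    # what existing data converts to.
    changed = {
        k: (old_map[k], new_map[k])
        for k in old_map.keys() & new_map.keys()
        if old_map[k] != new_map[k]
    }
    return added, removed, changed


# Toy data only (hypothetical codes).
old = {0x01: 0x41, 0x02: 0x42}
new = {0x01: 0x61, 0x02: 0x42, 0x03: 0x43}
added, removed, changed = classify_map_changes(old, new)
```

In this toy case, 0x03 is a pure addition, while 0x01 is a replacement of the kind the message above warns about.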
> I discovered that some of the source files that one is supposed to download don't exist anymore or are labeled obsolete. Also, running the scripts produces slight differences in the output. So apparently, the CJK to Unicode mappings are still evolving and should be updated occasionally. Next steps would be to commit some or all of these differences after additional verification, and then update the scripts to use whatever the non-obsolete mapping sources are supposed to be.

Some of the maps were "hand tweaked" from the output of the script, for example utf8_to_sjis.map. See git log for more details. This is because part of the source file was incomplete or inappropriate. We also needed to compromise when creating a mapping between some local encodings (for example SJIS) and Unicode, because the source mapping file does not guarantee round-trip conversion.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
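As an illustration of the round-trip concern raised above, the following Python sketch lists the local codes that do not survive a local -> Unicode -> local conversion. It assumes the two directions are available as plain dicts; the function name and the code values are hypothetical toy data, not entries from any real mapping file.

```python
def round_trip_failures(to_unicode, from_unicode):
    """Return local codes that do not survive local -> Unicode -> local."""
    failures = {}
    for local, ucs in to_unicode.items():
        # None means the Unicode code point has no reverse mapping at all.
        back = from_unicode.get(ucs)
        if back != local:
            failures[local] = (ucs, back)
    return failures


# Toy data only: two local codes share one Unicode code point,
# so at most one of them can round-trip.
to_unicode = {0xA1: 0x2225, 0xA2: 0x2225, 0xA3: 0x00A3}
from_unicode = {0x2225: 0xA1, 0x00A3: 0xA3}
failures = round_trip_failures(to_unicode, from_unicode)
```

Here 0xA2 is reported because converting it to Unicode and back yields 0xA1. A check like this cannot say which choice is right; it only shows where a compromise has to be made.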
On 9/1/15 7:27 PM, Tatsuo Ishii wrote:
>> On Tue, Sep 1, 2015 at 5:13 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>>> So apparently, the CJK to Unicode mappings are still evolving and should be updated occasionally. Next steps would be to commit some or all of these differences after additional verification, and then update the scripts to use whatever the non-obsolete mapping sources are supposed to be.
>>
>> Would that pose a problem for databases which have data in them already using the old mappings?
>
> I think so. We must be very careful updating the maps. Adding new mapping data would cause fewer problems, but replacing existing mappings would definitely be a big problem for users.

Note that I'm not actually proposing to change the mappings. I just want to get the scripts into working order, to put us into a position to consider changes if necessary.

That said, I'm not sure what the problem with changes would be. The data in the databases doesn't change. You just see different data coming out. It is in the nature of encoding conversion that you don't get the original data, but an approximation.

Then again, I don't have any knowledge about how to handle such changes. But the fact that the standards bodies are still making changes indicates that such changes are to be expected and should be handled. I think this is similar to time zone changes, and also similar in different ways to collation changes.
>> I think so. We must be very careful updating the maps. Adding new mapping data would cause fewer problems, but replacing existing mappings would definitely be a big problem for users.
>
> Note that I'm not actually proposing to change the mappings. I just want to get the scripts into working order, to put us into a position to consider changes if necessary.
>
> That said, I'm not sure what the problem with changes would be. The data in the databases doesn't change. You just see different data coming out. It is in the nature of encoding conversion that you don't get the original data, but an approximation.

I don't buy the argument "users should accept the behavior change because data inside PostgreSQL does not change". I think we should care about the user's application as a whole.

> Then again, I don't have any knowledge about how to handle such changes. But the fact that the standards bodies are still making changes indicates that such changes are to be expected and should be handled. I think this is similar to time zone changes, and also similar in different ways to collation changes.

The question here is: as far as I know, the encoding mappings are *not* part of the Unicode standard, nor of any other standard, so why do we need to strictly follow the mapping data at the cost of application compatibility?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On Tue, Sep 15, 2015 at 9:00 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> Then again, I don't have any knowledge about how to handle such changes. But the fact that the standards bodies are still making changes indicates that such changes are to be expected and should be handled. I think this is similar to time zone changes, and also similar in different ways to collation changes.
>
> The question here is: as far as I know, the encoding mappings are *not* part of the Unicode standard, nor of any other standard, so why do we need to strictly follow the mapping data at the cost of application compatibility?

What if we discovered that one of our mappings was wrong? Suppose that there is some encoding where the Unicode character "a" is inadvertently mapped to the letter "b" in some other character set, and "b" is mapped to "a". I imagine that anyone using that encoding would want this fixed; it's a bug.

Other cases might be less clear. The cost of changing the mappings should always be compared against the benefit. But it might still be the right thing to do in some cases.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> What if we discovered that one of our mappings was wrong? Suppose that there is some encoding where the Unicode mapping for "a" is inadvertently mapped to the letter "b" in some other character set, and "b" is mapped to "a". I imagine that anyone using that encoding would want this fixed; it's a bug.

I am not against fixing the mapping if it *clearly* includes a bug. However, we must be very careful before deciding whether it's really a bug or not.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Hi,

On 2015-09-01 00:13:07 -0400, Peter Eisentraut wrote:
> Here is a series of patches to clean up the Unicode mapping script business in src/backend/utils/mb/Unicode/. It overlaps with the perlcritic work that I recently wrote about, except that these pieces are not strictly related to Perl, but wrong comments, missing makefile pieces, and such.

I looked through the patches, and afaics they're generally a good idea. And they're all, IIUC, independent of us applying or not applying updates. So why don't we go ahead with these changes?

I've marked this as returned-with-feedback for now, since there hasn't been much progress lately.

Greetings,
Andres Freund
On 9/1/15 12:13 AM, Peter Eisentraut wrote:
> Here is a series of patches to clean up the Unicode mapping script business in src/backend/utils/mb/Unicode/.

I never committed the last of these patches, which has the download locations of the files. I have updated this a bit now and propose it here again.

I have also added download locations for the source files we do have in git. I wonder why we ship these and none of the other ones:

845974 gb-18030-2000.xml
324237 euc-jis-2004-std.txt
319198 sjis-0213-2004-std.txt

I recall it might have been license issues with the other files.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services