Thread: Encoding issues
Receiving a request to add ISO 8859-15 and 16, I reviewed the multibyte
support code and found several errors in it.

1) There is a confusion between "LATIN5" and ISO 8859-5. LATIN5 is not
   ISO 8859-5, but is actually ISO 8859-9. Should we rename LATIN5 to
   "ISO8859-5" (or whatever) as the encoding name? I think we should.
   For your information, here is the correct mapping between ISO
   8859-n and LATINn.

   ISO 8859-1    LATIN1
   ISO 8859-2    LATIN2
   ISO 8859-3    LATIN3
   ISO 8859-4    LATIN4
   ISO 8859-9    LATIN5
   ISO 8859-10   LATIN6

2) The leading characters for some Cyrillic charsets are wrong.

   Currently they are defined as:

   #define LC_KOI8_R       0x8c    /* Cyrillic KOI8-R */
   #define LC_KOI8_U       0x8c    /* Cyrillic KOI8-U */
   #define LC_ISO8859_5    0x8d    /* ISO8859 Cyrillic */

   These should be:

   #define LC_KOI8_R       0x8b    /* Cyrillic KOI8-R */
   #define LC_KOI8_U       0x8b    /* Cyrillic KOI8-U */
   #define LC_ISO8859_5    0x8c    /* ISO8859 Cyrillic */

   Correcting them would affect users who store their data in the
   database using MULE internal code. I think very few people use
   MULE internal code, so we could correct them for 7.2.

   Comments?

BTW, should we support ISO 8859-6 and beyond for 7.2? There have been
some requests to do that. Supporting them is actually trivial work,
about a one-day job. The harder part is writing conversion functions
between encodings. However, I guess there is very little demand for
that. If so, we could omit the conversion capability for 7.2.
Comments?
--
Tatsuo Ishii
> 1) There is a confusion between "LATIN5" and ISO 8859-5. LATIN5 is not
>    ISO 8859-5, but is actually ISO 8859-9. Should we rename LATIN5 to
>    "ISO8859-5" (or whatever) as the encoding name? I think we should.
>    For your information, here are the correct mapping between ISO
>    8859-n and LATINn.
>
>    ISO 8859-1    LATIN1
>    ISO 8859-2    LATIN2
>    ISO 8859-3    LATIN3
>    ISO 8859-4    LATIN4
>    ISO 8859-9    LATIN5
>    ISO 8859-10   LATIN6

I just found additions:

     ISO 8859-13   LATIN7
     ISO 8859-14   LATIN8
     ISO 8859-15   LATIN9
--
Tatsuo Ishii
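The ISO 8859-n to LATINn correspondence listed in the two messages above can be written down as a small lookup table. This is only a sketch in C: the struct, the table, and the function name `latin_number_for_iso_part` are hypothetical and are not taken from the PostgreSQL sources; the table contains only the pairs named in the thread so far.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pair: n in "ISO 8859-n" and the matching n in "LATINn". */
struct latin_alias {
    int iso_part;
    int latin_no;
};

/* Only the pairs listed in the messages above. */
static const struct latin_alias latin_aliases[] = {
    {1, 1}, {2, 2}, {3, 3}, {4, 4}, {9, 5},
    {10, 6}, {13, 7}, {14, 8}, {15, 9},
};

/*
 * Look up the LATIN number for an ISO 8859 part.
 * Returns 1 and stores the number in *latin_no if the part has an alias
 * in the table, 0 otherwise (for example part 5, which is Cyrillic).
 */
int latin_number_for_iso_part(int iso_part, int *latin_no)
{
    size_t i;

    assert(latin_no != NULL);
    for (i = 0; i < sizeof latin_aliases / sizeof latin_aliases[0]; i++) {
        if (latin_aliases[i].iso_part == iso_part) {
            *latin_no = latin_aliases[i].latin_no;
            return 1;
        }
    }
    return 0;
}
```

With this table, part 9 resolves to LATIN5 while part 5 has no LATIN alias at all, which is exactly the confusion described in point 1.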
On Wed, Oct 10, 2001 at 03:40:25PM +0900, Tatsuo Ishii wrote:
> Receiving a request to add ISO 8859-15 and 16, I review the multibyte
> support code and found several errors in it.
>
> 1) There is a confusion between "LATIN5" and ISO 8859-5. LATIN5 is not
>    ISO 8859-5, but is actually ISO 8859-9. Should we rename LATIN5 to
>    "ISO8859-5" (or whatever) as the encoding name? I think we should.
>    For your information, here are the correct mapping between ISO
>    8859-n and LATINn.
>
>    ISO 8859-1    LATIN1
>    ISO 8859-2    LATIN2
>    ISO 8859-3    LATIN3
>    ISO 8859-4    LATIN4
>    ISO 8859-9    LATIN5
>    ISO 8859-10   LATIN6

 You are right. I have just looked at some old versions of PostgreSQL,
 and this confusion is there in some headers and comments too.

> 2) The leading characters for some Cyrillic charsets are wrong.
>
>    Currently they are defined as:
>
>    #define LC_KOI8_R       0x8c    /* Cyrillic KOI8-R */
>    #define LC_KOI8_U       0x8c    /* Cyrillic KOI8-U */
>    #define LC_ISO8859_5    0x8d    /* ISO8859 Cyrillic */
>
>    These should be:
>
>    #define LC_KOI8_R       0x8b    /* Cyrillic KOI8-R */
>    #define LC_KOI8_U       0x8b    /* Cyrillic KOI8-U */
>    #define LC_ISO8859_5    0x8c    /* ISO8859 Cyrillic */

 Again, it has been in the sources for a long time too (interesting is
 that we don't understand some bugreport).

>    The impact of correcting them would be for users who are storing
>    their data into database using MULE internal code. I think they
>    are quite few people using MULE internal code. So we could correct
>    them for 7.2.
>
>    Comments?

 I agree with you; making a release with known bugs is a dirty thing.

> BTW, should we support ISO 8859-6 and beyond for 7.2? There have been
> some requests to do that. Supporting them are actually trivial works,
> should be one day job. The harder part is writing conversion function
> between encodings. However, there is very few demands to do that, I
> guess. If so, we could ommit the conversion capability for 7.2.
> Comments?

 You will hear "we are in the feature freeze state.." :-)

        Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
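To illustrate why changing the leading characters matters for data already stored as MULE internal code, here is a minimal sketch in C. It assumes that a non-ASCII character from a single-byte charset is stored as the leading character followed by the character byte, and that ASCII bytes pass through unchanged. The helper names `to_mule` and `charset_for_new_lc`, and the `OLD_`/`NEW_` macro names, are hypothetical and are not from the PostgreSQL sources; only the numeric values come from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Values currently in the source, as quoted in the thread. */
#define OLD_LC_KOI8_R     0x8c
#define OLD_LC_ISO8859_5  0x8d

/* Values proposed in the thread (KOI8-U shares the KOI8-R value). */
#define NEW_LC_KOI8_R     0x8b
#define NEW_LC_ISO8859_5  0x8c

/*
 * Hypothetical converter: copy ASCII bytes as they are and prefix every
 * byte that has the high bit set with the leading character lc.
 * dst must have room for 2 * len bytes. Returns the bytes written.
 */
size_t to_mule(unsigned char lc, const unsigned char *src, size_t len,
               unsigned char *dst)
{
    size_t i, out = 0;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < len; i++) {
        if (src[i] & 0x80)
            dst[out++] = lc;
        dst[out++] = src[i];
    }
    return out;
}

/* Charset that a leading character names under the corrected values. */
const char *charset_for_new_lc(unsigned char lc)
{
    switch (lc) {
        case NEW_LC_KOI8_R:    return "KOI8-R";
        case NEW_LC_ISO8859_5: return "ISO8859-5";
        default:               return "unknown";
    }
}
```

Under these assumptions, KOI8-R text converted with the old value carries 0x8c as its leading character, and after the correction that same byte would be read as ISO8859-5. That is the impact on stored MULE internal code data mentioned in point 2.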
Tatsuo Ishii writes:

> BTW, should we support ISO 8859-6 and beyond for 7.2?

If possible we should.  Otherwise people might spread the word that
PostgreSQL is not ready for the Euro.

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
* Tatsuo Ishii <t-ishii@sra.co.jp> [011010 18:21]:
> Receiving a request to add ISO 8859-15 and 16, I review the multibyte
> support code and found several errors in it.
>
> 1) There is a confusion between "LATIN5" and ISO 8859-5. LATIN5 is not
>    ISO 8859-5, but is actually ISO 8859-9. Should we rename LATIN5 to
>    "ISO8859-5" (or whatever) as the encoding name? I think we should.
>    For your information, here are the correct mapping between ISO
>    8859-n and LATINn.
>
>    ISO 8859-1    LATIN1
>    ISO 8859-2    LATIN2
>    ISO 8859-3    LATIN3
>    ISO 8859-4    LATIN4
>    ISO 8859-9    LATIN5
>    ISO 8859-10   LATIN6

     ISO-8859-14   LATIN 8
     ISO-8859-15   LATIN 9 or LATIN 0
     ISO-8859-16   LATIN 10

:)

> 2) The leading characters for some Cyrillic charsets are wrong.
>
>    Currently they are defined as:
>
>    #define LC_KOI8_R       0x8c    /* Cyrillic KOI8-R */
>    #define LC_KOI8_U       0x8c    /* Cyrillic KOI8-U */
>    #define LC_ISO8859_5    0x8d    /* ISO8859 Cyrillic */
>
>    These should be:
>
>    #define LC_KOI8_R       0x8b    /* Cyrillic KOI8-R */
>    #define LC_KOI8_U       0x8b    /* Cyrillic KOI8-U */
>    #define LC_ISO8859_5    0x8c    /* ISO8859 Cyrillic */
>
>    The impact of correcting them would be for users who are storing
>    their data into database using MULE internal code. I think they
>    are quite few people using MULE internal code. So we could correct
>    them for 7.2.
>
>    Comments?
>
> BTW, should we support ISO 8859-6 and beyond for 7.2? There have been
> some requests to do that. Supporting them are actually trivial works,
> should be one day job. The harder part is writing conversion function
> between encodings. However, there is very few demands to do that, I
> guess. If so, we could ommit the conversion capability for 7.2.
> Comments?

I think ISO-8859-15 and 16 are important, if only because they are the
only two encodings which support the Euro (not speaking of Unicode, of
course!), and at least ISO-8859-15 has some official status in Western
Europe (on Unix systems at least... Windows users have their own table
where the Euro sign is stored somewhere else, I think at 0x80).

I have done the conversion for the mappings to and from Unicode, but
you could get the original tables at:

  http://www.unicode.org/Public/MAPPINGS/ISO8859/

(you can get ISO-8859-10, 13 and 14 there as well! 10 is supposed to
be for Greenlandic and Sámi, 13 for the Baltic rim, and 14 for Gaelic)

I just found the following link on Google, where you can see quite a
few charsets (it doesn't have -16, probably too new):

  http://www.kostis.net/charsets/

Patrice

--
Patrice Hédé
email: patrice hede à islande org
www  : http://www.islande.org/