Thread: Encoding issues

Encoding issues

From

Tatsuo Ishii

Date:

10 October 2001, 02:40:40

Having received a request to add ISO 8859-15 and 16, I reviewed the
multibyte support code and found several errors in it.

1) There is a confusion between "LATIN5" and ISO 8859-5. LATIN5 is not
   ISO 8859-5, but is actually ISO 8859-9. Should we rename LATIN5 to
   "ISO8859-5" (or whatever) as the encoding name? I think we should.
   For your information, here is the correct mapping between ISO
   8859-n and LATINn.

   ISO 8859-1     LATIN1
   ISO 8859-2     LATIN2
   ISO 8859-3     LATIN3
   ISO 8859-4     LATIN4
   ISO 8859-9     LATIN5
   ISO 8859-10    LATIN6

2) The leading characters for some Cyrillic charsets are wrong.

Currently they are defined as:

#define LC_KOI8_R 0x8c /* Cyrillic KOI8-R */
#define LC_KOI8_U 0x8c /* Cyrillic KOI8-U */
#define LC_ISO8859_5 0x8d /* ISO8859 Cyrillic */

These should be:

#define LC_KOI8_R 0x8b /* Cyrillic KOI8-R */
#define LC_KOI8_U 0x8b /* Cyrillic KOI8-U */
#define LC_ISO8859_5 0x8c /* ISO8859 Cyrillic */

The impact of correcting them would be on users who store their data
in the database using MULE internal code. I think very few people are
using MULE internal code, so we could correct them for 7.2.
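
To make the impact concrete, here is a minimal sketch in C. It assumes that MULE internal code stores a high-bit byte of a single-byte charset as two bytes, the leading character followed by the byte itself, and leaves ASCII unchanged. That layout is an assumption of this sketch, and `mule_encode_byte` and `mule_leading_char` are hypothetical helpers, not functions from the PostgreSQL source.

```c
#include <assert.h>
#include <stddef.h>

/* Corrected leading characters, as proposed above. */
#define LC_KOI8_R     0x8b  /* Cyrillic KOI8-R */
#define LC_ISO8859_5  0x8c  /* ISO8859 Cyrillic */

/*
 * Hypothetical helper: store one byte of a single-byte charset.
 * Assumed layout: ASCII is stored as-is (1 byte); a high-bit byte is
 * stored as <leading character> <byte> (2 bytes).
 * Returns the number of bytes written to out.
 */
static size_t
mule_encode_byte(unsigned char lc, unsigned char c, unsigned char out[2])
{
    assert(out != NULL);
    if (c < 0x80)
    {
        out[0] = c;
        return 1;
    }
    out[0] = lc;
    out[1] = c;
    return 2;
}

/*
 * Hypothetical helper: return the leading character of a stored
 * sequence, or 0 if the sequence starts with an ASCII byte.
 */
static unsigned char
mule_leading_char(const unsigned char *s)
{
    assert(s != NULL);
    return (s[0] >= 0x80) ? s[0] : 0;
}
```

Under this assumption, KOI8-R data written with the old value 0x8c would carry the same leading character that the corrected definitions assign to ISO 8859-5, which is why only data already stored in MULE internal code is affected.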

Comments?

BTW, should we support ISO 8859-6 and beyond for 7.2? There have been
some requests to do that. Supporting them is actually trivial work,
and should be a one-day job. The harder part is writing conversion
functions between encodings. However, there is very little demand for
that, I guess. If so, we could omit the conversion capability for 7.2.
Comments?
--
Tatsuo Ishii

Re: Encoding issues

From

Tatsuo Ishii

Date:

10 October 2001, 03:10:50

> 1) There is a confusion between "LATIN5" and ISO 8859-5. LATIN5 is not
>    ISO 8859-5, but is actually ISO 8859-9. Should we rename LATIN5 to
>    "ISO8859-5" (or whatever) as the encoding name? I think we should.
>    For your information, here is the correct mapping between ISO
>    8859-n and LATINn.
> 
>    ISO 8859-1    LATIN1
>    ISO 8859-2    LATIN2
>    ISO 8859-3    LATIN3
>    ISO 8859-4    LATIN4
>    ISO 8859-9    LATIN5
>    ISO 8859-10    LATIN6

I just found additions:
   ISO 8859-13    LATIN7
   ISO 8859-14    LATIN8
   ISO 8859-15    LATIN9
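
The mappings listed so far in this thread could be kept in a small lookup table. The following is only a sketch in C; the `latin_alias` struct and the `latin_name_for_iso_part` function are hypothetical names chosen for illustration and do not come from the PostgreSQL source.

```c
#include <assert.h>
#include <stddef.h>

/* ISO 8859 part number and its LATINn alias, as listed in this thread. */
struct latin_alias
{
    int         iso_part;
    const char *latin_name;
};

static const struct latin_alias latin_aliases[] = {
    {1, "LATIN1"},
    {2, "LATIN2"},
    {3, "LATIN3"},
    {4, "LATIN4"},
    {9, "LATIN5"},      /* not ISO 8859-5, which is Cyrillic */
    {10, "LATIN6"},
    {13, "LATIN7"},
    {14, "LATIN8"},
    {15, "LATIN9"},
};

/*
 * Hypothetical helper: return the LATINn alias for an ISO 8859 part
 * number, or NULL if that part has no alias in the table above.
 */
static const char *
latin_name_for_iso_part(int iso_part)
{
    size_t i;

    assert(iso_part > 0);
    for (i = 0; i < sizeof(latin_aliases) / sizeof(latin_aliases[0]); i++)
    {
        if (latin_aliases[i].iso_part == iso_part)
            return latin_aliases[i].latin_name;
    }
    return NULL;
}
```

With this table, looking up part 9 yields "LATIN5", while part 5 yields NULL, which is exactly the distinction at issue in item 1.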
--
Tatsuo Ishii

Re: Encoding issues

From

Karel Zak

Date:

10 October 2001, 03:54:21

On Wed, Oct 10, 2001 at 03:40:25PM +0900, Tatsuo Ishii wrote:
> Having received a request to add ISO 8859-15 and 16, I reviewed the
> multibyte support code and found several errors in it.
> 
> 1) There is a confusion between "LATIN5" and ISO 8859-5. LATIN5 is not
>    ISO 8859-5, but is actually ISO 8859-9. Should we rename LATIN5 to
>    "ISO8859-5" (or whatever) as the encoding name? I think we should.
>    For your information, here is the correct mapping between ISO
>    8859-n and LATINn.
> 
>    ISO 8859-1    LATIN1
>    ISO 8859-2    LATIN2
>    ISO 8859-3    LATIN3
>    ISO 8859-4    LATIN4
>    ISO 8859-9    LATIN5
>    ISO 8859-10    LATIN6

You are right. I have just looked at some old versions of PostgreSQL,
and this confusion is in some headers and comments there too.
 
> 2) The leading characters for some Cyrillic charsets are wrong.
> 
> Currently they are defined as:
> 
> #define LC_KOI8_R    0x8c    /* Cyrillic KOI8-R */
> #define LC_KOI8_U    0x8c    /* Cyrillic KOI8-U */
> #define LC_ISO8859_5    0x8d    /* ISO8859 Cyrillic */
> 
> These should be:
> 
> #define LC_KOI8_R    0x8b    /* Cyrillic KOI8-R */
> #define LC_KOI8_U    0x8b    /* Cyrillic KOI8-U */
> #define LC_ISO8859_5    0x8c    /* ISO8859 Cyrillic */

Again, this has been in the sources for a long time too (it is
interesting that we didn't understand some bug report).

>     The impact of correcting them would be on users who store their
>     data in the database using MULE internal code. I think very few
>     people are using MULE internal code, so we could correct them
>     for 7.2.
> 
> Comments?

I agree with you; making a release with known bugs is a dirty thing.

> BTW, should we support ISO 8859-6 and beyond for 7.2? There have been
> some requests to do that. Supporting them is actually trivial work,
> and should be a one-day job. The harder part is writing conversion
> functions between encodings. However, there is very little demand for
> that, I guess. If so, we could omit the conversion capability for 7.2.
> Comments?
You will hear "we are in the feature freeze state.." :-)
   Karel

-- 
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

Re: Encoding issues

From

Peter Eisentraut

Date:

10 October 2001, 12:47:10

Tatsuo Ishii writes:

> BTW, should we support ISO 8859-6 and beyond for 7.2?

If possible we should.  Otherwise people might spread the word that
PostgreSQL is not ready for the Euro.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter

Re: Encoding issues

From

Patrice Hédé

Date:

10 October 2001, 14:04:08

* Tatsuo Ishii <t-ishii@sra.co.jp> [011010 18:21]:
> Having received a request to add ISO 8859-15 and 16, I reviewed the
> multibyte support code and found several errors in it.
> 
> 1) There is a confusion between "LATIN5" and ISO 8859-5. LATIN5 is not
>    ISO 8859-5, but is actually ISO 8859-9. Should we rename LATIN5 to
>    "ISO8859-5" (or whatever) as the encoding name? I think we should.
>    For your information, here is the correct mapping between ISO
>    8859-n and LATINn.
> 
>    ISO 8859-1  LATIN1
>    ISO 8859-2  LATIN2
>    ISO 8859-3  LATIN3
>    ISO 8859-4  LATIN4
>    ISO 8859-9  LATIN5
>    ISO 8859-10 LATIN6

ISO-8859-14 LATIN 8
ISO-8859-15 LATIN 9 or LATIN 0
ISO-8859-16 LATIN 10

:)

> 2) The leading characters for some Cyrillic charsets are wrong.
> 
> Currently they are defined as:
> 
> #define LC_KOI8_R    0x8c    /* Cyrillic KOI8-R */
> #define LC_KOI8_U    0x8c    /* Cyrillic KOI8-U */
> #define LC_ISO8859_5    0x8d    /* ISO8859 Cyrillic */
> 
> These should be:
> 
> #define LC_KOI8_R    0x8b    /* Cyrillic KOI8-R */
> #define LC_KOI8_U    0x8b    /* Cyrillic KOI8-U */
> #define LC_ISO8859_5    0x8c    /* ISO8859 Cyrillic */
> 
>     The impact of correcting them would be on users who store their
>     data in the database using MULE internal code. I think very few
>     people are using MULE internal code, so we could correct them
>     for 7.2.
> 
> Comments?
> 
> BTW, should we support ISO 8859-6 and beyond for 7.2? There have been
> some requests to do that. Supporting them is actually trivial work,
> and should be a one-day job. The harder part is writing conversion
> functions between encodings. However, there is very little demand for
> that, I guess. If so, we could omit the conversion capability for 7.2.
> Comments?

I think ISO 8859-15 and 16 are important, if only because they are the
only two encodings which support the Euro (not speaking of Unicode, of
course!), and at least ISO 8859-15 has some official status in
Western Europe (on Unix systems at least... Windows users have their
own table where the Euro sign is stored somewhere else, I think at
0x80).
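
As a rough illustration of the difference in C: ISO 8859-15 places the Euro sign (U+20AC) at 0xA4, and Windows-1252 places it at 0x80. The enum and the `euro_byte` function below are hypothetical names used only for this sketch, and only three charsets are covered.

```c
#include <assert.h>
#include <stddef.h>

/* Only the charsets needed for this illustration. */
enum charset
{
    CS_ISO8859_1,
    CS_ISO8859_15,
    CS_WINDOWS_1252
};

/*
 * Hypothetical helper: store in *out the byte that encodes the Euro
 * sign (U+20AC) in the given charset. Returns 1 on success, or 0 if
 * the charset has no Euro sign, in which case *out is left untouched.
 */
static int
euro_byte(enum charset cs, unsigned char *out)
{
    assert(out != NULL);
    switch (cs)
    {
        case CS_ISO8859_15:
            *out = 0xA4;
            return 1;
        case CS_WINDOWS_1252:
            *out = 0x80;
            return 1;
        default:
            return 0;       /* e.g. ISO 8859-1 has no Euro sign */
    }
}
```

The same byte therefore cannot be copied between the two charsets unchanged, which is one reason a conversion step is needed.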

I have done the conversion for the mappings to and from Unicode, but
you could get the original tables at:

http://www.unicode.org/Public/MAPPINGS/ISO8859/

(You can get ISO 8859-10, 13 and 14 there as well! 10 is supposed to
be for Greenlandic and Sámi, 13 for the Baltic Rim, and 14 for Gaelic.)

I just found the following link on Google, where you can see quite a
few charsets (it doesn't have -16, which is probably too new):

http://www.kostis.net/charsets/

Patrice

-- 
Patrice Hédé
email: patrice hede à islande org
www  : http://www.islande.org/