Re: A rough roadmap for internationalization fixes - Mailing list pgsql-hackers

From Kurt Roeckx
Subject Re: A rough roadmap for internationalization fixes
Date
Msg-id 20031125181336.GA13791@ping.be
In response to Re: A rough roadmap for internationalization fixes  (Tatsuo Ishii <t-ishii@sra.co.jp>)
Responses Re: A rough roadmap for internationalization fixes  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Tue, Nov 25, 2003 at 08:40:57PM +0900, Tatsuo Ishii wrote:
> > On Tue, 25 Nov 2003, Peter Eisentraut wrote:
> > 
> > I've always thought unicode was enough to even represent Japanese. Then 
> > the client encoding can be something else that we can convert to. In any 
> > way, the encoding of the message catalog has to be known to the system so 
> > it can be converted to the correct encoding for the client.
> 
> I'm tired of telling that Unicode is not that perfect.

Maybe it should be explained what the problems really are,
instead of saying it "isn't perfect"?

From what I understand, the only problem is in converting between
a "legacy" encoding and Unicode, in either direction, and there is
no problem if you never do the conversion.

The conversion problem arises because a single character in a
legacy encoding can correspond to several different characters in
Unicode.

Some examples people might understand are:
- µ: In ISO 8859-1 it's char 0xB5.  In Unicode it can be U+00B5 (micro
sign) or U+03BC (Greek small letter mu).
- Å: In ISO 8859-1 it's char 0xC5.  In Unicode it can be U+00C5 (Latin
capital letter A with ring above) or U+212B (angstrom sign).
- The ohm sign (U+2126) vs the Greek capital letter omega (U+03A9).
- Quotation marks: You have left double quote, right double quote, and a few others.
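
To make the examples above concrete, here is a rough Python sketch. It
is only an illustration, not anything from the PostgreSQL conversion
code: the helper name can_encode_latin1 and the lookalikes table are
made up for this example, and it relies on Python's standard
unicodedata module. It shows that only one code point of each pair
converts back to ISO 8859-1, and that Unicode normalization folds some
of the pairs together.

```python
import unicodedata

# ISO 8859-1 byte 0xB5 decodes to exactly one code point: U+00B5 MICRO SIGN.
micro = b"\xb5".decode("iso-8859-1")

def can_encode_latin1(text):
    """Hypothetical helper: True if text can be converted to ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

# Pairs of code points that look alike but are distinct in Unicode.
lookalikes = {
    "micro sign": "\u00b5",
    "greek small mu": "\u03bc",
    "A with ring above": "\u00c5",
    "angstrom sign": "\u212b",
    "ohm sign": "\u2126",
    "greek capital omega": "\u03a9",
}

for label, ch in lookalikes.items():
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch):40}  "
          f"fits ISO 8859-1: {can_encode_latin1(ch)}")

# Normalization maps some of these onto their lookalikes:
# NFKC folds the micro sign into Greek mu, and NFC folds the
# angstrom and ohm signs into the ordinary letters.
nfkc_micro = unicodedata.normalize("NFKC", micro)
nfc_angstrom = unicodedata.normalize("NFC", "\u212b")
nfc_ohm = unicodedata.normalize("NFC", "\u2126")
```

So text that has been normalized, or typed with the Greek letter
instead of the sign, may no longer convert back to the legacy
encoding even though it looks the same on screen.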

> Another gottcha
> with Unicode is the UTF-8 encoding (currently we use) consumes 3
> bytes for each Kanji character, while other encodings consume only 2
> bytes. IMO 3/2 storage ratio could not be neglected for database use.

You can encode Unicode in different ways, and UTF-8 is only one
of them.  Is there a problem with using UCS-2 (except that it
would require more storage for ASCII)?
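
As a rough check of the storage numbers, a small Python sketch follows.
The function name encoded_sizes is made up for this example, and since
Python has no codec named UCS-2, it assumes UTF-16 (big-endian, no
byte order mark) as a stand-in, which gives the same bytes for
characters in the Basic Multilingual Plane.

```python
def encoded_sizes(text):
    """Hypothetical helper: byte length of text in a few encodings."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "EUC-JP": len(text.encode("euc_jp")),
        # Assumption: UTF-16BE stands in for UCS-2 (identical for BMP text).
        "UTF-16BE": len(text.encode("utf-16-be")),
    }

# Two kanji, U+6F22 and U+5B57, written as escapes to stay ASCII-safe.
kanji_sizes = encoded_sizes("\u6f22\u5b57")
ascii_sizes = encoded_sizes("abcd")

print("2 kanji:", kanji_sizes)
print("4 ASCII:", ascii_sizes)
```

For the kanji this gives 3 bytes per character in UTF-8 against 2 in
both EUC-JP and the 16-bit form, and for ASCII it gives 1 byte per
character in UTF-8 and EUC-JP against 2 in the 16-bit form.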


Kurt


