Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text - Mailing list pgsql-odbc

From Bart Samwel
Subject Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Msg-id 442C4F6C.2000607@samwel.tk
In response to Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields  (Johann Zuschlag <zuschlag2@online.de>)
List pgsql-odbc
Johann Zuschlag wrote:
> The problem with UTF-8 is that all ASCII characters are represented by
> one byte and all non ASCII characters, e.g. German Umlauts, are
> represented by two bytes. That's why UTF-8 is called a "variable-length
> multibyte encoding". In a pure Unicode world, e.g. U+xxxx with two
> bytes, every character is represented by two bytes (fixed-length
> multibyte encoding). So Unicode is not equal to UTF-8, even though the
> PostgreSQL documentation is stating that.

Well, it's even more complicated, because Unicode is larger than 16
bits: code points run from U+0000 to U+10FFFF, which needs 21 bits and
in practice means 32-bit storage for a fixed-length encoding. There is
UTF-8 (variable-length, 8 bits per code unit), UTF-16 (variable-length,
16 bits per code unit) and UTF-32 (fixed-length, 32 bits per code
unit). There is also UCS-2 (fixed-length 16-bit), which is limited to
the Basic Multilingual Plane, and UCS-4, which is functionally
identical to UTF-32. UTF-8 supports up to 4 bytes per character, so it
is more complete than the purely 16-bit UCS-2. Any of the
variable-length encodings, as well as the 32-bit UTF-32 and UCS-4
encodings, can represent the whole of the character set. A pure Unicode
world can use any of those encodings, so it's a tradeoff. If you want a
direct relationship between the number of characters in a string and
the number of bytes taken, use a fixed-length encoding. If you want to
be able to encode everything, use a variable-length encoding or a
32-bit encoding. If you want to use little space for mostly-ASCII text,
use UTF-8. That's it.
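
To make the tradeoff concrete, here is a small Python sketch that counts
the bytes each encoding needs for a few sample characters. The helper
name encoded_sizes and the choice of samples are mine, purely for
illustration; the little-endian codecs are used only so that no
byte-order mark gets counted.

```python
# Byte counts for single characters in three Unicode encodings.
# The "-le" codec variants are used so that no byte-order mark is added.

def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) byte counts for a one-character string."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(codec)) for codec in codecs)

# Illustrative samples: ASCII, a German umlaut, a BMP symbol, and a
# character outside the Basic Multilingual Plane.
samples = {
    "A": "\u0041",
    "a-umlaut": "\u00e4",
    "euro sign": "\u20ac",
    "musical G clef": "\U0001d11e",
}

for name, ch in samples.items():
    print("%-15s U+%04X %s" % (name, ord(ch), encoded_sizes(ch)))
```

Only UTF-32 gives the same size for every character. UTF-8 ranges from
1 to 4 bytes, and UTF-16 needs 4 bytes (two 16-bit units) once you
leave the Basic Multilingual Plane.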

> Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian.
> Unfortunately (or fortunately?) Windows seems to use UTF-8 for European
> languages. Hiroshi can you explain that? I guess the Japanese edition of
> Windows XP is using pure 2 byte Unicode.

In fact, the Win32 API is UTF-16 even in European languages (it started
out as UCS-2 but became UTF-16 when Unicode grew beyond 16 bits :-) ),
but it provides an 8-bit compatibility interface. I don't know whether
the 8-bit encoding is UTF-8 or plain 8-bit code pages, though.
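
For what the UCS-2 to UTF-16 step means in practice, here is a rough
Python sketch of how UTF-16 splits a code point above U+FFFF into a
surrogate pair. The function name utf16_code_units is made up for this
example, and it assumes a valid one-character string that is not itself
a lone surrogate.

```python
def utf16_code_units(ch):
    """Return the list of 16-bit UTF-16 code units for one character.

    Assumes ch is a single valid character, not a lone surrogate.
    """
    cp = ord(ch)
    if cp < 0x10000:
        # Inside the Basic Multilingual Plane: one unit, same as UCS-2.
        return [cp]
    # Outside the BMP: subtract 0x10000 and split the remaining 20 bits
    # into a high surrogate (top 10 bits) and low surrogate (low 10 bits).
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]

for ch in ("\u00e4", "\U0001d11e"):
    units = utf16_code_units(ch)
    print("U+%04X ->" % ord(ch), " ".join("%04X" % u for u in units))
```

A UCS-2 consumer sees those two units as two separate (and invalid)
characters, which is why code that assumes one 16-bit unit per
character breaks outside the BMP.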

Reference: http://en.wikipedia.org/wiki/Unicode

Cheers,
Bart
