From:           Björn Lundin
Subject:        Re: Unicode problem ???
Date:
Msg-id:         c66idm$22q5$1@news.hub.org
In response to: Re: Unicode problem ??? ("Stijn Vanroye" <s.vanroye@farcourier.com>)
List:           pgsql-general
Stijn Vanroye wrote:
> From what I hear, UNICODE indeed seems the best option. But then again,
> that encoding stuff is still a bit of a mystery to me. What I personally
> don't

The following is cut from the documentation of XML Ada
(xmlada-1.0/docs/xml_2.html#SEC6), which is available at
http://libre.act-europe.fr/xmlada/

<quote>
We now know how each encoded character can be represented by an integer
value (its code point), depending on the character set. Character encoding
schemes deal with the representation of a sequence of integers as a
sequence of code units. A code unit is a sequence of bytes on a computer
architecture.

A number of encoding schemes exist. Some of them encode all integers on
the same number of bytes; these are called fixed-width encoding forms, and
include the standard encoding for Internet e-mail (7 bits, but it can't
encode all characters), the simple 8-bit scheme, and the EBCDIC scheme.
Among them is also the UTF-32 scheme defined in the Unicode standard.

Other encoding schemes encode integers on a variable number of bytes.
These include two schemes that are also defined in the Unicode standard,
namely Utf8 and Utf16.

Unicode does not impose any specific encoding, but it is most often
associated with one of the Utf encodings. Each has its own properties and
advantages:

Utf32
    This is the simplest of these encodings: it encodes every character
    on 32 bits (4 bytes). It can represent all possible Unicode
    characters and is obviously straightforward to manipulate. However,
    given that the first 65535 characters of Unicode are enough to encode
    all known languages currently in use, Utf32 is also a waste of space
    in most cases.

Utf16
    For the above reason, Utf16 was defined. Most characters are encoded
    on only two bytes (enough for the first 65535, and thus most current,
    characters). In addition, a number of special code points known as
    surrogate pairs make it possible to encode integers greater than
    65535; such characters are encoded on four bytes. As a result, Utf16
    requires much less space than Utf32 to encode sequences of
    characters, but it is also more complex to decode.

Utf8
    This is an even more space-efficient encoding, but it is also more
    complex to decode. More importantly, it is compatible with the most
    commonly used simple 8-bit encoding. Utf8 has the following
    properties:

    - Characters 0 to 127 (ASCII) are encoded simply as a single byte,
      so files and strings that contain only 7-bit ASCII characters have
      the same encoding under both ASCII and UTF-8.
    - Characters greater than 127 are encoded as a sequence of several
      bytes, each of which has its most significant bit set. Therefore,
      no ASCII byte can appear as part of any other character.
    - The first byte of a multibyte sequence representing a non-ASCII
      character is always in the range 0xC0 to 0xFD, and it indicates how
      many bytes follow for this character. All further bytes in a
      multibyte sequence are in the range 0x80 to 0xBF. This allows easy
      resynchronization and makes the encoding stateless and robust
      against missing bytes.
    - UTF-8 encoded characters may theoretically be up to six bytes long;
      however, the characters representable in 16 bits are at most three
      bytes long.

Note that the encodings above, except for Utf8, come in two versions,
depending on the byte order chosen on the machine.
</quote>

So yes, Unicode in Utf8 is tricky to handle.

/Björn
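P.S. To make the surrogate-pair arithmetic quoted above concrete, here is
a minimal Python sketch (Python is my choice here, not from the XML Ada
docs): a code point above 0xFFFF is reduced to a 20-bit value and split
across the two reserved 16-bit surrogate ranges.

def to_utf16_surrogates(cp):
    # Split a code point above 0xFFFF into a (high, low) surrogate pair,
    # as defined by the Unicode standard.
    assert cp > 0xFFFF
    v = cp - 0x10000             # now a 20-bit value
    high = 0xD800 | (v >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)   # bottom 10 bits -> low surrogate
    return high, low

# U+1D11E (musical G clef) needs four bytes in Utf16:
high, low = to_utf16_surrogates(0x1D11E)
print(hex(high), hex(low))                     # 0xd834 0xdd1e
print('\U0001D11E'.encode('utf-16-be').hex())  # d834dd1e, the same pair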
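The Utf8 byte layout can be checked the same way. This sketch (again
Python, purely for illustration) classifies each byte of a string
according to the ranges quoted above; note how the non-ASCII letter in my
own name becomes two bytes while the ASCII letters stay single bytes.

def describe(byte):
    # Classify a byte per the UTF-8 ranges quoted above.
    if byte < 0x80:
        return "ASCII, stands alone"
    if byte <= 0xBF:
        return "continuation byte (0x80..0xBF)"
    if byte <= 0xDF:
        return "lead byte of a 2-byte sequence"
    if byte <= 0xEF:
        return "lead byte of a 3-byte sequence"
    return "lead byte of a 4- to 6-byte sequence"

for ch in "Björn":
    data = ch.encode("utf-8")
    print(ch, [(hex(b), describe(b)) for b in data])
# 'ö' (U+00F6) encodes as 0xc3 (lead byte) followed by 0xb6
# (continuation byte); every other letter is a single ASCII byte.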