Thread: UTF-8 question.
I'm new to PostgreSQL, and from the looks of it, it's a great database, and I'll be using more of it in the future. I had a quick question if anyone could clear this up. The documentation for PostgreSQL (version 7.1, the version this server is using) says that it supports multibyte character encodings like Unicode (which implies UTF-16 encoding). Later on, the same page says that Unicode is represented using UTF-8 encoding. UTF-8 is the 8-bit version of Unicode. The multibyte version of Unicode is UTF-16. So, which is it? If I create a database using Unicode as the encoding, will the encoding be UTF-8 (singlebyte) or UTF-16 (multibyte)? Thanks! Rich
At 8:39 PM -0400 9/16/04, Richard Connamacher wrote:
>I'm new to PostgreSQL, and from the looks of it, it's a great database,
>and I'll be using more of it in the future.
>
>I had a quick question if anyone could clear this up. The documentation
>for PostgreSQL (version 7.1, the version this server is using) says that
>it supports multibyte character encodings like Unicode (which implies
>UTF-16 encoding).

Don't confuse Unicode, the 'character set' and rules for characters, represented by a sequence of abstract 32 bit integers, with UTF-[8|16|32] which is a way to encode those abstract integers into a stream of bytes someplace.

> Later on, the same page says that Unicode is
>represented using UTF-8 encoding. UTF-8 is the 8-bit version of Unicode.
>The multibyte version of Unicode is UTF-16.
>
>So, which is it? If I create a database using Unicode as the encoding,
>will the encoding be UTF-8 (singlebyte) or UTF-16 (multibyte)?

Erm... UTF-8 *is* a multibyte encoding. Up to 6 bytes per code point, if things get really degenerate. (And, last I checked, means you can have up to 70 bytes for really degenerate characters, but my memory might be off (could be 80)) UTF-8, UTF-16, and UTF-32 will all encode Unicode characters just fine.

--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
dan@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk
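
To make the code point versus encoding distinction concrete, here is a minimal sketch. Python is an assumption of this example (the thread itself only uses psql); it relies only on the built-in `ord` and `str.encode`. The same character is one abstract integer, but its byte representation differs per encoding form.

```python
# One abstract code point, three byte-level encodings.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# The abstract integer that Unicode assigns to the character.
code_point = ord(ch)

# The "-be" codec names fix big-endian byte order and emit no byte-order mark.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(hex(code_point))  # 0xe9
print(utf8.hex())       # c3a9
print(utf16.hex())      # 00e9
print(utf32.hex())      # 000000e9
```

All three byte strings decode back to the same single character, which is the sense in which each of them "encodes Unicode just fine".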
On Sep 17, 2004, at 9:39 AM, Richard Connamacher wrote:
> UTF-8 is the 8-bit version of Unicode.
> The multibyte version of Unicode is UTF-16.

UTF-8 encodes characters with varying numbers of bytes, not just 1 byte per character. IIRC, it's anywhere from 1 to 5 bytes, actually. PostgreSQL uses UTF-8.

If you can, upgrade. 7.1 is nearing prehistoric. :)

Michael Glaesemann
grzm myrealbox com
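
For reference, a short sketch of the variable-length behavior. Under the current definition of UTF-8 (RFC 3629) a code point takes 1 to 4 bytes; the original design allowed sequences of up to 6 bytes, which may be where the larger figures quoted in this thread come from. Python and the sample characters are assumptions of this example, with one arbitrary character picked from each length class.

```python
# One sample character per UTF-8 length class (arbitrary picks).
samples = {
    "A": "\u0041",           # U+0041, ASCII range
    "e-acute": "\u00e9",     # U+00E9
    "euro": "\u20ac",        # U+20AC
    "g-clef": "\U0001d11e",  # U+1D11E, outside the Basic Multilingual Plane
}

# Number of bytes each character occupies once encoded as UTF-8.
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, n in lengths.items():
    print(f"{name}: {n} byte(s)")

# The highest valid code point, U+10FFFF, still fits in 4 bytes.
max_len = len("\U0010ffff".encode("utf-8"))
print(max_len)
```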
Thanks to both Dan Sugalski and Michael Glaesemann for answering my question. I probably should have realized that, while Latin letters are one byte, the fact that others are encoded into up to 5-byte groups qualifies it as a multi-byte encoding.

I don't anticipate having very many non-Latin letters in my database; I just want to have the option if it ever becomes necessary. So UTF-8 will be much more space efficient for my needs.

7.1 may be prehistoric, but it's running on an off-site server that I'm renting, and this version came pre-installed. Since it's already there and working, I'd like to get familiar with it before I try to reinstall a newer version. I doubt I'd know what to do with many of the newer features anyway, since this is my first time playing with PostgreSQL and my knowledge is currently limited to simple relationships and basic SQL queries.

Many thanks for the clarification,
Rich
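
The space argument for mostly-Latin text can be checked with a rough sketch. Python and the sample string are assumptions made up for illustration; actual savings depend entirely on the real data, and this says nothing about how PostgreSQL lays out rows on disk.

```python
# Hypothetical sample: plain Latin text with a single accented letter.
text = "Caf\u00e9 menu: 12 items, mostly plain Latin letters"

# Encoded sizes in bytes; "utf-16-le" emits no byte-order mark.
utf8_size = len(text.encode("utf-8"))
utf16_size = len(text.encode("utf-16-le"))

print(len(text), utf8_size, utf16_size)
```

Here UTF-8 spends one byte per ASCII character and two on the accented letter, while UTF-16 spends two bytes on every character in the string.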
"Richard Connamacher" <rich.n1@indieimage.com> writes:
> 7.1 may be prehistoric, but it's running on an off-site server that I'm
> renting, and this version came pre-installed. Since it's already there
> and working, I'd like to get familiar with it before I try to reinstall
> a newer version. I doubt I'd know what to do with many of the newer
> features anyway,

It's not so much "more features" as "fewer bugs". There are known data loss problems in 7.1.* and before (transaction ID wraparound, for instance, though you might call that a design shortcoming rather than a bug per se).

regards, tom lane
=> show client_encoding ;
 client_encoding
-----------------
 UNICODE
(1 row)

=> select char_length('a'), bit_length('a');
 char_length | bit_length
-------------+------------
           1 |          8
(1 row)

# that's an accented "e"
=> select char_length('é'), bit_length('é');
 char_length | bit_length
-------------+------------
           1 |         16    <= two bytes
(1 row)

pg does not simply store UTF-8 data; it also understands it if you set your encoding correctly (i.e. initdb with UNICODE, and client_encoding set to match, so that data doesn't get mangled on the way to the db). It will refuse to eat illegal UTF-8 characters too.

Once you try Unicode, all the codepage mess starts to look old...
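
The same two observations (character count versus encoded size, and rejection of invalid byte sequences) can be reproduced outside the database. This is only a sketch in Python, which is an assumption of the example; the helper names `char_and_bit_length` and `is_valid_utf8` are hypothetical, and it mirrors rather than reproduces what PostgreSQL does internally.

```python
def char_and_bit_length(s):
    """Return (number of characters, number of bits in the UTF-8 encoding)."""
    return len(s), 8 * len(s.encode("utf-8"))


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(char_and_bit_length("a"))       # (1, 8)
print(char_and_bit_length("\u00e9"))  # (1, 16)

print(is_valid_utf8(b"\xc3\xa9"))  # True: the UTF-8 bytes for the accented "e"
print(is_valid_utf8(b"\xe9"))      # False: a lone Latin-1 byte is not valid UTF-8
```

The last line shows the kind of mangling that happens when Latin-1 data is sent over a connection declared as UTF-8, which is why the client encoding has to match what the client actually sends.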