Thread: UTF-8 question.

UTF-8 question.

From

"Richard Connamacher"

Date:

17 September 2004, 01:40:11

I'm new to PostgreSQL, and from the looks of it, it's a great database,
and I'll be using more of it in the future.

I had a quick question if anyone could clear this up. The documentation
for PostgreSQL (version 7.1, the version this server is using) says that
it supports multibyte character encodings like Unicode (which implies
UTF-16 encoding). Later on, the same page says that Unicode is
represented using UTF-8 encoding. UTF-8 is the 8-bit version of Unicode.
The multibyte version of Unicode is UTF-16.

So, which is it? If I create a database using Unicode as the encoding,
will the encoding be UTF-8 (singlebyte) or UTF-16 (multibyte)?

Thanks!
Rich

Re: UTF-8 question.

From

Dan Sugalski

Date:

17 September 2004, 02:13:28

At 8:39 PM -0400 9/16/04, Richard Connamacher wrote:
>I'm new to PostgreSQL, and from the looks of it, it's a great database,
>and I'll be using more of it in the future.
>
>I had a quick question if anyone could clear this up. The documentation
>for PostgreSQL (version 7.1, the version this server is using) says that
>it supports multibyte character encodings like Unicode (which implies
>UTF-16 encoding).

Don't confuse Unicode, the 'character set' and rules for characters,
represented by a sequence of abstract integers (code points, in the
range 0 to 0x10FFFF), with UTF-[8|16|32], which are ways to encode
those abstract integers into a stream of bytes someplace.

>  Later on, the same page says that Unicode is
>represented using UTF-8 encoding. UTF-8 is the 8-bit version of Unicode.
>The multibyte version of Unicode is UTF-16.
>
>So, which is it? If I create a database using Unicode as the encoding,
>will the encoding be UTF-8 (singlebyte) or UTF-16 (multibyte)?

Erm... UTF-8 *is* a multibyte encoding. Up to 4 bytes per code point
as currently defined (RFC 3629); the original scheme allowed up to 6,
if things get really degenerate. (And, last I checked, that means you
can have up to 70 bytes for really degenerate characters, but my memory
might be off (could be 80))

UTF-8, UTF-16, and UTF-32 will all encode Unicode characters just fine.
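
To make the code point versus encoding distinction concrete, here is a
minimal sketch in Python (not PostgreSQL; the helper name encodings_of is
made up for illustration). It takes one abstract code point and shows the
different byte sequences each encoding produces for it.

```python
# Illustration only: one abstract code point, three byte-level encodings.
def encodings_of(ch):
    """Return the code point of ch and its bytes in each UTF encoding."""
    return {
        "code_point": ord(ch),
        "utf-8": ch.encode("utf-8"),
        "utf-16": ch.encode("utf-16-be"),  # big-endian, no byte order mark
        "utf-32": ch.encode("utf-32-be"),  # big-endian, no byte order mark
    }

info = encodings_of("\u00e9")  # U+00E9, a lowercase e with acute accent
print(f"code point: U+{info['code_point']:04X}")
for name in ("utf-8", "utf-16", "utf-32"):
    print(name, info[name].hex(), len(info[name]), "bytes")
```

The integer is the same in every case; only the bytes written out differ.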
--
                Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
dan@sidhe.org                         have teddy bears and even
                                       teddy bears get drunk

Re: UTF-8 question.

From

Michael Glaesemann

Date:

17 September 2004, 02:20:42

On Sep 17, 2004, at 9:39 AM, Richard Connamacher wrote:

> UTF-8 is the 8-bit version of Unicode.
> The multibyte version of Unicode is UTF-16.

UTF-8 encodes characters with varying numbers of bytes, not just 1 byte
per character. It's anywhere from 1 to 4 bytes under the current
definition (RFC 3629); the original design allowed up to 6.
PostgreSQL uses UTF-8.
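
As a rough sketch of how the byte count varies, the Python function below
(a hypothetical helper, not anything PostgreSQL provides) maps a code point
to its UTF-8 length using the range boundaries from RFC 3629, under which
the maximum is 4 bytes. It ignores the surrogate range, which UTF-8 does
not encode at all.

```python
def utf8_length(code_point):
    """Bytes UTF-8 (as limited by RFC 3629) needs for one code point.

    Surrogates (U+D800..U+DFFF) are not checked here; a real encoder
    rejects them.
    """
    if code_point < 0x80:
        return 1            # ASCII
    if code_point < 0x800:
        return 2            # e.g. accented Latin letters
    if code_point < 0x10000:
        return 3            # rest of the Basic Multilingual Plane
    if code_point <= 0x10FFFF:
        return 4            # supplementary planes
    raise ValueError("beyond the Unicode range")

# Cross-check the table against Python's own encoder.
samples = ["a", "\u00e9", "\u20ac", "\U0001d11e"]
for ch in samples:
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: {utf8_length(ord(ch))} byte(s)")
```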

If you can, upgrade. 7.1 is nearing prehistoric. :)

Michael Glaesemann
grzm myrealbox com

Re: UTF-8 question.

From

"Richard Connamacher"

Date:

17 September 2004, 03:06:42

Thanks to both Dan Sugalski and Michael Glaesemann for answering my
question. I probably should have realized that, while Latin letters are
one byte each, the fact that other characters are encoded as multi-byte
groups qualifies UTF-8 as a multibyte encoding. I don't anticipate having
very many non-Latin letters in my database; I just want to have the
option if it ever becomes necessary. So UTF-8 will be much more space
efficient for my needs.
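
A quick way to see the space difference for mostly-Latin text, sketched in
Python rather than PostgreSQL (the sample string is made up, and actual
on-disk sizes in a database include other overhead):

```python
# A mostly-Latin sample with a single accented character.
text = "PostgreSQL caf\u00e9"

# Byte counts for the same text in each encoding (little-endian, no BOM).
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
for enc, size in sizes.items():
    print(f"{enc}: {size} bytes for {len(text)} characters")
```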

7.1 may be prehistoric, but it's running on an off-site server that I'm
renting, and this version came pre-installed. Since it's already there
and working, I'd like to get familiar with it before I try to reinstall
a newer version. I doubt I'd know what to do with many of the newer
features anyway, since this is my first time playing with PostgreSQL and
my knowledge is currently limited to simple relationships and basic SQL
queries.

Many thanks for the clarification,
Rich


Re: UTF-8 question.

From

Tom Lane

Date:

17 September 2004, 04:23:46

"Richard Connamacher" <rich.n1@indieimage.com> writes:
> 7.1 may be prehistoric, but it's running on an off-site server that I'm
> renting, and this version came pre-installed. Since it's already there
> and working, I'd like to get familiar with it before I try to reinstall
> a newer version. I doubt I'd know what to do with many of the newer
> features anyway,

It's not so much "more features" as "fewer bugs".  There are known data
loss problems in 7.1.* and before (transaction ID wraparound, for
instance, though you might call that a design shortcoming rather than
a bug per se).

            regards, tom lane

Re: UTF-8 question.

From

Pierre-Frédéric Caillaud

Date:

17 September 2004, 07:40:56

=> show client_encoding;
 client_encoding
-----------------
 UNICODE
(1 row)

=> select char_length('a'), bit_length('a');
 char_length | bit_length
-------------+------------
           1 |          8
(1 row)


# that's an accented "e"
=> select char_length('é'), bit_length('é');
 char_length | bit_length
-------------+------------
           1 |         16        <= two bytes
(1 row)


    PostgreSQL does not merely store UTF-8 data; it also understands it,
provided you set the encoding correctly (i.e. initdb with the UNICODE
encoding, and client_encoding set to match, so that data doesn't get
mangled on the way to the database). It will also refuse to accept
invalid UTF-8 byte sequences.
    Once you try Unicode, all the codepage mess starts to look old...
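
For a feel of what "illegal UTF-8" means, here is a small Python sketch.
It uses Python's strict decoder as a stand-in; it is not the check the
PostgreSQL server performs, and is_valid_utf8 is a made-up helper name.

```python
def is_valid_utf8(data):
    """True if the bytes decode as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"\xc3\xa9"))  # well-formed two-byte sequence for U+00E9
print(is_valid_utf8(b"\xe9"))      # lone Latin-1 byte for the same letter
print(is_valid_utf8(b"\xc0\xaf"))  # overlong encoding of "/"
```

The second case is the typical symptom of a client sending Latin-1 data
while the connection is declared as Unicode.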
