Re: UTF8 or Unicode - Mailing list pgsql-hackers

From Karel Zak
Subject Re: UTF8 or Unicode
Date
Msg-id 1108459323.4044.171.camel@petra
In response to Re: UTF8 or Unicode  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: UTF8 or Unicode  (Peter Eisentraut <peter_e@gmx.net>)
List pgsql-hackers
On Mon, 2005-02-14 at 22:05 -0500, Bruce Momjian wrote:
> Abhijit Menon-Sen wrote:
> > At 2005-02-14 21:14:54 -0500, pgman@candle.pha.pa.us wrote:
> > >
> > > Should our multi-byte encoding be referred to as UTF8 or Unicode?
> > 
> > The *encoding* should certainly be referred to as UTF-8. Unicode is a
> > character set, not an encoding; Unicode characters may be encoded with
> > UTF-8, among other things.
> > 
> > (One might think of a charset as being a set of integers representing
> > characters, and an encoding as specifying how those integers may be
> > converted to bytes.)
> > 
> > > I know UTF8 is a type of unicode but do we need to rename anything
> > > from Unicode to UTF8?
> > 
> > I don't know. I'll go through the documentation to see if I can find
> > anything that needs changing.
> 
> I looked at encoding.sgml and that mentions Unicode, and then UTF8 as an
> acronym. I am wondering if we need to make UTF8 first and Unicode
> second.  Does initdb accept UTF8 as an encoding?

In PG: unicode = utf8 = utf-8

Our internal routines in src/backend/utils/mb/encnames.c accept all
synonyms. The "official" internal PG name for UTF-8 is "UNICODE" :-(

UTF8 = UNICODE for historical reasons: "UNICODE" was there first. It's
the same as "WIN" for WIN1251 (in the sources it's marked as a
"_dirty_ alias")...

I think initdb uses pg_char_to_encoding() from
src/backend/utils/mb/encnames.c, so it should accept all aliases.

Karel

-- 
Karel Zak <zakkr@zf.jcu.cz>
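
A rough sketch of how alias-tolerant encoding lookup of this kind can work. This is not the C code from encnames.c; the table contents, the normalization rule (drop non-alphanumerics, lowercase), and the names ALIASES, normalize and lookup_encoding are all assumptions made for illustration. The only facts taken from the message are that unicode, utf8 and utf-8 are treated as the same encoding and that "UNICODE" is the official internal name.

```python
# Hypothetical alias table: normalized spelling -> official internal name.
# The real table in encnames.c is larger; these entries come from the message.
ALIASES = {
    "unicode": "UNICODE",
    "utf8": "UNICODE",
}


def normalize(name):
    """Drop everything except letters and digits, then lowercase (assumed rule)."""
    return "".join(c for c in name if c.isalnum()).lower()


def lookup_encoding(name):
    """Return the official name for any accepted spelling, or None if unknown."""
    return ALIASES.get(normalize(name))


for spelling in ("unicode", "UTF8", "utf-8", "no-such-encoding"):
    print(spelling, "->", lookup_encoding(spelling))
```

Because "utf-8" normalizes to "utf8", a single table entry covers both spellings, so the table only needs one row per genuinely distinct alias.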


