Thread: Unicode comment on Postgres vs Sql Server

Unicode comment on Postgres vs Sql Server

From: "Swaminathan Saikumar"
Date: 02 March 2008, 15:56:42

I am familiar with MS Sql Server & just started using Postgres.
For Unicode, Sql Server uses nvarchar/nchar, and uses char/varchar for ASCII.
Postgres has this encoding setting at the database level.

I am using UTF8 Unicode for most of my data, but there is some data that I know for sure will be ASCII. However, this is also stored as UTF8, using up more space.

At first sight, it looks like the more granular level design is better. Any comments? If you agree, does it make sense to add this as a new datatype to Postgres?

Thanks

Re: Unicode comment on Postgres vs Sql Server

From: Stephane Bortzmeyer
Date: 02 March 2008, 16:40:32

On Sun, Mar 02, 2008 at 11:50:01AM -0800,
 Swaminathan Saikumar <swami@giveexam.com> wrote
 a message of 30 lines which said:

> Postgres has this encoding setting at the database level.

Which is simpler, IMHO. "One encoding to rule them all"

> I am using UTF8 Unicode for most of my data, but there is some data
> that I know for sure will be ASCII. However, this is also stored as
> UTF8, using up more space.

Excuse me, but this shows a serious ignorance of UTF-8. A character of
the ASCII range, in UTF-8, is stored in one byte, exactly the same
size as ASCII (any ASCII file is a UTF-8 file, that's an important
property of UTF-8).
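
As a quick illustration of that property, here is a minimal Python sketch (not part of the original message; the sample string is made up) that encodes the same ASCII-only text both ways and checks that the bytes are identical:

```python
# Hypothetical sample text containing only ASCII characters.
text = "SELECT name FROM users;"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Both encoders produce the same bytes: one byte per character.
same_bytes = ascii_bytes == utf8_bytes
print(same_bytes, len(text), len(ascii_bytes), len(utf8_bytes))

# Bytes written as ASCII can be read back as UTF-8 unchanged.
round_trip = ascii_bytes.decode("utf-8")
print(round_trip == text)
```

The reverse does not hold: UTF-8 text that contains non-ASCII characters cannot be decoded as ASCII.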

Re: Unicode comment on Postgres vs Sql Server

From: Tino Wildenhain
Date: 02 March 2008, 17:04:30

Swaminathan Saikumar wrote:
> I didn't have proper knowledge about the UTF8 format, thanks.
> I originally meant nvarchar & nchar, which is basically varchar & char
> that supports Unicode regardless of the database encoding.

Well, we don't need that when we have UTF8. There could be edge cases
where using UTF-16 or UTF-32 internally is faster, but I'm not sure
that would justify a new datatype.
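
To put rough numbers on the storage side of that trade-off, here is a small Python sketch (an illustration added here, not something from the thread; the sample strings and the helper name encoded_sizes are made up). It only compares encoded byte counts and says nothing about speed or about how any database stores text internally:

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Made-up samples: pure ASCII, Latin text with accents, and Japanese.
samples = ["plain ascii", "naïve café", "日本語"]
for sample in samples:
    print(repr(sample), encoded_sizes(sample))
```

For ASCII-heavy data UTF-8 is the smallest of the three, while the wider encodings only come out ahead on size for text dominated by characters that need three bytes in UTF-8.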

The current problem isn't so much with the encoding, which is set per
database; it's more about collation, which is set for the whole database
cluster. That is not easily solved if you want to do it according to the
SQL spec.

You could work around that with a functional index.

Regards
Tino Wildenhain

Re: Unicode comment on Postgres vs Sql Server

From: Tino Wildenhain
Date: 02 March 2008, 17:07:05

Swaminathan Saikumar wrote:
> I am familiar with MS Sql Server & just started using Postgres.
> For Unicode, Sql Server uses nvarchar/nchar, and uses
> char/varchar for ASCII.
> Postgres has this encoding setting at the database level.
>
> I am using UTF8 Unicode for most of my data, but there is some data that
> I know for sure will be ASCII. However, this is also stored as UTF8,
> using up more space.

This is wrong - ASCII is a subset of UTF8, so every ASCII char
takes exactly one byte.

See http://en.wikipedia.org/wiki/UTF-8 for example.

>
> At first sight, it looks like the more granular level design is
> better. Any comments? If you agree, does it make sense to add this as a
> new datatype to Postgres?

Which new datatype?

Regards
Tino

Re: Unicode comment on Postgres vs Sql Server

From: "Swaminathan Saikumar"
Date: 02 March 2008, 17:10:51

I didn't have proper knowledge about the UTF8 format, thanks.
I originally meant nvarchar & nchar, which is basically varchar & char that supports Unicode regardless of the database encoding.

Re: Unicode comment on Postgres vs Sql Server

From: "Leif B. Kristensen"
Date: 02 March 2008, 17:35:14

On Sunday 2. March 2008, Swaminathan Saikumar wrote:
> I am using UTF8 Unicode for most of my data, but there is some data
> that I know for sure will be ASCII. However, this is also stored as
> UTF8, using up more space.

ASCII stored as UTF8 doesn't take up more space than plain ASCII; it's
exactly the same thing. It's one byte per character unless the
code point is above 127.
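
For the curious, the byte count per character can be sketched in Python. The helper name utf8_width is hypothetical, and the thresholds are the standard UTF-8 code point ranges; the loop cross-checks them against Python's own encoder:

```python
def utf8_width(ch):
    """Number of bytes UTF-8 uses to encode a single character."""
    code_point = ord(ch)
    if code_point <= 0x7F:      # 0..127: the ASCII range
        return 1
    if code_point <= 0x7FF:     # 128..2047
        return 2
    if code_point <= 0xFFFF:    # rest of the Basic Multilingual Plane
        return 3
    return 4                    # supplementary planes


# Sample characters picked for illustration, one or two per width.
for ch in ["A", "~", "é", "€", "😀"]:
    # Cross-check the table above against the built-in encoder.
    assert utf8_width(ch) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} takes {utf8_width(ch)} byte(s)")
```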
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/