Thread: Unicode comment on Postgres vs Sql Server
I am familiar with MS Sql Server & just started using Postgres.
For storing Unicode, Sql Server uses nvarchar/char for unicode, and uses char/varchar for ASCII.
Postgres has this encoding setting at the database level.
I am using UTF8 Unicode for most of my data, but there is some data that I know for sure will be ASCII. However, this is also stored as UTF8, using up more space.
At first sight, it looks like the more granular design is better. Any comments? If you agree, does it make sense to add this as a new datatype to Postgres?
Thanks
On Sun, Mar 02, 2008 at 11:50:01AM -0800, Swaminathan Saikumar <swami@giveexam.com> wrote a message of 30 lines which said:

> Postgres has this encoding setting at the database level.

Which is simpler, IMHO. "One encoding to rule them all"

> I am using UTF8 Unicode for most of my data, but there is some data
> that I know for sure will be ASCII. However, this is also stored as
> UTF8, using up more space.

Excuse me, but this shows a serious ignorance of UTF-8. A character of the ASCII range, in UTF-8, is stored in one byte, exactly the same size as ASCII (any ASCII file is an UTF-8 file; that's an important property of UTF-8).
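[Editor's note: a small Python sketch, not part of the original thread, illustrating the point above — pure-ASCII text encoded as UTF-8 is byte-for-byte identical to its ASCII encoding, and only characters above U+007F cost extra bytes.]

```python
# UTF-8 is a superset of ASCII: code points 0-127 encode as a single
# byte, identical to their ASCII encoding. Only characters above
# U+007F need multi-byte sequences.
ascii_text = "plain ASCII costs nothing extra"
utf8_bytes = ascii_text.encode("utf-8")

# One byte per character for the ASCII range.
assert len(utf8_bytes) == len(ascii_text)

# The bytes are identical to the ASCII encoding, which is why
# any ASCII file is already a valid UTF-8 file.
assert utf8_bytes == ascii_text.encode("ascii")

# Non-ASCII characters are where UTF-8 spends extra bytes:
# 'é' (U+00E9) takes 2 bytes, '€' (U+20AC) takes 3.
assert len("é".encode("utf-8")) == 2
assert len("€".encode("utf-8")) == 3
```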
Swaminathan Saikumar wrote:
> I didn't have proper knowledge about the UTF8 format, thanks.
> I originally meant nvarchar & nchar, which is basically varchar & char
> that supports Unicode regardless of the database encoding.

Well, we don't need that when we have UTF8. There could be edge cases speed-wise when you use UCS16 or UCS32 internally, but I'm not sure how well this would justify a new datatype. The current problem isn't so much with encoding database-wise; it's more about collating database-cluster-wise, which is something not easily solved when you want to do it according to the SQL spec. You could work around that with a functional index.

Regards
Tino Wildenhain
Swaminathan Saikumar wrote:
> I am familiar with MS Sql Server & just started using Postgres.
> For storing Unicode, Sql Server uses nvarchar/char for unicode, and uses
> char/varchar for ASCII.
> Postgres has this encoding setting at the database level.
>
> I am using UTF8 Unicode for most of my data, but there is some data that
> I know for sure will be ASCII. However, this is also stored as UTF8,
> using up more space.

This is wrong - ASCII is a subset of UTF8 and therefore uses exactly one byte for every ASCII char. See http://en.wikipedia.org/wiki/UTF-8 for example.

> At first sight, it looks like the more granular level design is
> better. Any comments? If you agree, does it make sense to add this as a
> new datatype to Postgres?

Which new datatype?

Regards
Tino
I didn't have proper knowledge about the UTF8 format, thanks.
I originally meant nvarchar & nchar, which is basically varchar & char that supports Unicode regardless of the database encoding.
On 3/2/08, Tino Wildenhain <tino@wildenhain.de> wrote:
Swaminathan Saikumar wrote:
> I am familiar with MS Sql Server & just started using Postgres.
> For storing Unicode, Sql Server uses nvarchar/char for unicode, and uses
> char/varchar for ASCII.
> Postgres has this encoding setting at the database level.
>
> I am using UTF8 Unicode for most of my data, but there is some data that
> I know for sure will be ASCII. However, this is also stored as UTF8,
> using up more space.
This is wrong - ASCII is a subset of UTF8 and therefore uses
exactly one byte for every ASCII char.
See http://en.wikipedia.org/wiki/UTF-8 for example.
>
> At first sight, it looks like the more granular level design is
> better. Any comments? If you agree, does it make sense to add this as a
> new datatype to Postgres?
Which new datatype?
Regards
Tino
On Sunday 2. March 2008, Swaminathan Saikumar wrote:
> I am using UTF8 Unicode for most of my data, but there is some data
> that I know for sure will be ASCII. However, this is also stored as
> UTF8, using up more space.

ASCII stored as UTF8 doesn't take up more space than plain ASCII; it's exactly the same thing. It's one byte per character unless the character number is above 127.

--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/