Thread: What's a good default encoding?

What's a good default encoding?

From

CSN

Date:

14 March 2006, 16:10:55

If you're going to be putting emdashes, letters with
lines and circles above them, and similar stuff that's
mostly European and American in a database, what's a
good default encoding to use - UTF-8?

CSN

Re: What's a good default encoding?

From

John DeSoi

Date:

14 March 2006, 23:00:33

On Mar 14, 2006, at 3:10 PM, CSN wrote:

> If you're going to be putting emdashes, letters with
> lines and circles above them, and similar stuff that's
> mostly European and American in a database, what's a
> good default encoding to use - UTF-8?

Yes, UTF-8 is a good choice because it can represent every Unicode
character and is compact for Latin-based text: ASCII characters take
one byte each, and most accented Latin letters take two.
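
To make the size argument concrete, here is a minimal sketch in Python (not a language used in this thread; the sample characters and the helper name `utf8_length` are chosen purely for illustration). It prints how many bytes a few characters occupy when encoded as UTF-8:

```python
# Sample characters (arbitrary picks): e, e-acute, a-macron, a-ring, em dash, a CJK character.
samples = ["e", "\u00e9", "\u0101", "\u00e5", "\u2014", "\u65e5"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    # !a prints an ASCII-safe representation, so this works on any terminal.
    print(f"U+{ord(ch):04X} {ch!a}: {utf8_length(ch)} byte(s)")
```

Plain ASCII stays at one byte, the accented Latin letters here need two, and the em dash and the CJK character need three.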

John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL

Re: What's a good default encoding?

From

"Michael Schmidt"

Date:

15 March 2006, 19:18:44

Perhaps others can comment on encoding versus type of data. I would add that the manner in which data are accessed may also be a consideration. Specifically, UTF-8 is a good choice if one is going to use JDBC.

Michael Schmidt

Re: What's a good default encoding?

From

"Junaili Lie"

Date:

15 March 2006, 21:32:25

I am wondering if somebody here can tell me the difference between UTF-8 and SQL_ASCII, and whether there are any benefits to converting from SQL_ASCII to UTF-8. If so, under what circumstances would we want to convert to UTF-8?
Thanks,

Re: What's a good default encoding?

From

Tom Lane

Date:

15 March 2006, 23:02:25

"Junaili Lie" <junaili@gmail.com> writes:
> I am wondering if somebody here can tell me the difference between
> UTF-8 and SQL-ASCII, whether there are any benefits of converting
> SQL-ASCII to UTF-8?

SQL_ASCII isn't really an encoding; it's more like a declaration of
ignorance.  If the encoding is set to SQL_ASCII, the backend will store
any high-bit-on data you send it, and return it without any sort of
conversion.

If you are dealing with data beyond the 7-bit ASCII set, it's probably a
really bad idea to be using the SQL_ASCII setting, because the database
won't give you any help at all in checking for bad data or converting
between the encodings wanted by different client programs.  There are a
few situations where this is what you want, but I think most people are
better off picking a specific encoding.
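
The kind of checking that is missing under SQL_ASCII can be sketched on the client side. The following Python snippet is only an illustration under stated assumptions (the helper `is_valid_utf8` is hypothetical, and this is not what the backend runs): it shows that bytes written by a Latin-1 client are not valid UTF-8, and that decoding with the wrong assumption silently produces garbage instead of an error.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same word ("cafe" with an acute accent) as sent by two different clients.
latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'
utf8_bytes = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'

print(is_valid_utf8(latin1_bytes))  # False
print(is_valid_utf8(utf8_bytes))    # True

# Reading UTF-8 bytes as Latin-1 raises no error; it just yields two wrong characters.
misread = utf8_bytes.decode("latin-1")
print(ascii(misread))
```

With a specific database encoding, data like `latin1_bytes` would either be converted from the declared client encoding or rejected, rather than stored as-is.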

            regards, tom lane

Re: What's a good default encoding?

From

"Harald Armin Massa"

Date:

16 March 2006, 02:30:40

Good default encoding:

Does anybody NOT agree that UTF8 is a sound recommendation, at least for everyone who does not need Korean, Japanese, or Chinese characters? I know that's at most 2/3 of our potential user base, but it's better than nothing.

Maybe we could even "suggest" UTF8 in the "getting started" material (i.e. the Windows installer initdb screen, or other default installations). Something like "if you do not know better, take UTF8".

Harald

Re: What's a good default encoding?

From

Martijn van Oosterhout

Date:

16 March 2006, 04:37:06

On Thu, Mar 16, 2006 at 07:30:36AM +0100, Harald Armin Massa wrote:
> Good default encoding:
>
> does somebody NOT agree that UTF8 is quite a recommendation, at least for
> all the people without Korean, Japanese and Chinese Chars? I know, that's at
> maximum 2/3 of our potential user base, but better than nothing.

Umm, you should choose an encoding supported by your platform and the
locales you use. For example, UTF-8 is a bad choice on *BSD because
there is no collation support for UTF-8 on those platforms. On
Linux/Glibc UTF-8 is well supported but you need to make sure the
locale you initdb with is a UTF-8 locale. By and large postgres
correctly autodetects the encoding from the locale.
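
A quick way to check whether a locale name is a UTF-8 locale is to look at its codeset suffix. The sketch below is in Python and only inspects the name; the helpers `codeset_of` and `is_utf8_locale` are hypothetical, and this is not a description of how PostgreSQL performs its own detection.

```python
from typing import Optional


def codeset_of(locale_name: str) -> Optional[str]:
    """Extract the codeset from a POSIX-style locale name, if one is present.

    Locale names have the shape language[_territory][.codeset][@modifier].
    """
    name = locale_name.split("@", 1)[0]  # drop any @modifier
    if "." not in name:
        return None                      # e.g. "C" or "en_US"
    return name.split(".", 1)[1]


def is_utf8_locale(locale_name: str) -> bool:
    """Return True if the locale name declares a UTF-8 codeset."""
    codeset = codeset_of(locale_name)
    # Accept the common spellings "UTF-8", "utf8", "UTF8".
    return codeset is not None and codeset.lower().replace("-", "") == "utf8"


for name in ["en_US.UTF-8", "de_DE.utf8", "en_US.ISO-8859-1", "C"]:
    print(name, is_utf8_locale(name))
```

A name without a codeset, such as `C` or `en_US`, says nothing about the encoding by itself, so the sketch reports it as not being a UTF-8 locale.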

> Maybe we could even "suggest" UTF8 in the "getting started" (i.e. the
> windows installer initdb screen, or other default installations) Sth. like
> "if you do not know better, take utf8"

UTF-8 on windows works pretty well.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: What's a good default encoding?

From

"Magnus Hagander"

Date:

16 March 2006, 05:09:28

> > Maybe we could even "suggest" UTF8 in the "getting started" (i.e. the
> > windows installer initdb screen, or other default installations) Sth.
> > like "if you do not know better, take utf8"
>
> UTF-8 on windows works pretty well.

It does, but it has an extra speed penalty. For any comparison operation
(which includes sorting), the string must be converted to UTF-16, compared,
and then discarded. Win32 can't do native comparisons in UTF-8. Though I
haven't specifically measured the difference, I doubt it would be
unnoticeable. That is the main reason we didn't go with it as the
default for the 8.1 installer.

//Magnus

Re: What's a good default encoding?

From

CSN

Date:

16 March 2006, 07:11:29

I tried changing my database to UTF8 and then
importing the dump (even tried iconv). It choked (on
an accented e). Then somehow the database got created
as LATIN9, and I was able to import successfully. I
guess if it works, I'll be leaving it alone for the
time being.

I still have a problem with emdashes: they are stored in the
database as HTML entities and displayed as emdashes in a web
form, but after editing they get stored back in the database
wrong (as an accented A, IIRC). I dunno - maybe it's a browser
or Rails thing.

CSN

Re: What's a good default encoding?

From

Martijn van Oosterhout

Date:

16 March 2006, 07:24:31

On Thu, Mar 16, 2006 at 03:11:27AM -0800, CSN wrote:
> I tried changing my database to UTF8 and then
> importing the dump (even tried iconv). It choked (on
> an accented e). Then somehow the database got created
> as LATIN9, and I was able to import successfully. I
> guess if it works, I'll be leaving it alone for the
> time being.

Note, when you create a dump, pg_dump adds a "set client_encoding" at
the top of the dump. If you change the encoding using iconv without
changing that line, you'll get problems in the import. In theory
dumping from a latin9 database into a utf8 one should Just Work(tm)
because PostgreSQL will convert the data while loading.
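
If you do want to convert the file yourself, the encoding declaration has to change along with the bytes. Here is a minimal Python sketch of that idea. It assumes the dump starts with a line shaped like `SET client_encoding = 'LATIN9';`; the function `reencode_dump` and the sample dump text are made up for illustration, and Python's `iso8859_15` codec stands in for LATIN9.

```python
import re


def reencode_dump(dump: bytes, src_codec: str) -> bytes:
    """Re-encode dump bytes as UTF-8 and rewrite the client_encoding line to match."""
    text = dump.decode(src_codec)
    # Rewrite only the first SET client_encoding statement found at a line start.
    text = re.sub(
        r"(?im)^(SET\s+client_encoding\s*=\s*)'[^']*'",
        lambda m: m.group(1) + "'UTF8'",
        text,
        count=1,
    )
    return text.encode("utf-8")


# Made-up two-line dump, encoded as ISO 8859-15 (LATIN9).
sample = (
    "SET client_encoding = 'LATIN9';\n"
    "INSERT INTO t VALUES ('caf\u00e9');\n"
).encode("iso8859_15")

converted = reencode_dump(sample, "iso8859_15")
print(converted)
```

The simpler route is the one described above: leave both the file and its `client_encoding` line untouched, so the declaration still matches the bytes, and let the server do the conversion while loading.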

> I still have problems when emdashes are stored in the
> database as HTML entities, but they're displayed as
> emdashes in a web form, but then get stored back in
> the database wrong when edited (an accented A IIRC). I
> dunno - maybe it's a browser or Rails thing.

I think the data submitted by the browser arrives in a specific
character encoding and is not encoded as HTML entities. Converting
Unicode to HTML entities has to be done somewhere along the way...
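
The two representations, and one plausible way an em dash can end up looking like an accented A, can be sketched in Python. This is an assumption about the cause, not a diagnosis of the Rails application in question; the helper `to_entities` is hypothetical.

```python
import html

em_dash = "\u2014"

# A named or numeric entity in stored text decodes to the real character.
named = html.unescape("&mdash;")
numeric = html.unescape("&#8212;")


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_entities("a \u2014 b"))  # a &#8212; b

# If the UTF-8 bytes sent by a browser are decoded as Windows-1252, the single
# em dash becomes three characters, the first of which is an accented "a".
garbled = em_dash.encode("utf-8").decode("cp1252")
print(ascii(garbled))
```

So a form that displays the entity as a real em dash will submit the character itself, and the application has to either store it in the database encoding or convert it back to an entity explicitly.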

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/

Re: What's a good default encoding?

From

Vivek Khera

Date:

20 March 2006, 18:53:25

On Mar 16, 2006, at 3:36 AM, Martijn van Oosterhout wrote:

> Umm, you should choose an encoding supported by your platform and the
> locales you use. For example, UTF-8 is a bad choice on *BSD because
> there is no collation support for UTF-8 on those platforms. On
> Linux/Glibc UTF-8 is well supported but you need to make sure the

Shouldn't postgres be providing the collating routines for UTF8
anyhow?  How else can we guarantee identical behavior across platforms?

Re: What's a good default encoding?

From

Peter Eisentraut

Date:

20 March 2006, 19:04:06

Vivek Khera wrote:
> Shouldn't postgres be providing the collating routines for UTF8
> anyhow?

Start typing ...

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: What's a good default encoding?

From

Vivek Khera

Date:

20 March 2006, 19:07:16

On Mar 20, 2006, at 6:04 PM, Peter Eisentraut wrote:

> Vivek Khera wrote:
>> Shouldn't postgres be providing the collating routines for UTF8
>> anyhow?
>
> Start typing ...

So, if I use a UTF8 encoded DB on FreeBSD, all hell will break loose
or what?  Will things not compare correctly?  Where does the
code to do the collating come from, then?

Re: What's a good default encoding?

From

Tom Lane

Date:

20 March 2006, 19:18:31

Vivek Khera <vivek@khera.org> writes:
> Shouldn't postgres be providing the collating routines for UTF8
> anyhow?  How else can we guarantee identical behavior across platforms?

We don't make any such guarantee.

            regards, tom lane

Re: What's a good default encoding?

From

Martijn van Oosterhout

Date:

21 March 2006, 07:50:33

On Mon, Mar 20, 2006 at 06:07:16PM -0500, Vivek Khera wrote:
> So, if I use a UTF8 encoded DB on FreeBSD, all hell will break loose
> or what?  Will things not compare correctly?  Where does the
> code to do the collating come from, then?

It just won't collate properly. PostgreSQL collation is provided by the
underlying C library via strcoll(). FreeBSD simply doesn't support
UTF-8 collation. IIRC the UTF-8 collation code simply uses the ASCII
collation. It's an order, just not the order most people will be
expecting.
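
To see what "an order, just not the expected one" looks like, here is a small Python sketch. It uses plain byte-by-byte ordering of the UTF-8 data as a stand-in for the fallback behaviour; that stand-in is an assumption for illustration, and the word list is made up.

```python
# Made-up word list: lowercase, capitalised, and one word starting with e-acute.
words = ["zebra", "Zulu", "\u00e9cole", "apple", "eagle"]

# Sort by the raw UTF-8 bytes, the way a byte-wise comparison would.
byte_order = sorted(words, key=lambda w: w.encode("utf-8"))

# ascii() keeps the output printable on any terminal.
print([ascii(w) for w in byte_order])
```

The capitalised word sorts before all lowercase ones, and the word starting with an accented e lands after "zebra" instead of next to "eagle". A locale-aware sort (in Python, `locale.strxfrm` as the sort key after `locale.setlocale`) can give a dictionary-like order, but only if the platform actually provides collation for that locale, which is exactly the point being made here.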

If you look at the collation code in FreeBSD you'll see it doesn't work
for any multibyte encoding. That's OK, it's obviously not important to
FreeBSD users.

But I'm adamantly against building and maintaining a special UTF-8
collation library just for PostgreSQL. That's just reinventing the
wheel. There already exist cross-platform libraries to handle collation
and we should work towards allowing people to use one of those...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/

Re: What's a good default encoding?

From

"Merlin Moncure"

Date:

21 March 2006, 09:54:37

Is there any downside to using the C locale with UTF-8 encoding (on
linux)?  I need things to run quickly and proper sort order is not
critically important (but storage of international characters is).

Merlin

Re: What's a good default encoding?

From

Martijn van Oosterhout

Date:

21 March 2006, 10:11:43

On Tue, Mar 21, 2006 at 08:54:36AM -0500, Merlin Moncure wrote:
> Is there any downside to using the C locale with UTF-8 encoding (on
> linux)?  I need things to run quickly and proper sort order is not
> critically important (but storage of international characters is).

Sure, why not. C locale is the fastest and if the sort order is close
enough, go for it.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/

Re: What's a good default encoding?

From

Vivek Khera

Date:

21 March 2006, 11:10:53

On Mar 21, 2006, at 6:50 AM, Martijn van Oosterhout wrote:

> But I'm adamantly against building and maintaining a special UTF-8
> collation library just for PostgreSQL. That's just reinventing the
> wheel. There already exist cross-platform libraries to handle
> collation
> and we should work towards allowing people to use one of those...

This would be a Good Thing[tm].  I'm starting an investigation into
this now...  thanks for the info.

Re: What's a good default encoding?

From

Martijn van Oosterhout

Date:

21 March 2006, 11:45:34

On Tue, Mar 21, 2006 at 10:10:53AM -0500, Vivek Khera wrote:
>
> On Mar 21, 2006, at 6:50 AM, Martijn van Oosterhout wrote:
>
> >But I'm adamantly against building and maintaining a special UTF-8
> >collation library just for PostgreSQL. That's just reinventing the
> >wheel. There already exist cross-platform libraries to handle
> >collation
> >and we should work towards allowing people to use one of those...
>
> This would be a Good Thing[tm].  I'm starting an investigation into
> this now...  thanks for the info.

For quite a while now there have been patches floating around allowing
PostgreSQL to use ICU for collations. I would start there...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/