Thread: What's a good default encoding?
If you're going to be putting emdashes, letters with lines and circles above them, and similar stuff that's mostly European and American in a database, what's a good default encoding to use - UTF-8? CSN __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
On Mar 14, 2006, at 3:10 PM, CSN wrote: > If you're going to be putting emdashes, letters with > lines and circles above them, and similar stuff that's > mostly European and American in a database, what's a > good default encoding to use - UTF-8? Yes, UTF-8 is good because it can represent every possible character and is efficient with respect to latin character sets. John DeSoi, Ph.D. http://pgedit.com/ Power Tools for PostgreSQL
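To make the efficiency point concrete, here is a small sketch (not from the thread) that counts how many bytes a few of the characters mentioned above take in UTF-8, compared with a fixed-width encoding such as UTF-32. The sample characters and the helper name `utf8_len` are chosen purely for illustration.

```python
# Characters are written as escapes so the source file's own encoding
# cannot change the result.
samples = {
    "a": "plain ASCII letter",
    "\u00e9": "e with acute accent",
    "\u00e5": "a with ring above",
    "\u2014": "em dash",
}


def utf8_len(ch: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for ch, desc in samples.items():
    utf32 = len(ch.encode("utf-32-le"))  # always 4 bytes per code point
    print(f"{desc}: {utf8_len(ch)} byte(s) in UTF-8, {utf32} in UTF-32")
```

ASCII stays at one byte, accented Latin letters take two, and punctuation such as the em dash takes three, which is why mostly-Latin text stays compact in UTF-8.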
Perhaps others can comment on encoding versus type of data. I would add that the manner in which data are accessed may also be a consideration. Specifically, UTF-8 is a good choice if one is going to use JDBC.
Michael Schmidt
I am wondering if somebody here can tell me the difference between UTF-8 and SQL-ASCII, whether there are any benefits of converting SQL-ASCII to UTF-8? If so, under what circumstances do we want to convert to UTF-8?
Thanks,
On 3/15/06, Michael Schmidt <michaelmschmidt@msn.com> wrote:
> Perhaps others can comment on encoding versus type of data. I would add that the manner in which data are accessed may also be a consideration. Specifically, UTF-8 is a good choice if one is going to use JDBC. > Michael Schmidt
"Junaili Lie" <junaili@gmail.com> writes: > I am wondering if somebody here can tell me the difference between > UTF-8 and SQL-ASCII, whether there are any benefits of converting > SQL-ASCII to UTF-8? SQL_ASCII isn't really an encoding; it's more like a declaration of ignorance. If the encoding is set to SQL_ASCII, the backend will store any high-bit-on data you send it, and return it without any sort of conversion. If you are dealing with data beyond the 7-bit ASCII set, it's probably a really bad idea to be using the SQL_ASCII setting, because the database won't give you any help at all in checking for bad data or converting between the encodings wanted by different client programs. There are a few situations where this is what you want, but I think most people are better off picking a specific encoding. regards, tom lane
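The difference described above can be modelled outside the database. The following is only a toy sketch: `store_unchecked` and `store_validated` are hypothetical stand-ins for "a backend that passes bytes through untouched" and "a backend that checks input against a declared encoding". They are not PostgreSQL functions.

```python
raw = b"caf\xe9"  # the bytes a LATIN1 client would send for "cafe" with an accented e


def store_unchecked(data: bytes) -> bytes:
    """SQL_ASCII-like behaviour: bytes in, the same bytes out, no checking."""
    return data


def store_validated(data: bytes, encoding: str = "utf-8") -> bytes:
    """Declared-encoding behaviour: reject input that is not valid in that encoding."""
    data.decode(encoding)  # raises UnicodeDecodeError on invalid input
    return data


# The unchecked store happily returns the bytes; what they mean depends
# entirely on what the reading client assumes.
stored = store_unchecked(raw)
print(stored.decode("latin-1"))  # a LATIN1 client sees the intended word

try:
    stored.decode("utf-8")  # a UTF-8 client cannot interpret the same bytes
except UnicodeDecodeError as exc:
    print("UTF-8 client fails:", exc.reason)

# The validating store refuses the bad input up front instead.
try:
    store_validated(raw)
except UnicodeDecodeError:
    print("rejected at insert time")
```

With the unchecked store the problem only surfaces later, in whichever client happens to assume a different encoding; with a declared encoding it surfaces when the bad data is sent.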
Good default encoding:
Does anybody NOT agree that UTF8 is a good recommendation, at least for everyone who does not need Korean, Japanese, or Chinese characters? I know that's at most 2/3 of our potential user base, but better than nothing.
Maybe we could even "suggest" UTF8 in the "getting started" material (e.g. the Windows installer initdb screen, or other default installations), something like "if you do not know better, take UTF8".
Harald
--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Reinsburgstraße 202b
70197 Stuttgart
0173/9409607
-
When I visit a mosque, I show my respect by taking off my shoes. I follow the customs, just as I do in a church, synagogue or other holy place. But if a believer demands that I, as a nonbeliever, observe his taboos in the public domain, he is not asking for my respect, but for my submission. And that is incompatible with a secular democracy.
On Thu, Mar 16, 2006 at 07:30:36AM +0100, Harald Armin Massa wrote: > Good default encoding: > > does somebody NOT agree that UTF8 is quite a recommendation, at least for > all the people without Korean, Japanese and Chinese Chars? I know, that's at > maximum 2/3 of our potential user base, but better then nothing. Umm, you should choose an encoding supported by your platform and the locales you use. For example, UTF-8 is a bad choice on *BSD because there is no collation support for UTF-8 on those platforms. On Linux/Glibc UTF-8 is well supported but you need to make sure the locale you initdb with is a UTF-8 locale. By and large postgres correctly autodetects the encoding from the locale. > Maybe we could even "suggest" UTF8 in the "getting started" (i.e. the > windows installer initdb screen, or other default installations) Sth. like > "if you do not know better, take utf8" UTF-8 on windows works pretty well. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
> > Maybe we could even "suggest" UTF8 in the "getting started" (i.e. the > > windows installer initdb screen, or other default installations) Sth. > > like "if you do not know better, take utf8" > > UTF-8 on windows works pretty well. It does, but it has an extra speed penalty. For any comparison operation (which includes sorting), the string must be converted to UTF-16, compared, and the converted copy discarded. Win32 can't do native comparisons in UTF-8. Though I haven't specifically measured the difference, I doubt it would be unnoticeable. That is the main reason we didn't go with it as the default for the 8.1 installer. //Magnus
I tried changing my database to UTF8 and then importing the dump (even tried iconv). It choked (on an accented e). Then somehow the database got created as LATIN9, and I was able to import successfully. I guess if it works, I'll be leaving it alone for the time being. I still have problems when emdashes are stored in the database as HTML entities, but they're displayed as emdashes in a web form, but then get stored back in the database wrong when edited (an accented A IIRC). I dunno - maybe it's a browser or Rails thing. CSN
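One plausible explanation for the "accented A" symptom, offered as a guess and not something confirmed in the thread: the browser submits the edited em dash as UTF-8 bytes, and something on the server side reads those bytes as a single-byte Latin encoding. The sketch below reproduces that effect, assuming Windows-1252 as the misapplied encoding.

```python
import html

# What is stored: an HTML entity. What the browser shows in the form: a real em dash.
displayed = html.unescape("&mdash;")

# What a UTF-8 form submission sends back for that character.
submitted = displayed.encode("utf-8")

# What results if those bytes are interpreted as a single-byte encoding
# (cp1252 is an assumption here) instead of UTF-8.
misread = submitted.decode("cp1252")

print(submitted)      # three bytes for one character
print(repr(misread))  # three characters, the first being a-circumflex
```

If this is what is happening, the fix is to make the page encoding, the application's assumed request encoding, and the database client encoding agree with one another.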
On Thu, Mar 16, 2006 at 03:11:27AM -0800, CSN wrote: > I tried changing my database to UTF8 and then > importing the dump (even tried iconv). It choked (on > an accented e). Then somehow the database got created > as LATIN9, and I was able to import successfully. I > guess if it works, I'll be leaving it alone for the > time being. Note, when you create a dump, pg_dump adds a "set client_encoding" at the top of the dump. If you change the encoding using iconv without changing that line, you'll get problems in the import. In theory dumping from a latin9 database into a utf8 one should Just Work(tm) because PostgreSQL will convert the data while loading. > I still have problems when emdashes are stored in the > database as HTML entities, but they're displayed as > emdashes in a web form, but then get stored back in > the database wrong when edited (an accented A IIRC). I > dunno - maybe it's a browser or Rails thing. I think the stuff submitted by the browser has a given encoding and won't be encoded with HTML entities. Converting unicode to HTML entities has to be done somewhere there... Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
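The iconv pitfall described above comes down to keeping the data and the declared client encoding in step. A minimal sketch of doing both at once follows; it assumes the dump contains a line of the form `SET client_encoding = 'LATIN9';`, and the function name `reencode_dump` and its parameters are hypothetical. Simply loading the unmodified dump and letting PostgreSQL convert, as suggested above, avoids the need for any of this.

```python
import re

# Matches the encoding declaration assumed to sit near the top of the dump.
_CLIENT_ENCODING = re.compile(r"^(SET client_encoding = ')[^']*(';)", re.MULTILINE)


def reencode_dump(dump: bytes, python_src: str,
                  pg_dst: str = "UTF8", python_dst: str = "utf-8") -> bytes:
    """Re-encode dump bytes and update the declared client encoding to match.

    python_src / python_dst are Python codec names; pg_dst is the name
    written into the SET client_encoding line.
    """
    text = dump.decode(python_src)
    # Rewrite only the first declaration; a lambda avoids any backslash
    # interpretation of the replacement string.
    text = _CLIENT_ENCODING.sub(
        lambda m: m.group(1) + pg_dst + m.group(2), text, count=1
    )
    return text.encode(python_dst)


latin9_dump = b"SET client_encoding = 'LATIN9';\nINSERT INTO t VALUES ('caf\xe9');\n"
utf8_dump = reencode_dump(latin9_dump, "iso8859_15")
print(utf8_dump.decode("utf-8"))
```

Converting the bytes without touching the declaration (what a bare iconv run does) leaves UTF-8 data labelled as LATIN9, which is the mismatch that breaks the import.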
On Mar 16, 2006, at 3:36 AM, Martijn van Oosterhout wrote: > Umm, you should choose an encoding supported by your platform and the > locales you use. For example, UTF-8 is a bad choice on *BSD because > there is no collation support for UTF-8 on those platforms. On > Linux/Glibc UTF-8 is well supported but you need to make sure the Shouldn't postgres be providing the collating routines for UTF8 anyhow? How else can we guarantee identical behavior across platforms?
Vivek Khera wrote: > Shouldn't postgres be providing the collating routines for UTF8 > anyhow? Start typing ... -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Mar 20, 2006, at 6:04 PM, Peter Eisentraut wrote: > Vivek Khera wrote: >> Shouldn't postgres be providing the collating routines for UTF8 >> anyhow? > > Start typing ... So, if I use a UTF8 encoded DB on FreeBSD, all hell will break loose or what? Will things not compare correctly? Where from does the code to do the collating come, then?
Vivek Khera <vivek@khera.org> writes: > Shouldn't postgres be providing the collating routines for UTF8 > anyhow? How else can we guarantee identical behavior across platforms? We don't make any such guarantee. regards, tom lane
On Mon, Mar 20, 2006 at 06:07:16PM -0500, Vivek Khera wrote: > So, if I use a UTF8 encoded DB on FreeBSD, all hell will break loose > or what? Will things not compare correctly? Where from does the > code to do the collating come, then? It just won't collate properly. PostgreSQL collation is provided by the underlying C library via strcoll(). FreeBSD simply doesn't support UTF-8 collation. IIRC the UTF-8 collation code simply uses the ASCII collation. It's an order, just not the order most people will be expecting. If you look at the collation code in FreeBSD you'll see it doesn't work for any multibyte encoding. That's OK, it's obviously not important to FreeBSD users. But I'm adamantly against building and maintaining a special UTF-8 collation library just for PostgreSQL. That's just reinventing the wheel. There already exist cross-platform libraries to handle collation and we should work towards allowing people to use one of those... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
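To illustrate "an order, just not the order most people will be expecting", the sketch below sorts a few words by their raw UTF-8 bytes and, where the platform offers it, by a locale-aware collation obtained from the C library. The word list, the helper names, and the locale name `de_DE.UTF-8` are assumptions for illustration; the locale may not be installed, in which case the second function returns None.

```python
import locale

words = ["zebra", "\u00c4pfel", "apple", "Zoo"]


def byte_order(items):
    """Sort by raw UTF-8 bytes, roughly what a collation-less platform gives."""
    return sorted(items, key=lambda s: s.encode("utf-8"))


def locale_order(items, name="de_DE.UTF-8"):
    """Sort using the C library's collation for the named locale, if available."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None  # locale not installed on this system
    try:
        return sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


print("byte order:  ", byte_order(words))
print("locale order:", locale_order(words))
```

In byte order every uppercase ASCII letter sorts before every lowercase one, and the word starting with A-umlaut lands at the very end instead of near "apple".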
Is there any downside to using the C locale with UTF-8 encoding (on linux)? I need things to run quickly and proper sort order is not critically important (but storage of international characters is). Merlin
On Tue, Mar 21, 2006 at 08:54:36AM -0500, Merlin Moncure wrote: > Is there any downside to using the C locale with UTF-8 encoding (on > linux)? I need things to run quickly and proper sort order is not > critically important (but storage of international characters is). Sure, why not. C locale is the fastest and if the sort order is close enough, go for it. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
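What "close enough" means here can be pinned down a little. Assuming that the C locale compares strings byte by byte, a useful property of UTF-8 is that byte-wise order is identical to Unicode code point order, so the result is at least deterministic and the same on every platform. The sketch below checks that property on randomly generated strings; the alphabet and helper name are arbitrary choices for illustration.

```python
import random


def utf8_byte_key(s: str) -> bytes:
    """Sort key equivalent to a byte-wise comparison of the UTF-8 encoding."""
    return s.encode("utf-8")


random.seed(0)  # fixed seed so the check is repeatable

# A mix of 1-, 2-, 3- and 4-byte characters.
alphabet = "aZ\u00e9\u00c5\u2014\u4e2d\U0001f600"
strings = [
    "".join(random.choice(alphabet) for _ in range(random.randint(0, 4)))
    for _ in range(200)
]

# Python compares str objects by code point, so this compares
# code point order against UTF-8 byte order.
by_code_point = sorted(strings)
by_utf8_bytes = sorted(strings, key=utf8_byte_key)
print(by_code_point == by_utf8_bytes)

# Storage is unaffected by the choice of collation: every string round-trips.
print(all(s.encode("utf-8").decode("utf-8") == s for s in strings))
```

The trade-off is the one already stated: international characters are stored correctly, but they sort by code point, not by any language's dictionary rules.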
On Mar 21, 2006, at 6:50 AM, Martijn van Oosterhout wrote: > But I'm adamantly against building and maintaining a special UTF-8 > collation library just for PostgreSQL. That's just reinventing the > wheel. There already exist cross-platform libraries to handle > collation and we should work towards allowing people to use one of > those... This would be a Good Thing[tm]. I'm starting an investigation into this now... thanks for the info.
On Tue, Mar 21, 2006 at 10:10:53AM -0500, Vivek Khera wrote: > On Mar 21, 2006, at 6:50 AM, Martijn van Oosterhout wrote: > > But I'm adamantly against building and maintaining a special UTF-8 > > collation library just for PostgreSQL. That's just reinventing the > > wheel. There already exist cross-platform libraries to handle > > collation and we should work towards allowing people to use one of > > those... > > This would be a Good Thing[tm]. I'm starting an investigation into > this now... thanks for the info. For quite a while now there have been patches floating around allowing PostgreSQL to use ICU for collations. I would start there... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/