Thread: Encoding, Unicode, locales, etc.
Hi, Even though I *think* I have a fairly clear understanding of encoding and locale principles, I'm somewhat unsure of how some of the tools available with PostgreSQL apply (or rather, how should they be used). 1) The way I understand it, encoding (character set) and locale are two different things. Yet, I see that initdb allows me to specify the default encoding (I can override it when creating the database) for the databases that I create later, and also the locale(s). Why is it that the database cluster is resrticted to a single locale (or single set of locales) instead of being configurable on a per-database basis? 2) On the same token (more or less), I have a test database, for which I ran initdb without specifying encoding or locale; then, I create a database with UTF8 encoding. Then, from a psql console on the same Linux machine that is running the server, I try lower of a string that contains characters with accents (e.g., Spanish or French characters), and it works as it should according to Spanish or French rules --- it returns a string with the same characters in lowecase, with the same accent. Why did that work? My Linux machine has all en_US.UTF-8 locales, and en_US is not even aware of characters with accents, so it doesn't seem like it's taking by default the encoding from the OS. (is it simply that the case is "too obvious" so by default case conversion does the "obvious" thing?) I have several other details in which I'm not too clear, but perhaps with any responses or pointers that I might get for the above, it would clarify the whole confusion? BTW, I did read the online PG documentation --- the section localization; the thing is, with everything that I read in there, my reaction was more or less "ok, I knew that"; that is, it states facts for which I know (or at least I think I know) the "theory", but it did not clarify how to use the given tools. Thanks, Carlos --
Carlos Moreno <moreno_pg@mochima.com> writes: > Why is it that the database > cluster is resrticted to a single locale (or single set of locales) instead > of being configurable on a per-database basis? Because we depend on libc's locale support, which (on many platforms) isn't designed to switch between locales cheaply. The fact that we allow a per-database encoding spec at all was probably a bad idea in hindsight --- it's out front of what the code can really deal with. My recollection is that the Japanese contingent argued for it on the grounds that they needed to deal with multiple encodings and didn't care about encoding/locale mismatch because they were going to use C locale anyway. For everybody else though, it's a gotcha waiting to happen. This stuff is certainly far from ideal, but the amount of work involved to fix it is daunting; see many past pg-hackers discussions. > 2) On the same token (more or less), I have a test database, for which > I ran initdb without specifying encoding or locale; then, I create a > database with UTF8 encoding. There's no such thing as "you didn't specify a locale". If you didn't specify one on the initdb command line, then it was taken from the environment. Try "show lc_collate" and "show lc_ctype" to see what got used. > I try lower of a string that > contains characters with accents (e.g., Spanish or French characters), > and it works as it should according to Spanish or French rules --- it > returns a string with the same characters in lowecase, with the same > accent. Why did that work? My Linux machine has all en_US.UTF-8 > locales, and en_US is not even aware of characters with accents, You sure? I'd sort of expect a UTF8 locale to know this stuff anyway. In any case, Postgres doesn't know anything about case conversion beyond what toupper/tolower tell it, so your experimental result is sufficient proof that that locale includes these conversions. regards, tom lane
On Tue, Oct 31, 2006 at 11:47:56PM -0500, Tom Lane wrote: > Because we depend on libc's locale support, which (on many platforms) > isn't designed to switch between locales cheaply. The fact that we > allow a per-database encoding spec at all was probably a bad idea in > hindsight --- it's out front of what the code can really deal with. > My recollection is that the Japanese contingent argued for it on the > grounds that they needed to deal with multiple encodings and didn't > care about encoding/locale mismatch because they were going to use > C locale anyway. For everybody else though, it's a gotcha waiting > to happen. Could this paragraph be put into the docs and/or the FAQ, please ? Along with the recommendation that if you require multiple encodings for your databases you better had your OS locale configured properly for UTF8 and use UNICODE databases or do initdb with the C-locale. > This stuff is certainly far from ideal, but the amount of work involved > to fix it is daunting; see many past pg-hackers discussions. Here are a few data points from my Debian/Testing system in favour of not worrying too much about installed ICU size as it is being used by other packages anyways: libicu36 Reverse Depends: openoffice.org-writer * OOo openoffice.org-filter-so52 openoffice.org-core libxerces27 * Xerces XML parser (Apache camp) libboost-regex1.33.1 libboost-dbg icu Reverse Depends: libicu36 libicu36 libxercesicu26 * Xerces, again libxercesicu25 libicu28-dev libicu28 libicu21c102 icu-i18ndata icu-data libwine * Wine This, of course, does not decrease the work required to get this going in PostgreSQL. Thanks for the great work, Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Thanks Tom, for your reply. Tom Lane wrote: >Carlos Moreno <moreno_pg@mochima.com> writes: > > >>Why is it that the database >>cluster is resrticted to a single locale (or single set of locales) instead >>of being configurable on a per-database basis? >> >> > >Because we depend on libc's locale support, which (on many platforms) >isn't designed to switch between locales cheaply [...] > >This stuff is certainly far from ideal, but the amount of work involved >to fix it is daunting; see many past pg-hackers discussions. > > Fair enough --- and good to know. >>2) On the same token (more or less), I have a test database, for which >>I ran initdb without specifying encoding or locale; then, I create a >>database with UTF8 encoding. >> >> > >There's no such thing as "you didn't specify a locale". If you didn't >specify one on the initdb command line, then it was taken from the >environment. Try "show lc_collate" and "show lc_ctype" to see what >got used. > > Yes, that's what I meant --- I meant that I did not use the --locale or -E command- line switches for the initdb command. Both lc_ctype and lc_collate show en_US.UTF-8 >>I try lower of a string that >>contains characters with accents (e.g., Spanish or French characters), >>and it works as it should according to Spanish or French rules --- it >>returns a string with the same characters in lowecase, with the same >>accent. Why did that work? My Linux machine has all en_US.UTF-8 >>locales, and en_US is not even aware of characters with accents, >> >> > >You sure? I'd sort of expect a UTF8 locale to know this stuff anyway. >In any case, Postgres doesn't know anything about case conversion >beyond what toupper/tolower tell it, so your experimental result is >sufficient proof that that locale includes these conversions. > > Are you sure there's nothing about the way PostgreSQL interacts with C conversion functions? I ask because, as part of a "sanity check", I repeated the tests --- now with two machines; one that has PG 8.1.4, and the other one has 7.4.14, and they behave differently. The one that does the case conversion "correctly" (read: as I expect it as per Spanish or French rules) is 8.1.4 with en_US locale (LC_CTYPE and LC_COLLATE both showing en_US.UTF-8). PG 7.4.14, *even with locale es_ES*, does not do the case conversion (characters with accent or tilde are left untouched). I wonder if someone could shed some light on this little mystery....??? Perhaps to add more confusion to my experimental/informal tests, PG 8.1.4 is running on a FC4 AMD64 X2 box (the command "locale" at the shell prompt shows all en_US.utf8), and PG 7.4.14 is running on a laptop with FC5 on an Intel Celeron M (the command locale shows exactly the same in that case). Does this perhaps account for the difference? Thanks, Carlos --
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes: > Could this paragraph be put into the docs and/or the FAQ, > please ? Section 21.2. Character Set Support already has something about it: Important: Although you can specify any encoding you want for a database, it is unwise to choose an encoding that is not what is expected by the locale you have selected. The LC_COLLATE and LC_CTYPE settings imply a particular encoding, and locale-dependent operations (such as sorting) are likely to misinterpret data that is in an incompatible encoding. Since these locale settings are frozen by initdb, the apparent flexibility to use different encodings in different databases of a cluster is more theoretical than real. It is likely that these mechanisms will be revisited in future versions of PostgreSQL. One way to use multiple encodings safely is to set the locale to C or POSIX during initdb, thus disabling any real locale awareness. regards, tom lane
On Wed, Nov 01, 2006 at 11:41:43AM +0100, Karsten Hilbert wrote: > Could this paragraph be put into the docs and/or the FAQ, > please ? Along with the recommendation that if you require > multiple encodings for your databases you better had your OS > locale configured properly for UTF8 and use UNICODE > databases or do initdb with the C-locale. Err, multiple encodings don't work full-stop. Any particular locale (as defined by POSIX) is only really designed to work with one encoding. The fact that the C locale produces an order when sorting UTF8 text is really just luck. In hindsight the people in POSIX who decided to tie locale and encoding into one variable should probably be shot, but it's a bit late now. > > This stuff is certainly far from ideal, but the amount of work involved > > to fix it is daunting; see many past pg-hackers discussions. > > Here are a few data points from my Debian/Testing system in > favour of not worrying too much about installed ICU size as > it is being used by other packages anyways: We'd need a suitable patch first before we start worrying about that. I think diskspace is less of an issue now. There are discussions going on about having the clog and the xlog taking dozens of megabytes. At the end of the day I don't think 10MB for the Unicode data it going to be that big a deal, *if* the patch solves all the problems in this area in a reasonably clean way... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Attachment
On Wed, Nov 01, 2006 at 08:50:30PM +0100, Martijn van Oosterhout wrote: > > Could this paragraph be put into the docs and/or the FAQ, > > please ? Along with the recommendation that if you require > > multiple encodings for your databases you better had your OS > > locale configured properly for UTF8 and use UNICODE > > databases or do initdb with the C-locale. > > Err, multiple encodings don't work full-stop. Well, yes, I was thinking of multiple client encodings which can be supported either via a C-locale-initdb with the databases set to the encoding you require (but sorting/etc won't work, I know) or by doing a unicode-initdb and using unicode databases. In each case the client encodings can be "multiple" ones - as long as conversion is possible. Sorting etc may still be wrong, but at least the proper characters are going in and coming back. > Any particular locale (as > defined by POSIX) is only really designed to work with one encoding. Sure. What I meant is that if you have a unicode database you can use several client encodings and get back the properly encoded characters. > The fact that the C locale produces an order when sorting UTF8 text is > really just luck. Yes. > > Here are a few data points from my Debian/Testing system in > > favour of not worrying too much about installed ICU size as > > it is being used by other packages anyways: > > We'd need a suitable patch first before we start worrying about that. I > think diskspace is less of an issue now. Well, size did come up in a "recent" discussion so I thought I'd mention the above facts. Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Carlos Moreno <moreno_pg@mochima.com> writes: > ... The one that does the case conversion "correctly" (read: as I > expect it as per Spanish or French rules) is 8.1.4 with en_US locale > (LC_CTYPE and LC_COLLATE both showing en_US.UTF-8). PG 7.4.14, *even > with locale es_ES*, does not do the case conversion (characters with > accent or tilde are left untouched). IIRC, 7.4 has no chance of doing upper/lower sanely with multibyte UTF8 characters, because it only knows about the old-line toupper/tolower <ctype.h> functions. 8.0 and up know about <wctype.h> and can do this as you'd expect. See the CVS history at http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/oracle_compat.c regards, tom lane