Thread: Bug #3924: Create Database with another encoding as the encoding from postgres
Hi, I just run into this bug (http://archives.postgresql.org/pgsql-bugs/2008-02/msg00010.php). In my case, Roundup (Python bug tracker) is trying to CREATE DATABASE WITH ENCODING='UNICODE' in an 8.3.4 cluster created with LATIN1 encoding, and it gets the error: encoding UTF8 does not match server's locale en_US DETAIL: The server's LC_CTYPE setting requires encoding LATIN1. Same problem if I try createdb -E UTF8. Heikki's response to the OP referred him to the 8.3 release notes: Disallow database encodings that are inconsistent with the server's locale setting (Tom) On most platforms, C locale is the only locale that will work with any database encoding. Other locale settings imply a specific encoding and will misbehave if the database encoding is something different. (Typical symptoms include bogus textual sort order and wrong results from upper() or lower().) The server now rejects attempts to create databases that have an incompatible encoding. I can understand that creating a LATIN2, EUC_CN or some other encoding in a LATIN1 cluster may cause problems, but don't quite see how UTF-8 can be a problem since in essence it includes the others. Perhaps someone can provide further explanation? Assuming this situation will remain as is in the immediate future, what are my options? The 8.3 cluster has a LATIN1 database that must stay that way (at least for a while). I also have an 8.2 LATIN1 cluster where SQL_ASCII, LATIN1 and UTF8 appear to coexist happily but was in the process of migrating the remaining db's to 8.3. Do I need to create another 8.3 cluster, this time with UTF8 encoding? Will that be allowed if my host LC_CTYPE remains "en_US" (the machine has both en_US.iso88591 and en_US.utf8 installed but operates mostly with the former)? As an aside, how can one find the status of a Postgres bug? Joe
Re: Bug #3924: Create Database with another encoding as the encoding from postgres
From
Tom Lane
Date:
Joe <dev@freedomcircle.net> writes: > I can understand that creating a LATIN2, EUC_CN or some other encoding > in a LATIN1 cluster may cause problems, but don't quite see how UTF-8 > can be a problem since in essence it includes the others. Perhaps > someone can provide further explanation? The locale setting implies a specific encoding. UTF8 may contain all the same characters that, say, LATIN1 does, but it's not anywhere near representationally the same, and locale-dependent functions will do the wrong thing if they are fed UTF8 when they are expecting LATIN1. > Assuming this situation will remain as is in the immediate future, what > are my options? The 8.3 cluster has a LATIN1 database that must stay > that way (at least for a while). Why does it have to be LATIN1? If your answer is "my client code deals in LATIN1", just set client_encoding = LATIN1. You could do that with ALTER DATABASE SET or possibly ALTER USER SET so that it's transparent to the clients. regards, tom lane
Hi Tom, Tom Lane wrote: > Why does it have to be LATIN1? If your answer is "my client code deals > in LATIN1", just set client_encoding = LATIN1. You could do that with > ALTER DATABASE SET or possibly ALTER USER SET so that it's transparent > to the clients. > The client code is primarily encoding-agnostic at this time. However, the database was just migrated from SQL_ASCII after undergoing a text-column "cleanup" (like removing Win-1252 funny quotes) so I'm not eager to migrate to UTF8 right away, particularly when we're not convinced we need it (although it seems eventually we'll do that). Going back to my options, are you saying that if I re-initdb the 8.3 cluster to UTF8, restore the LATIN1 db as UTF8, and then do "set client_encoding = LATIN1" in the application code, then everything will work fine, even if the machine locale remains en_US.iso88591? And, at least in theory, if a non-LATIN1 character (like 0x92) is presented to the converted database, will it be stored as is or silently transformed (or will an error be issued)? Joe
Re: Bug #3924: Create Database with another encoding as the encoding from postgres
From
Tom Lane
Date:
Joe <dev@freedomcircle.net> writes: > Going back to my options, are you saying that if I re-initdb the 8.3 > cluster to UTF8, restore the LATIN1 db as UTF8, and then do "set > client_encoding = LATIN1" in the application code, then everything will > work fine, even if the machine locale remains en_US.iso88591? No, you'd have to set the database's locale to en_US.utf8 to do that. Plan B would be to set the database locale to C, which is encoding-agnostic and hence allows different databases to have different encodings. Do you actually need en_US sort order? regards, tom lane
Tom Lane wrote: > Plan B would be to set the database locale to C, which is > encoding-agnostic and hence allows different databases to have different > encodings. Do you actually need en_US sort order? > I wasn't aware of the difference in sort orders until about two months ago when i had to compare the output of the production db which was in a C locale with SQL_ASCII encoding (now in C with LATIN1) with the development db with en_US.iso88591/LATIN1. For our purposes, the en_US sort order seems much more reasonable, since for example, in a list by titles, the article "Meet The New Boss, Same as the Old Boss" (with the quotes being part of the title) sorts among the M titles, and not at the beginning, before the A titles. As I understand, 8.4 will include LC_COLLATE support at the database rather than cluster level which should help in this regard. Joe