Thread: Bug #3924: Create Database with another encoding as the encoding from postgres

Hi,

I just run into this bug
(http://archives.postgresql.org/pgsql-bugs/2008-02/msg00010.php).  In my
case, Roundup (Python bug tracker) is trying to CREATE DATABASE WITH
ENCODING='UNICODE' in an 8.3.4 cluster created with LATIN1 encoding, and
it gets the error:

encoding UTF8 does not match server's locale en_US
DETAIL:  The server's LC_CTYPE setting requires encoding LATIN1.

Same problem if I try createdb -E UTF8.

Heikki's response to the OP referred him to the 8.3 release notes:

    Disallow database encodings that are inconsistent with the server's locale setting (Tom)



    On most platforms, C locale is the only locale that will work with
    any database encoding. Other locale settings imply a specific
    encoding and will misbehave if the database encoding is something
    different. (Typical symptoms include bogus textual sort order and
    wrong results from upper() or lower().) The server now rejects
    attempts to create databases that have an incompatible encoding.



I can understand that creating a LATIN2, EUC_CN or some other encoding
in a LATIN1 cluster may cause problems, but don't quite see how UTF-8
can be a problem since in essence it includes the others.  Perhaps
someone can provide further explanation?

Assuming this situation will remain as is in the immediate future, what
are my options?  The 8.3 cluster has a LATIN1 database that must stay
that way (at least for a while).  I also have an 8.2 LATIN1 cluster
where SQL_ASCII, LATIN1 and UTF8 appear to coexist happily but was in
the process of migrating the remaining db's to 8.3.  Do I need to create
another 8.3 cluster, this time with UTF8 encoding?  Will that be allowed
if my host LC_CTYPE remains "en_US" (the machine has both en_US.iso88591
and en_US.utf8 installed but operates mostly with the former)?

As an aside, how can one find the status of a Postgres bug?

Joe
Joe <dev@freedomcircle.net> writes:
> I can understand that creating a LATIN2, EUC_CN or some other encoding
> in a LATIN1 cluster may cause problems, but don't quite see how UTF-8
> can be a problem since in essence it includes the others.  Perhaps
> someone can provide further explanation?

The locale setting implies a specific encoding.  UTF8 may contain all
the same characters that, say, LATIN1 does, but it's not anywhere near
representationally the same, and locale-dependent functions will do the
wrong thing if they are fed UTF8 when they are expecting LATIN1.

> Assuming this situation will remain as is in the immediate future, what
> are my options?  The 8.3 cluster has a LATIN1 database that must stay
> that way (at least for a while).

Why does it have to be LATIN1?  If your answer is "my client code deals
in LATIN1", just set client_encoding = LATIN1.  You could do that with
ALTER DATABASE SET or possibly ALTER USER SET so that it's transparent
to the clients.

            regards, tom lane
Hi Tom,

Tom Lane wrote:
> Why does it have to be LATIN1?  If your answer is "my client code deals
> in LATIN1", just set client_encoding = LATIN1.  You could do that with
> ALTER DATABASE SET or possibly ALTER USER SET so that it's transparent
> to the clients.
>

The client code is primarily encoding-agnostic at this time.  However,
the database was just migrated from SQL_ASCII after undergoing a
text-column "cleanup" (like removing Win-1252 funny quotes) so I'm not
eager to migrate to UTF8 right away, particularly when we're not
convinced we need it (although it seems eventually we'll do that).

Going back to my options, are you saying that if I re-initdb the 8.3
cluster to UTF8, restore the LATIN1 db as UTF8, and then do "set
client_encoding = LATIN1" in the application code, then everything will
work fine, even if the machine locale remains en_US.iso88591?  And, at
least in theory, if a non-LATIN1 character (like 0x92) is presented to
the converted database, will it be stored as is or silently transformed
(or will an error be issued)?

Joe
Joe <dev@freedomcircle.net> writes:
> Going back to my options, are you saying that if I re-initdb the 8.3
> cluster to UTF8, restore the LATIN1 db as UTF8, and then do "set
> client_encoding = LATIN1" in the application code, then everything will
> work fine, even if the machine locale remains en_US.iso88591?

No, you'd have to set the database's locale to en_US.utf8 to do that.

Plan B would be to set the database locale to C, which is
encoding-agnostic and hence allows different databases to have different
encodings.  Do you actually need en_US sort order?

            regards, tom lane
Tom Lane wrote:
> Plan B would be to set the database locale to C, which is
> encoding-agnostic and hence allows different databases to have different
> encodings.  Do you actually need en_US sort order?
>

I wasn't aware of the difference in sort orders until about two months
ago when i had to compare the output of the production db which was in a
C locale with SQL_ASCII encoding (now in C with LATIN1) with the
development db with en_US.iso88591/LATIN1.  For our purposes, the en_US
sort order seems much more reasonable, since for example, in a list by
titles, the article "Meet The New Boss, Same as the Old Boss" (with the
quotes being part of the title) sorts among the M titles, and not at the
beginning, before the A titles.  As I understand, 8.4 will include
LC_COLLATE support at the database rather than cluster level which
should help in this regard.

Joe