Thread: problem comparing strings when different cluster / database encoding

problem comparing strings when different cluster / database encoding

From
Tomas Vondra
Date:
Greetings,

I've encountered a strange problem. We have a PG 8.0.x database cluster
(in the sense used in initdb, i.e. bunch of databases) created with
UNICODE encoding, namely cs_CZ.UTF-8 locale.

When a database is created with a different encoding (in our case it's
LATIN2) the string comparison doesn't work correctly. For example the query
  SELECT 'ě' = 'é';

returns 'true' which is obviously incorrect, as those two letters have
different accents (I hope you can see that). And of course, it's not
possible to create an unique index (or primary key) over a column of
words (for example in a dictionnary), because false collisions are
found, and the sorting works in a really strange way too.

If the both cluster and database are in the same encoding (UNICODE or
LATIN2), everything works fine.

Below is a short description how the database cluster and the databases
have been created.

----------------------------------------------------------------------
$ export LANG="cs_CZ.UTF-8"
$ initdb ... (cluster created with UNICODE encoding, cs_CZ.UTF-8 locale)
$ ... (postgres started, users created, etc.)
$ createdb -E LATIN2 my_database;
$ psql my_database;
> SELECT 'ě' = 'é'; (returns 'true', which is incorrect)
----------------------------------------------------------------------

If we create the cluster with LATIN2 encoding (or on the contrary the
database is created with UNICODE encoding), everything works fine.
For example the following works as expected.

----------------------------------------------------------------------
$ export LANG="cs_CZ" (thus the ISO-8859-2 encoding is used)
$ initdb ... (cluster created with LATIN2 encoding, cs_CZ locale)
$ ... (postgres started, users created, etc.)
$ createdb -E LATIN2 my_database;
$ psql my_database;
> SELECT 'ě' = 'é'; (returns 'false', which is correct)
----------------------------------------------------------------------

I'm trying to solve this for several days, but unsuccesfully. Is there
something I've missed? Some obvious solution I don't see?

The queston is why we need different encodings for cluster / databases.
(a) Until recently we've used LATIN2 cluster and LATIN2 databases (and    applications expecting LATIN2 encoding) -
that'sthe reason why we    need LATIN2 databases.
 
(c) On the other way some of the new clients want to 'internationalize'    their applications, so we need UNICODE
infrastructuretoo - that's    the reason why we use UNICODE cluster and databases.
 

I've came accross the nls_string function - with it it works fine, but
that's not an option for us, as it would require rewriting all the SQL
queries in the applications (and that's something we don't want).

Thanks for your suggestions
Tomas


Tomas Vondra <tv@fuzzy.cz> writes:
> I've encountered a strange problem. We have a PG 8.0.x database cluster
> (in the sense used in initdb, i.e. bunch of databases) created with
> UNICODE encoding, namely cs_CZ.UTF-8 locale.

> When a database is created with a different encoding (in our case it's
> LATIN2) the string comparison doesn't work correctly.

That's right.  This doesn't work, and is warned against in our docs (at
least in recent versions).  You really must keep server_encoding
matching the encoding the locale expects, for all except "trivial"
locales such as C that have no encoding expectations.

We'd forbid these combinations if there were any fully portable way to
detect which encoding the locale expects...
        regards, tom lane


Re: problem comparing strings when different cluster / database encoding

From
Alvaro Herrera
Date:
Tomas Vondra wrote:

Hi,

> I've encountered a strange problem. We have a PG 8.0.x database cluster
> (in the sense used in initdb, i.e. bunch of databases) created with
> UNICODE encoding, namely cs_CZ.UTF-8 locale.
> 
> When a database is created with a different encoding (in our case it's
> LATIN2) the string comparison doesn't work correctly.

Actually this is sort of expected.  Your locale configuration expects a
certain encoding; all databases should be created using that encoding
for string comparison to work properly.  What definitely is a bug is the
fact that CREATE DATABASE allows you to create a database with an
encoding different from the one defined by initdb.

The conclusion is that you should create your databases using UTF-8
encoding if you are using cs_CZ.UTF-8, otherwise it won't work as
expected.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support