problem comparing strings when different cluster / database encoding - Mailing list pgsql-sql

From Tomas Vondra
Subject problem comparing strings when different cluster / database encoding
Date
Msg-id 443444B5.1010205@fuzzy.cz
Whole thread Raw
Responses Re: problem comparing strings when different cluster / database encoding
Re: problem comparing strings when different cluster / database encoding
List pgsql-sql
Greetings,

I've encountered a strange problem. We have a PG 8.0.x database cluster
(in the sense used in initdb, i.e. bunch of databases) created with
UNICODE encoding, namely cs_CZ.UTF-8 locale.

When a database is created with a different encoding (in our case it's
LATIN2) the string comparison doesn't work correctly. For example the query
  SELECT 'ě' = 'é';

returns 'true' which is obviously incorrect, as those two letters have
different accents (I hope you can see that). And of course, it's not
possible to create an unique index (or primary key) over a column of
words (for example in a dictionnary), because false collisions are
found, and the sorting works in a really strange way too.

If the both cluster and database are in the same encoding (UNICODE or
LATIN2), everything works fine.

Below is a short description how the database cluster and the databases
have been created.

----------------------------------------------------------------------
$ export LANG="cs_CZ.UTF-8"
$ initdb ... (cluster created with UNICODE encoding, cs_CZ.UTF-8 locale)
$ ... (postgres started, users created, etc.)
$ createdb -E LATIN2 my_database;
$ psql my_database;
> SELECT 'ě' = 'é'; (returns 'true', which is incorrect)
----------------------------------------------------------------------

If we create the cluster with LATIN2 encoding (or on the contrary the
database is created with UNICODE encoding), everything works fine.
For example the following works as expected.

----------------------------------------------------------------------
$ export LANG="cs_CZ" (thus the ISO-8859-2 encoding is used)
$ initdb ... (cluster created with LATIN2 encoding, cs_CZ locale)
$ ... (postgres started, users created, etc.)
$ createdb -E LATIN2 my_database;
$ psql my_database;
> SELECT 'ě' = 'é'; (returns 'false', which is correct)
----------------------------------------------------------------------

I'm trying to solve this for several days, but unsuccesfully. Is there
something I've missed? Some obvious solution I don't see?

The queston is why we need different encodings for cluster / databases.
(a) Until recently we've used LATIN2 cluster and LATIN2 databases (and    applications expecting LATIN2 encoding) -
that'sthe reason why we    need LATIN2 databases.
 
(c) On the other way some of the new clients want to 'internationalize'    their applications, so we need UNICODE
infrastructuretoo - that's    the reason why we use UNICODE cluster and databases.
 

I've came accross the nls_string function - with it it works fine, but
that's not an option for us, as it would require rewriting all the SQL
queries in the applications (and that's something we don't want).

Thanks for your suggestions
Tomas


pgsql-sql by date:

Previous
From: Tom Lane
Date:
Subject: Re: SELECT composite type
Next
From: Tom Lane
Date:
Subject: Re: problem comparing strings when different cluster / database encoding