problem comparing strings when different cluster / database encoding - Mailing list pgsql-sql
From | Tomas Vondra |
---|---|
Subject | problem comparing strings when different cluster / database encoding |
Date | |
Msg-id | 443444B5.1010205@fuzzy.cz Whole thread Raw |
Responses |
Re: problem comparing strings when different cluster / database encoding
(Tom Lane <tgl@sss.pgh.pa.us>)
Re: problem comparing strings when different cluster / database encoding (Alvaro Herrera <alvherre@commandprompt.com>) |
List | pgsql-sql |
Greetings, I've encountered a strange problem. We have a PG 8.0.x database cluster (in the sense used in initdb, i.e. bunch of databases) created with UNICODE encoding, namely cs_CZ.UTF-8 locale. When a database is created with a different encoding (in our case it's LATIN2) the string comparison doesn't work correctly. For example the query SELECT 'ě' = 'é'; returns 'true' which is obviously incorrect, as those two letters have different accents (I hope you can see that). And of course, it's not possible to create an unique index (or primary key) over a column of words (for example in a dictionnary), because false collisions are found, and the sorting works in a really strange way too. If the both cluster and database are in the same encoding (UNICODE or LATIN2), everything works fine. Below is a short description how the database cluster and the databases have been created. ---------------------------------------------------------------------- $ export LANG="cs_CZ.UTF-8" $ initdb ... (cluster created with UNICODE encoding, cs_CZ.UTF-8 locale) $ ... (postgres started, users created, etc.) $ createdb -E LATIN2 my_database; $ psql my_database; > SELECT 'ě' = 'é'; (returns 'true', which is incorrect) ---------------------------------------------------------------------- If we create the cluster with LATIN2 encoding (or on the contrary the database is created with UNICODE encoding), everything works fine. For example the following works as expected. ---------------------------------------------------------------------- $ export LANG="cs_CZ" (thus the ISO-8859-2 encoding is used) $ initdb ... (cluster created with LATIN2 encoding, cs_CZ locale) $ ... (postgres started, users created, etc.) $ createdb -E LATIN2 my_database; $ psql my_database; > SELECT 'ě' = 'é'; (returns 'false', which is correct) ---------------------------------------------------------------------- I'm trying to solve this for several days, but unsuccesfully. Is there something I've missed? Some obvious solution I don't see? The queston is why we need different encodings for cluster / databases. (a) Until recently we've used LATIN2 cluster and LATIN2 databases (and applications expecting LATIN2 encoding) - that'sthe reason why we need LATIN2 databases. (c) On the other way some of the new clients want to 'internationalize' their applications, so we need UNICODE infrastructuretoo - that's the reason why we use UNICODE cluster and databases. I've came accross the nls_string function - with it it works fine, but that's not an option for us, as it would require rewriting all the SQL queries in the applications (and that's something we don't want). Thanks for your suggestions Tomas