Thread: 'order by' does "wrong" with unicode-chars (german umlauts)
postgres 7.3.2 I store unicode-data in postgresql. The data is retrieved via webinterfaces, processed with perl and then stored in postgresql (and viceversa). All is going nice with one problem. If performing a "select * order by field"-query the result is not what I expected. German umlauts (ie: Ö) are stored as doublechars ("Ö" is "Ã") and only the first char seems to be taken into account when sorting. So it happens that the order is like: Österreich America Pakistan instead of Amerika Österreich Pakistan How to deal with this Problem ? Of course converting to latin before storing would be a solution but we plan to offer support for many non-latin languages later and the meaning of unicode is to get rid of all this converting-stuff after all. thnx, peter -- IT-Consulting mag. peter pilsl tel:+43-699-1-3574035 fax:+43-699-4-3574035 pilsl@goldfisch.at http://www.goldfisch.at
On Friday 19 September 2003 13:11, peter pilsl wrote: > postgres 7.3.2 > > I store unicode-data in postgresql. The data is retrieved via > webinterfaces, processed with perl and then stored in postgresql (and > viceversa). > > All is going nice with one problem. If performing a "select * order by > field"-query the result is not what I expected. > > German umlauts (ie: Ö) are stored as doublechars ("Ö" is "Ã") and only the > first char seems to be taken into account when sorting. > > So it happens that the order is like: > > Österreich > America > Pakistan > > instead of > > Amerika > Österreich > Pakistan I'm no expert on locales, but I think you're confusing two things. Your character-set determines what symbols you can store. Your locale determines sorting rules. Check the end of the postgresql.conf file for details of what your current settings are. > How to deal with this Problem ? Of course converting to latin before > storing would be a solution but we plan to offer support for many non-latin > languages later and the meaning of unicode is to get rid of all this > converting-stuff after all. What sorting-order do you want? You probably have options like: C, en_GB, de, it etc. If you are storing non-latin1 symbols as well as latin1, I can't think of what a reasonable sort order would be. Unfortunately, PG only supports one locale at a time, and gets set during initdb. See the chapter on Localisation in the manuals for details. -- Richard Huxton Archonet Ltd
> > I'm no expert on locales, but I think you're confusing two things. > Your character-set determines what symbols you can store. > Your locale determines sorting rules. Check the end of the postgresql.conf > file for details of what your current settings are. > I dont think that this is my problem. I get my text from a web-form, process it via perl and store it in postgreSQL via DBI-Interface. The unicode-text appears as multibyte in perl and I got the suspect that postgresql simply takes this multibyte-text and doesnt even reckognize that it could be unicode. If I store a german-umlaut-O (uppercase) to postgres and then retrieve it using the lower-function on it I dont get a german-umlaut-o (lowercase) at all. Only the first byte is converted to lowercase and the second is left untouched, while in "real" unicode-lowercasing the first byte would stay untouched and the second would change. I still dont know how to tell postgres that the data it receives is unicode and not just "singlebyte". I'll rethink my problem and post a somehow more precise question to the mainlist then, but any comments to shorten and improve my rethinking are highly welcome. thnx, peter
On Saturday 20 September 2003 13:56, peter pilsl wrote: > > I'm no expert on locales, but I think you're confusing two things. > > Your character-set determines what symbols you can store. > > Your locale determines sorting rules. Check the end of the > > postgresql.conf file for details of what your current settings are. > > I dont think that this is my problem. Sorry - looks like the sorting part of your question threw me off track. > I get my text from a web-form, process it via perl and store it in > postgreSQL via DBI-Interface. The unicode-text appears as multibyte in perl > and I got the suspect that postgresql simply takes this multibyte-text and > doesnt even reckognize that it could be unicode. Could be the case - try "show client_encoding" in psql to see what encoding you are using. > If I store a german-umlaut-O (uppercase) to postgres and then retrieve it > using the lower-function on it I dont get a german-umlaut-o (lowercase) at > all. Only the first byte is converted to lowercase and the second is left > untouched, while in "real" unicode-lowercasing the first byte would stay > untouched and the second would change. > I still dont know how to tell postgres that the data it receives is unicode > and not just "singlebyte". If it turns out you want to change encoding to multibyte, I think you'll need to dump an initdb again. See the chapter on localization - multi-byte encodings for details. -- Richard Huxton Archonet Ltd
peter pilsl <pilsl@goldfisch.at> writes: > I get my text from a web-form, process it via perl and store it in postgreSQL > via DBI-Interface. The unicode-text appears as multibyte in perl and I got the > suspect that postgresql simply takes this multibyte-text and doesnt even > reckognize that it could be unicode. If you have set the database encoding as SQL_ASCII, then that's exactly what it will (and should) do. You need to make the database encoding be unicode. regards, tom lane
It makes no difference if I use a database with encoding unicode: # \l List of databases Name | Owner | Encoding --------------------+--------------+----------- <skip> test | peter | SQL_ASCII unicode | peter | UNICODE unicode2 | peter | LATIN1 I tried with all these databases. The problem stays exactely the same. The german umlaut-O appears as double-byte "ö" and when ordering it appears between "A" and "B", cause imho only the first byte "Ã" is taking into account. I still have no idea, if there is a problem outside postgreSQL (in perl or maybe in the DBD::Pg-interface) or inside postgreSQL. I still struggle with serious debugging cause I dont know how to insert "real" unicode into postgres ... Any debugging-hints welcome also :) Dealing with the UNICODE-database raises new problems: unicode=# insert into test values ('österreich'); ERROR: Unicode >= 0x10000 is not supported unicode=# show client_encoding; client_encoding ----------------- UNICODE (1 row) thnx, peter Quoting Tom Lane <tgl@sss.pgh.pa.us>: > peter pilsl <pilsl@goldfisch.at> writes: > > I get my text from a web-form, process it via perl and store it in > postgreSQL > > via DBI-Interface. The unicode-text appears as multibyte in perl and I got > the > > suspect that postgresql simply takes this multibyte-text and doesnt even > > reckognize that it could be unicode. > > If you have set the database encoding as SQL_ASCII, then that's exactly > what it will (and should) do. You need to make the database encoding > be unicode. > > regards, tom lane > >
Sat, 20 Sep 2003 18:39:35 +0200 skrev pilsl@goldfisch.at (peter pilsl): > It makes no difference if I use a database with encoding unicode: > ah@ahb:~$ LC_ALL=da_DK initdb ah@ahb:~$ su postgres -c "/usr/local/pgsql/bin/createuser -ad ah" ah@ahb:~$ createdb ah ah@ahb:~$ psql ah ah=# \l List of databases Name | Owner | Encoding -----------+----------+----------- ah | ah | SQL_ASCII ah=# show client_encoding; client_encoding ----------------- SQL_ASCII (1 row) CREATE TABLE test (f1 varchar); INSERT INTO test VALUES ('A'); INSERT INTO test VALUES ('B'); INSERT INTO test VALUES ('AA'); INSERT INTO test VALUES ('Æ'); INSERT INTO test VALUES ('Å'); INSERT INTO test VALUES ('Ø'); INSERT INTO test VALUES ('Ä'); INSERT INTO test VALUES ('Ö'); INSERT INTO test VALUES ('Ü'); SELECT * FROM test ORDER BY f1; t ---- A B Ü Æ Ä Ø Ö Å AA (9 rows) Looks OK to me ;-)
I think it is an expected result. As far as I understand it, LC_ALL determines the collation/sorting rule, and unicode allows you to store and retrieve multibytes character which has nothing to do with the soring rules. kathy > X-Original-To: pgsql-general-postgresql.org@localhost.postgresql.org > Date: Sat, 20 Sep 2003 22:53:54 +0200 > From: Andreas Hinz <news3@winopticc.dk> > To: pgsql-general@postgresql.org > Subject: Re: [GENERAL] 'order by' does "wrong" with unicode-chars (german umlauts) > X-Virus-Scanned: by amavisd-new at postgresql.org > X-Mailing-List: pgsql-general > X-AntiAbuse: This header was added to track abuse, please include it with any abuse report > X-AntiAbuse: Primary Hostname - noon.pghoster.com > X-AntiAbuse: Original Domain - sun.com > X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] > X-AntiAbuse: Sender Address Domain - postgresql.org > Content-Transfer-Encoding: 8bit > X-MIME-Autoconverted: from quoted-printable to 8bit by amon.Central.Sun.COM id h8M2O1c23732 > > Sat, 20 Sep 2003 18:39:35 +0200 > skrev pilsl@goldfisch.at (peter pilsl): > > > It makes no difference if I use a database with encoding unicode: > > > > ah@ahb:~$ LC_ALL=da_DK initdb > ah@ahb:~$ su postgres -c "/usr/local/pgsql/bin/createuser -ad ah" > ah@ahb:~$ createdb ah > ah@ahb:~$ psql ah > > ah=# \l > > List of databases > Name | Owner | Encoding > -----------+----------+----------- > ah | ah | SQL_ASCII > > > ah=# show client_encoding; > > client_encoding > ----------------- > SQL_ASCII > (1 row) > > > CREATE TABLE test (f1 varchar); > > INSERT INTO test VALUES ('A'); > INSERT INTO test VALUES ('B'); > INSERT INTO test VALUES ('AA'); > INSERT INTO test VALUES ('Æ'); > INSERT INTO test VALUES ('Å'); > INSERT INTO test VALUES ('Ø'); > INSERT INTO test VALUES ('Ä'); > INSERT INTO test VALUES ('Ö'); > INSERT INTO test VALUES ('Ü'); > > SELECT * FROM test ORDER BY f1; > > t > ---- > A > B > Ü > Æ > Ä > Ø > Ö > Å > AA > (9 rows) > > > Looks OK to me ;-) > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings >