Thread: 'order by' does "wrong" with unicode-chars (german umlauts)

'order by' does "wrong" with unicode-chars (german umlauts)

From
peter pilsl
Date:
postgres 7.3.2

I store unicode-data in postgresql. The data is retrieved via webinterfaces,
processed with perl and then stored in postgresql (and viceversa).

All is going nice with one problem. If performing a "select * order by
field"-query the result is not what I expected.

German umlauts (ie: Ö) are stored as doublechars ("Ö" is "Ö") and only the
first char seems to be taken into account when sorting.

So it happens that the order is like:

Österreich
America
Pakistan

instead of

Amerika
Österreich
Pakistan


How to deal with this Problem ? Of course converting to latin before storing
would be a solution but we plan to offer support for many non-latin languages
later and the meaning of unicode is to get rid of all this converting-stuff
after all.

thnx,
peter



--
IT-Consulting
mag. peter  pilsl
tel:+43-699-1-3574035
fax:+43-699-4-3574035
pilsl@goldfisch.at
http://www.goldfisch.at




Re: 'order by' does "wrong" with unicode-chars (german umlauts)

From
Richard Huxton
Date:
On Friday 19 September 2003 13:11, peter pilsl wrote:
> postgres 7.3.2
>
> I store unicode-data in postgresql. The data is retrieved via
> webinterfaces, processed with perl and then stored in postgresql (and
> viceversa).
>
> All is going nice with one problem. If performing a "select * order by
> field"-query the result is not what I expected.
>
> German umlauts (ie: Ö) are stored as doublechars ("Ö" is "Ö") and only the
> first char seems to be taken into account when sorting.
>
> So it happens that the order is like:
>
> Österreich
> America
> Pakistan
>
> instead of
>
> Amerika
> Österreich
> Pakistan

I'm no expert on locales, but I think you're confusing two things.
Your character-set determines what symbols you can store.
Your locale determines sorting rules. Check the end of the postgresql.conf
file for details of what your current settings are.

> How to deal with this Problem ? Of course converting to latin before
> storing would be a solution but we plan to offer support for many non-latin
> languages later and the meaning of unicode is to get rid of all this
> converting-stuff after all.

What sorting-order do you want? You probably have options like: C, en_GB, de,
it etc.
If you are storing non-latin1 symbols as well as latin1, I can't think of what
a reasonable sort order would be.

Unfortunately, PG only supports one locale at a time, and gets set during
initdb. See the chapter on Localisation in the manuals for details.

--
  Richard Huxton
  Archonet Ltd

Re: 'order by' does "wrong" with unicode-chars (german umlauts)

From
peter pilsl
Date:
>
> I'm no expert on locales, but I think you're confusing two things.
> Your character-set determines what symbols you can store.
> Your locale determines sorting rules. Check the end of the postgresql.conf
> file for details of what your current settings are.
>

I dont think that this is my problem.

I get my text from a web-form, process it via perl and store it in postgreSQL
via DBI-Interface. The unicode-text appears as multibyte in perl and I got the
suspect that postgresql simply takes this multibyte-text and doesnt even
reckognize that it could be unicode.

If I store a german-umlaut-O (uppercase) to postgres and then retrieve it using
the lower-function on it I dont get a german-umlaut-o (lowercase) at all.
Only the first byte is converted to lowercase and the second is left untouched,
while in "real" unicode-lowercasing the first byte would stay untouched and the
second would change.
I still dont know how to tell postgres that the data it receives is unicode and
not just "singlebyte".

I'll rethink my problem and post a somehow more precise question to the mainlist
then, but any comments to shorten and improve my rethinking are highly welcome.

thnx,
peter



Re: 'order by' does "wrong" with unicode-chars (german umlauts)

From
Richard Huxton
Date:
On Saturday 20 September 2003 13:56, peter pilsl wrote:
> > I'm no expert on locales, but I think you're confusing two things.
> > Your character-set determines what symbols you can store.
> > Your locale determines sorting rules. Check the end of the
> > postgresql.conf file for details of what your current settings are.
>
> I dont think that this is my problem.

Sorry - looks like the sorting part of your question threw me off track.

> I get my text from a web-form, process it via perl and store it in
> postgreSQL via DBI-Interface. The unicode-text appears as multibyte in perl
> and I got the suspect that postgresql simply takes this multibyte-text and
> doesnt even reckognize that it could be unicode.

Could be the case - try "show client_encoding" in psql to see what encoding
you are using.

> If I store a german-umlaut-O (uppercase) to postgres and then retrieve it
> using the lower-function on it I dont get a german-umlaut-o (lowercase) at
> all. Only the first byte is converted to lowercase and the second is left
> untouched, while in "real" unicode-lowercasing the first byte would stay
> untouched and the second would change.
> I still dont know how to tell postgres that the data it receives is unicode
> and not just "singlebyte".

If it turns out you want to change encoding to multibyte, I think you'll need
to dump an initdb again. See the chapter on localization - multi-byte
encodings for details.


--
  Richard Huxton
  Archonet Ltd

Re: 'order by' does "wrong" with unicode-chars (german umlauts)

From
Tom Lane
Date:
peter pilsl <pilsl@goldfisch.at> writes:
> I get my text from a web-form, process it via perl and store it in postgreSQL
> via DBI-Interface. The unicode-text appears as multibyte in perl and I got the
> suspect that postgresql simply takes this multibyte-text and doesnt even
> reckognize that it could be unicode.

If you have set the database encoding as SQL_ASCII, then that's exactly
what it will (and should) do.  You need to make the database encoding
be unicode.

            regards, tom lane

Re: 'order by' does "wrong" with unicode-chars (german umlauts)

From
peter pilsl
Date:

It makes no difference if I use a database with encoding unicode:


# \l
               List of databases
        Name        |    Owner     | Encoding
--------------------+--------------+-----------
<skip>
 test               | peter        | SQL_ASCII
 unicode            | peter        | UNICODE
 unicode2           | peter        | LATIN1


I tried with all these databases. The problem stays exactely the same. The
german umlaut-O appears as double-byte "ö" and when ordering it appears between
"A" and "B", cause imho only the first byte "Ã" is taking into account.

I still have no idea, if there is a problem outside postgreSQL (in perl or maybe
in the DBD::Pg-interface) or inside postgreSQL. I still struggle with serious
debugging cause I dont know how to insert "real" unicode into postgres ... Any
debugging-hints welcome also :)


Dealing with the UNICODE-database raises new problems:
unicode=# insert into test values ('österreich');
ERROR:  Unicode >= 0x10000 is not supported
unicode=# show client_encoding;
 client_encoding
-----------------
 UNICODE
(1 row)


thnx,
peter


Quoting Tom Lane <tgl@sss.pgh.pa.us>:

> peter pilsl <pilsl@goldfisch.at> writes:
> > I get my text from a web-form, process it via perl and store it in
> postgreSQL
> > via DBI-Interface. The unicode-text appears as multibyte in perl and I got
> the
> > suspect that postgresql simply takes this multibyte-text and doesnt even
> > reckognize that it could be unicode.
>
> If you have set the database encoding as SQL_ASCII, then that's exactly
> what it will (and should) do.  You need to make the database encoding
> be unicode.
>
>             regards, tom lane
>
>

Re: 'order by' does "wrong" with unicode-chars (german umlauts)

From
Andreas Hinz
Date:
Sat, 20 Sep 2003 18:39:35 +0200
skrev pilsl@goldfisch.at (peter pilsl):

> It makes no difference if I use a database with encoding unicode:
>

ah@ahb:~$ LC_ALL=da_DK initdb
ah@ahb:~$ su postgres -c "/usr/local/pgsql/bin/createuser -ad ah"
ah@ahb:~$ createdb ah
ah@ahb:~$ psql ah

ah=# \l

        List of databases
   Name    |  Owner   | Encoding
-----------+----------+-----------
 ah        | ah       | SQL_ASCII


ah=# show client_encoding;

 client_encoding
-----------------
 SQL_ASCII
(1 row)


CREATE TABLE test (f1 varchar);

INSERT INTO test VALUES ('A');
INSERT INTO test VALUES ('B');
INSERT INTO test VALUES ('AA');
INSERT INTO test VALUES ('Æ');
INSERT INTO test VALUES ('Å');
INSERT INTO test VALUES ('Ø');
INSERT INTO test VALUES ('Ä');
INSERT INTO test VALUES ('Ö');
INSERT INTO test VALUES ('Ü');

SELECT * FROM test ORDER BY f1;

 t
----
 A
 B
 Ü
 Æ
 Ä
 Ø
 Ö
 Å
 AA
(9 rows)


Looks OK to me  ;-)


Re: 'order by' does "wrong" with unicode-chars (german umlauts)

From
Kathy Zhu
Date:
I think it is an expected result.
As far as I understand it, LC_ALL determines the collation/sorting rule, and
unicode allows you to store and retrieve multibytes character which has nothing
to do with the soring rules.

kathy


> X-Original-To: pgsql-general-postgresql.org@localhost.postgresql.org
> Date: Sat, 20 Sep 2003 22:53:54 +0200
> From: Andreas Hinz <news3@winopticc.dk>
> To: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] 'order by' does "wrong" with unicode-chars (german
umlauts)
> X-Virus-Scanned: by amavisd-new at postgresql.org
> X-Mailing-List: pgsql-general
> X-AntiAbuse: This header was added to track abuse, please include it with any
abuse report
> X-AntiAbuse: Primary Hostname - noon.pghoster.com
> X-AntiAbuse: Original Domain - sun.com
> X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
> X-AntiAbuse: Sender Address Domain - postgresql.org
> Content-Transfer-Encoding: 8bit
> X-MIME-Autoconverted: from quoted-printable to 8bit by amon.Central.Sun.COM id
h8M2O1c23732
>
> Sat, 20 Sep 2003 18:39:35 +0200
> skrev pilsl@goldfisch.at (peter pilsl):
>
> > It makes no difference if I use a database with encoding unicode:
> >
>
> ah@ahb:~$ LC_ALL=da_DK initdb
> ah@ahb:~$ su postgres -c "/usr/local/pgsql/bin/createuser -ad ah"
> ah@ahb:~$ createdb ah
> ah@ahb:~$ psql ah
>
> ah=# \l
>
>         List of databases
>    Name    |  Owner   | Encoding
> -----------+----------+-----------
>  ah        | ah       | SQL_ASCII
>
>
> ah=# show client_encoding;
>
>  client_encoding
> -----------------
>  SQL_ASCII
> (1 row)
>
>
> CREATE TABLE test (f1 varchar);
>
> INSERT INTO test VALUES ('A');
> INSERT INTO test VALUES ('B');
> INSERT INTO test VALUES ('AA');
> INSERT INTO test VALUES ('Æ');
> INSERT INTO test VALUES ('Å');
> INSERT INTO test VALUES ('Ø');
> INSERT INTO test VALUES ('Ä');
> INSERT INTO test VALUES ('Ö');
> INSERT INTO test VALUES ('Ü');
>
> SELECT * FROM test ORDER BY f1;
>
>  t
> ----
>  A
>  B
>  Ü
>  Æ
>  Ä
>  Ø
>  Ö
>  Å
>  AA
> (9 rows)
>
>
> Looks OK to me  ;-)
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>