Re: once again, sorting with Unicode - Mailing list pgsql-sql

From Troy
Subject Re: once again, sorting with Unicode
Date
Msg-id 200302191049.h1JAnT3a007870@tksoft.com
Whole thread Raw
In response to once again, sorting with Unicode  (JBJ <postgre@totw.org>)
Responses Re: once again, sorting with Unicode  (Antti Haapala <antti.haapala@iki.fi>)
List pgsql-sql
JBJ,

The cause for the different values is the fact that unicode
characters have different numeric values from ISO8859-1 and other
encodings. Only ascii values are in sync with unicode numeric
values. This I am sure you knew.

Myself, I've only had to deal with data in different encodings,
not unicode. I've implemented functions to assist in lower case
conversions and comparisons for the different encodings. The same
would work for unicode as well, you'd just need different functions
for each encoding. Remember that some European scripts have
multi character letters (e.g. "ll," "ch" etc.), so reordering characters
alone would never work. Sorting multibyte data (e.g. Japanese), where
you need to use dictionaries, would of course be a much more complex task.

Frankly, I haven't looked at the latest changes to postgres
in terms of encoding related functions, since there hasn't been the need
for me. So all this could be outdated info. I'll leave judgement to someone
whose looked at the features more recently.

Anyway, if a solution doesn't exist already, one solution I would
like would be to have built in helper functions for each encoding,
with unicode as input data. This way one could sort the data in
different ways by switching the function.
E.g. "... order by sortval_is_is(countryname)" could sort unicode data
in Icelandic order by reading the unicode data and then returning
an integer value based on the meaning of the characters in Icelandic.


Troy






>
> Thus is the nature of my problem:
>
> I have a multi-lingual database (currently 11 languages) which sorts fine
> in MySQL (8859-1 character set)
> I have now converted the data to Unicode and compiled Postgre with unicode
> support.
>
> I can select and insert unicode and so was rather pleased about that. Until
> I saw that it wasn't working properly when ordering!
>
> I have a table with all 11 languages containing a country list, the webuser
> will be able (this is not live yet, MySQL is still running the live stuff
> and handling it well) to switch between languages and so I select from a
> different column (english, icelandic, russian etc) to display the country
> list correctly and do ORDER BY that specific column. I noticed this problem
> when the output was Icelandic, and have confirmed it with german and other
> languages using non-English characters. The sorting goes all wrong.
>
> All the special letters come in between the A, so we have Þ A Ú A Í A Ó et
> cetera. This is wrong, Þ should be the third last letter in the row for
> example, not the first and A should be in a right order.
>
> I have is_IS as all the LC_X and so am even more surprised by this.
> Furthermore I will have to be able to switch between is_IS and others to
> get the correct sort order ( russian sort order different from icelandic
> for example, simply due to the nature of the letters) live when the user
> selects a different language.
>
> Example is here: http://www.worldfootball.org/sort-test.php
> -notice how countries beginning with A are surrounded by various others
> starting with non-english letters in the Icelandic section
> -in the German section you can see that Österreich comes in between the A
> countries
>
> I'm sure this is a simple matter but my googling for the last 3 days has
> proved unsuccessful so far.
>
> Feedback greatly appreciated
>
> regards,
> JBJ
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>



pgsql-sql by date:

Previous
From: Tom Lane
Date:
Subject: Re: Rule Creates Duplicates retry
Next
From: Antti Haapala
Date:
Subject: Re: once again, sorting with Unicode