Thread: once again, sorting with Unicode

once again, sorting with Unicode

From

JBJ

Date:

18 February 2003, 18:13:24

Thus is the nature of my problem:

I have a multi-lingual database (currently 11 languages) which sorts fine
in MySQL (8859-1 character set)
I have now converted the data to Unicode and compiled Postgre with unicode
support.

I can select and insert unicode and so was rather pleased about that. Until
I saw that it wasn't working properly when ordering!

I have a table with all 11 languages containing a country list, the webuser
will be able (this is not live yet, MySQL is still running the live stuff
and handling it well) to switch between languages and so I select from a
different column (english, icelandic, russian etc) to display the country
list correctly and do ORDER BY that specific column. I noticed this problem
when the output was Icelandic, and have confirmed it with german and other
languages using non-English characters. The sorting goes all wrong.

All the special letters come in between the A, so we have Þ A Ú A Í A Ó et
cetera. This is wrong, Þ should be the third last letter in the row for
example, not the first and A should be in a right order.

I have is_IS as all the LC_X and so am even more surprised by this.
Furthermore I will have to be able to switch between is_IS and others to
get the correct sort order ( russian sort order different from icelandic
for example, simply due to the nature of the letters) live when the user
selects a different language.

Example is here: http://www.worldfootball.org/sort-test.php
-notice how countries beginning with A are surrounded by various others
starting with non-english letters in the Icelandic section
-in the German section you can see that Österreich comes in between the A
countries

I'm sure this is a simple matter but my googling for the last 3 days has
proved unsuccessful so far.

Feedback greatly appreciated

regards,
JBJ

Re: once again, sorting with Unicode

From

"Troy"

Date:

19 February 2003, 06:16:41

JBJ,

The cause for the different values is the fact that unicode
characters have different numeric values from ISO8859-1 and other
encodings. Only ascii values are in sync with unicode numeric
values. This I am sure you knew.

Myself, I've only had to deal with data in different encodings,
not unicode. I've implemented functions to assist in lower case
conversions and comparisons for the different encodings. The same
would work for unicode as well, you'd just need different functions
for each encoding. Remember that some European scripts have
multi character letters (e.g. "ll," "ch" etc.), so reordering characters
alone would never work. Sorting multibyte data (e.g. Japanese), where
you need to use dictionaries, would of course be a much more complex task.

Frankly, I haven't looked at the latest changes to postgres
in terms of encoding related functions, since there hasn't been the need
for me. So all this could be outdated info. I'll leave judgement to someone
whose looked at the features more recently.

Anyway, if a solution doesn't exist already, one solution I would
like would be to have built in helper functions for each encoding,
with unicode as input data. This way one could sort the data in
different ways by switching the function.
E.g. "... order by sortval_is_is(countryname)" could sort unicode data
in Icelandic order by reading the unicode data and then returning
an integer value based on the meaning of the characters in Icelandic.


Troy






>
> Thus is the nature of my problem:
>
> I have a multi-lingual database (currently 11 languages) which sorts fine
> in MySQL (8859-1 character set)
> I have now converted the data to Unicode and compiled Postgre with unicode
> support.
>
> I can select and insert unicode and so was rather pleased about that. Until
> I saw that it wasn't working properly when ordering!
>
> I have a table with all 11 languages containing a country list, the webuser
> will be able (this is not live yet, MySQL is still running the live stuff
> and handling it well) to switch between languages and so I select from a
> different column (english, icelandic, russian etc) to display the country
> list correctly and do ORDER BY that specific column. I noticed this problem
> when the output was Icelandic, and have confirmed it with german and other
> languages using non-English characters. The sorting goes all wrong.
>
> All the special letters come in between the A, so we have Þ A Ú A Í A Ó et
> cetera. This is wrong, Þ should be the third last letter in the row for
> example, not the first and A should be in a right order.
>
> I have is_IS as all the LC_X and so am even more surprised by this.
> Furthermore I will have to be able to switch between is_IS and others to
> get the correct sort order ( russian sort order different from icelandic
> for example, simply due to the nature of the letters) live when the user
> selects a different language.
>
> Example is here: http://www.worldfootball.org/sort-test.php
> -notice how countries beginning with A are surrounded by various others
> starting with non-english letters in the Icelandic section
> -in the German section you can see that Österreich comes in between the A
> countries
>
> I'm sure this is a simple matter but my googling for the last 3 days has
> proved unsuccessful so far.
>
> Feedback greatly appreciated
>
> regards,
> JBJ
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>

Re: once again, sorting with Unicode

From

Antti Haapala

Date:

19 February 2003, 07:37:28

On Wed, 19 Feb 2003, Troy wrote:

> > I have a multi-lingual database (currently 11 languages) which sorts
> > fine in MySQL (8859-1 character set) I have now converted the data to
> > Unicode and compiled Postgre with unicode support.
> >
> > I can select and insert unicode and so was rather pleased about that.
> > Until I saw that it wasn't working properly when ordering!
>
> The cause for the different values is the fact that unicode characters
> have different numeric values from ISO8859-1 and other encodings. Only
> ascii values are in sync with unicode numeric values. This I am sure you
> knew.

No, ISO8859-1 maps directly to unicode up to U+00FF. So the actual
_numeric_ values are the same. But actual byte patterns are encoding
dependent.

Have you set database encoding to UTF-8? Are you using proper UTF-8
locales? POSIX compiled locales are often charset dependent.

-- 
Antti Haapala

Re: once again, sorting with Unicode

From

Tom Lane

Date:

19 February 2003, 10:22:36

JBJ <postgre@totw.org> writes:
> Furthermore I will have to be able to switch between is_IS and others to 
> get the correct sort order ( russian sort order different from icelandic 
> for example, simply due to the nature of the letters) live when the user 
> selects a different language.

Postgres can't do that at the moment :-(.  You might be able to hack up
some custom functions to do something likeORDER BY icelandic(text_col)
but it'll be messy, and probably slow.
        regards, tom lane

Re: once again, sorting with Unicode

From

"Troy"

Date:

19 February 2003, 15:38:02

You can generate indexes for your custom functions, though,
which will speed things up. This is what I've done, successfully.



Troy



> 
> JBJ <postgre@totw.org> writes:
> > Furthermore I will have to be able to switch between is_IS and others to 
> > get the correct sort order ( russian sort order different from icelandic 
> > for example, simply due to the nature of the letters) live when the user 
> > selects a different language.
> 
> Postgres can't do that at the moment :-(.  You might be able to hack up
> some custom functions to do something like
>     ORDER BY icelandic(text_col)
> but it'll be messy, and probably slow.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>

Re: once again, sorting with Unicode

From

JBJ

Date:

19 February 2003, 16:12:26

At 20:16 19.2.2003, Troy K wrote:
>You can generate indexes for your custom functions, though,
>which will speed things up. This is what I've done, successfully.

Sounds useful, do you have a demo of such a function?

I can if all else fails sort the data using PHP but am not too fond of it 
when I have over 2000 rows or more as will be the case in other tables.

Thanks all for the answers.

Re: once again, sorting with Unicode

From

"Troy"

Date:

20 February 2003, 05:53:23

There are various examples in the example source code section
of the postgres distribution, where you can find code you can
use to write exactly the kind of funtion you need. I can't
immediately include source code from us, but I can include
the gist of how the code works.

The basic idea is to convert the input data to byte values
which are in the right order. If the input data is unicode,
utf8, utf16, or whatever, you have to know what it is, so you
can convert the data to a meaningful byte stream which can be
evaluated just like an array of numbers would be. I.e. remove
bytes which indicate something to the encoding and convert
characters to their one byte values. E.g. if the data is UTF8,
it is one or two bytes long for ISO8859_1 (upto six bytes for
others), one byte for ascii and two bytes for ISO8859_1. You
need to convert it to a one byte long value so comparisons at
byte level will work. For pure unicode you just have to skip
every other byte. 

1. Source code :

... various includes.

PG_FUNCTION_INFO_V1(sample_encoding_func);

Datum sample_encoding_func(PG_FUNCTION_ARGS) {  text * str;  text * result;  size_t len;
  if (PG_ARGISNULL(0))      PG_RETURN_NULL();
  str = PG_GETARG_TEXT_P(0);
  len = VARSIZE(str) - VARHDRSZ;
 ...  do your conversion thing, allocate memory for the   result and return the value, doing error checking as you
go.
}

Add the function to your db:

DROP FUNCTION sample_encoding_func (text);
CREATE FUNCTION sample_encoding_func (text) RETURNS text  AS 'sample_encoding_func.so'  LANGUAGE 'C' WITH
(iscachable,isstrict);

You can create an index with:

create index dummyindex on usertable using btree (sample_encoding_func(username) text_ops);

Troy

> 
> At 20:16 19.2.2003, Troy K wrote:
> >You can generate indexes for your custom functions, though,
> >which will speed things up. This is what I've done, successfully.
> 
> Sounds useful, do you have a demo of such a function?
> 
> I can if all else fails sort the data using PHP but am not too fond of it 
> when I have over 2000 rows or more as will be the case in other tables.
> 
> Thanks all for the answers.
> 
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>

Re: once again, sorting with Unicode

From

"Troy"

Date:

20 February 2003, 05:55:50

You are right, of course. I was thinking in terms of the encoded
data. Applications usually get data in UTF8 or UTF16. If the 
input data is true unicode, then there is no difference in
the byte values (just skip the 0x00 bytes).

Cheers,

Troy



> 
> 
> On Wed, 19 Feb 2003, Troy wrote:
> 
> > > I have a multi-lingual database (currently 11 languages) which sorts
> > > fine in MySQL (8859-1 character set) I have now converted the data to
> > > Unicode and compiled Postgre with unicode support.
> > >
> > > I can select and insert unicode and so was rather pleased about that.
> > > Until I saw that it wasn't working properly when ordering!
> >
> > The cause for the different values is the fact that unicode characters
> > have different numeric values from ISO8859-1 and other encodings. Only
> > ascii values are in sync with unicode numeric values. This I am sure you
> > knew.
> 
> No, ISO8859-1 maps directly to unicode up to U+00FF. So the actual
> _numeric_ values are the same. But actual byte patterns are encoding
> dependent.
> 
> Have you set database encoding to UTF-8? Are you using proper UTF-8
> locales? POSIX compiled locales are often charset dependent.
> 
> -- 
> Antti Haapala
> 
> 
> 
>

Re: once again, sorting with Unicode

From

"Troy"

Date:

20 February 2003, 15:10:25

There are various examples in the example source code section
of the postgres distribution, where you can find code you can
use to write exactly the kind of funtion you need. I can't
immediately include source code from us, but I can include
the gist of how the code works.

The basic idea is to convert the input data to byte values
which are in the right order. If the input data is unicode,
utf8, utf16, or whatever, you have to know what it is, so you
can convert the data to a meaningful byte stream which can be
evaluated just like an array of numbers would be. I.e. remove
bytes which indicate something to the encoding and convert
characters to their one byte values. E.g. if the data is UTF8,
it is one or two bytes long for ISO8859_1 (upto six bytes for
others), one byte for ascii and two bytes for ISO8859_1. You
need to convert it to a one byte long value so comparisons at
byte level will work. For pure unicode you just have to skip
every other byte. 

1. Source code :

... various includes.

PG_FUNCTION_INFO_V1(sample_encoding_func);

Datum sample_encoding_func(PG_FUNCTION_ARGS) {  text * str;  text * result;  size_t len;
  if (PG_ARGISNULL(0))      PG_RETURN_NULL();
  str = PG_GETARG_TEXT_P(0);
  len = VARSIZE(str) - VARHDRSZ;
 ...  do your conversion thing, allocate memory for the   result and return the value, doing error checking as you
go.
}

Add the function to your db:

DROP FUNCTION sample_encoding_func (text);
CREATE FUNCTION sample_encoding_func (text) RETURNS text  AS 'sample_encoding_func.so'  LANGUAGE 'C' WITH
(iscachable,isstrict);

You can create an index with:

create index dummyindex on usertable using btree (sample_encoding_func(username) text_ops);

Troy

> 
> At 20:16 19.2.2003, Troy K wrote:
> >You can generate indexes for your custom functions, though,
> >which will speed things up. This is what I've done, successfully.
> 
> Sounds useful, do you have a demo of such a function?
> 
> I can if all else fails sort the data using PHP but am not too fond of it 
> when I have over 2000 rows or more as will be the case in other tables.
> 
> Thanks all for the answers.
> 
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>