Thread: once again, sorting with Unicode
Thus is the nature of my problem: I have a multi-lingual database (currently 11 languages) which sorts fine in MySQL (8859-1 character set) I have now converted the data to Unicode and compiled Postgre with unicode support. I can select and insert unicode and so was rather pleased about that. Until I saw that it wasn't working properly when ordering! I have a table with all 11 languages containing a country list, the webuser will be able (this is not live yet, MySQL is still running the live stuff and handling it well) to switch between languages and so I select from a different column (english, icelandic, russian etc) to display the country list correctly and do ORDER BY that specific column. I noticed this problem when the output was Icelandic, and have confirmed it with german and other languages using non-English characters. The sorting goes all wrong. All the special letters come in between the A, so we have Þ A Ú A Í A Ó et cetera. This is wrong, Þ should be the third last letter in the row for example, not the first and A should be in a right order. I have is_IS as all the LC_X and so am even more surprised by this. Furthermore I will have to be able to switch between is_IS and others to get the correct sort order ( russian sort order different from icelandic for example, simply due to the nature of the letters) live when the user selects a different language. Example is here: http://www.worldfootball.org/sort-test.php -notice how countries beginning with A are surrounded by various others starting with non-english letters in the Icelandic section -in the German section you can see that Österreich comes in between the A countries I'm sure this is a simple matter but my googling for the last 3 days has proved unsuccessful so far. Feedback greatly appreciated regards, JBJ
JBJ, The cause for the different values is the fact that unicode characters have different numeric values from ISO8859-1 and other encodings. Only ascii values are in sync with unicode numeric values. This I am sure you knew. Myself, I've only had to deal with data in different encodings, not unicode. I've implemented functions to assist in lower case conversions and comparisons for the different encodings. The same would work for unicode as well, you'd just need different functions for each encoding. Remember that some European scripts have multi character letters (e.g. "ll," "ch" etc.), so reordering characters alone would never work. Sorting multibyte data (e.g. Japanese), where you need to use dictionaries, would of course be a much more complex task. Frankly, I haven't looked at the latest changes to postgres in terms of encoding related functions, since there hasn't been the need for me. So all this could be outdated info. I'll leave judgement to someone whose looked at the features more recently. Anyway, if a solution doesn't exist already, one solution I would like would be to have built in helper functions for each encoding, with unicode as input data. This way one could sort the data in different ways by switching the function. E.g. "... order by sortval_is_is(countryname)" could sort unicode data in Icelandic order by reading the unicode data and then returning an integer value based on the meaning of the characters in Icelandic. Troy > > Thus is the nature of my problem: > > I have a multi-lingual database (currently 11 languages) which sorts fine > in MySQL (8859-1 character set) > I have now converted the data to Unicode and compiled Postgre with unicode > support. > > I can select and insert unicode and so was rather pleased about that. Until > I saw that it wasn't working properly when ordering! > > I have a table with all 11 languages containing a country list, the webuser > will be able (this is not live yet, MySQL is still running the live stuff > and handling it well) to switch between languages and so I select from a > different column (english, icelandic, russian etc) to display the country > list correctly and do ORDER BY that specific column. I noticed this problem > when the output was Icelandic, and have confirmed it with german and other > languages using non-English characters. The sorting goes all wrong. > > All the special letters come in between the A, so we have Þ A Ú A Í A Ó et > cetera. This is wrong, Þ should be the third last letter in the row for > example, not the first and A should be in a right order. > > I have is_IS as all the LC_X and so am even more surprised by this. > Furthermore I will have to be able to switch between is_IS and others to > get the correct sort order ( russian sort order different from icelandic > for example, simply due to the nature of the letters) live when the user > selects a different language. > > Example is here: http://www.worldfootball.org/sort-test.php > -notice how countries beginning with A are surrounded by various others > starting with non-english letters in the Icelandic section > -in the German section you can see that Österreich comes in between the A > countries > > I'm sure this is a simple matter but my googling for the last 3 days has > proved unsuccessful so far. > > Feedback greatly appreciated > > regards, > JBJ > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster >
On Wed, 19 Feb 2003, Troy wrote: > > I have a multi-lingual database (currently 11 languages) which sorts > > fine in MySQL (8859-1 character set) I have now converted the data to > > Unicode and compiled Postgre with unicode support. > > > > I can select and insert unicode and so was rather pleased about that. > > Until I saw that it wasn't working properly when ordering! > > The cause for the different values is the fact that unicode characters > have different numeric values from ISO8859-1 and other encodings. Only > ascii values are in sync with unicode numeric values. This I am sure you > knew. No, ISO8859-1 maps directly to unicode up to U+00FF. So the actual _numeric_ values are the same. But actual byte patterns are encoding dependent. Have you set database encoding to UTF-8? Are you using proper UTF-8 locales? POSIX compiled locales are often charset dependent. -- Antti Haapala
JBJ <postgre@totw.org> writes: > Furthermore I will have to be able to switch between is_IS and others to > get the correct sort order ( russian sort order different from icelandic > for example, simply due to the nature of the letters) live when the user > selects a different language. Postgres can't do that at the moment :-(. You might be able to hack up some custom functions to do something likeORDER BY icelandic(text_col) but it'll be messy, and probably slow. regards, tom lane
You can generate indexes for your custom functions, though, which will speed things up. This is what I've done, successfully. Troy > > JBJ <postgre@totw.org> writes: > > Furthermore I will have to be able to switch between is_IS and others to > > get the correct sort order ( russian sort order different from icelandic > > for example, simply due to the nature of the letters) live when the user > > selects a different language. > > Postgres can't do that at the moment :-(. You might be able to hack up > some custom functions to do something like > ORDER BY icelandic(text_col) > but it'll be messy, and probably slow. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) >
At 20:16 19.2.2003, Troy K wrote: >You can generate indexes for your custom functions, though, >which will speed things up. This is what I've done, successfully. Sounds useful, do you have a demo of such a function? I can if all else fails sort the data using PHP but am not too fond of it when I have over 2000 rows or more as will be the case in other tables. Thanks all for the answers.
There are various examples in the example source code section of the postgres distribution, where you can find code you can use to write exactly the kind of funtion you need. I can't immediately include source code from us, but I can include the gist of how the code works. The basic idea is to convert the input data to byte values which are in the right order. If the input data is unicode, utf8, utf16, or whatever, you have to know what it is, so you can convert the data to a meaningful byte stream which can be evaluated just like an array of numbers would be. I.e. remove bytes which indicate something to the encoding and convert characters to their one byte values. E.g. if the data is UTF8, it is one or two bytes long for ISO8859_1 (upto six bytes for others), one byte for ascii and two bytes for ISO8859_1. You need to convert it to a one byte long value so comparisons at byte level will work. For pure unicode you just have to skip every other byte. 1. Source code : ... various includes. PG_FUNCTION_INFO_V1(sample_encoding_func); Datum sample_encoding_func(PG_FUNCTION_ARGS) { text * str; text * result; size_t len; if (PG_ARGISNULL(0)) PG_RETURN_NULL(); str = PG_GETARG_TEXT_P(0); len = VARSIZE(str) - VARHDRSZ; ... do your conversion thing, allocate memory for the result and return the value, doing error checking as you go. } Add the function to your db: DROP FUNCTION sample_encoding_func (text); CREATE FUNCTION sample_encoding_func (text) RETURNS text AS 'sample_encoding_func.so' LANGUAGE 'C' WITH (iscachable,isstrict); You can create an index with: create index dummyindex on usertable using btree (sample_encoding_func(username) text_ops); Troy > > At 20:16 19.2.2003, Troy K wrote: > >You can generate indexes for your custom functions, though, > >which will speed things up. This is what I've done, successfully. > > Sounds useful, do you have a demo of such a function? > > I can if all else fails sort the data using PHP but am not too fond of it > when I have over 2000 rows or more as will be the case in other tables. > > Thanks all for the answers. > > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) >
You are right, of course. I was thinking in terms of the encoded data. Applications usually get data in UTF8 or UTF16. If the input data is true unicode, then there is no difference in the byte values (just skip the 0x00 bytes). Cheers, Troy > > > On Wed, 19 Feb 2003, Troy wrote: > > > > I have a multi-lingual database (currently 11 languages) which sorts > > > fine in MySQL (8859-1 character set) I have now converted the data to > > > Unicode and compiled Postgre with unicode support. > > > > > > I can select and insert unicode and so was rather pleased about that. > > > Until I saw that it wasn't working properly when ordering! > > > > The cause for the different values is the fact that unicode characters > > have different numeric values from ISO8859-1 and other encodings. Only > > ascii values are in sync with unicode numeric values. This I am sure you > > knew. > > No, ISO8859-1 maps directly to unicode up to U+00FF. So the actual > _numeric_ values are the same. But actual byte patterns are encoding > dependent. > > Have you set database encoding to UTF-8? Are you using proper UTF-8 > locales? POSIX compiled locales are often charset dependent. > > -- > Antti Haapala > > > >
There are various examples in the example source code section of the postgres distribution, where you can find code you can use to write exactly the kind of funtion you need. I can't immediately include source code from us, but I can include the gist of how the code works. The basic idea is to convert the input data to byte values which are in the right order. If the input data is unicode, utf8, utf16, or whatever, you have to know what it is, so you can convert the data to a meaningful byte stream which can be evaluated just like an array of numbers would be. I.e. remove bytes which indicate something to the encoding and convert characters to their one byte values. E.g. if the data is UTF8, it is one or two bytes long for ISO8859_1 (upto six bytes for others), one byte for ascii and two bytes for ISO8859_1. You need to convert it to a one byte long value so comparisons at byte level will work. For pure unicode you just have to skip every other byte. 1. Source code : ... various includes. PG_FUNCTION_INFO_V1(sample_encoding_func); Datum sample_encoding_func(PG_FUNCTION_ARGS) { text * str; text * result; size_t len; if (PG_ARGISNULL(0)) PG_RETURN_NULL(); str = PG_GETARG_TEXT_P(0); len = VARSIZE(str) - VARHDRSZ; ... do your conversion thing, allocate memory for the result and return the value, doing error checking as you go. } Add the function to your db: DROP FUNCTION sample_encoding_func (text); CREATE FUNCTION sample_encoding_func (text) RETURNS text AS 'sample_encoding_func.so' LANGUAGE 'C' WITH (iscachable,isstrict); You can create an index with: create index dummyindex on usertable using btree (sample_encoding_func(username) text_ops); Troy > > At 20:16 19.2.2003, Troy K wrote: > >You can generate indexes for your custom functions, though, > >which will speed things up. This is what I've done, successfully. > > Sounds useful, do you have a demo of such a function? > > I can if all else fails sort the data using PHP but am not too fond of it > when I have over 2000 rows or more as will be the case in other tables. > > Thanks all for the answers. > > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) >