Thread: utf-8 and cultural sensitive sorting

utf-8 and cultural sensitive sorting

From
Date:
Our product will be storing its character data in utf-8
format (unicode encoding).

What is the best way to achive cultural sensitive sorting
using the utf-8 data?

Is it possible have the locale apply to a connection?

If so, is the cultural sorting support mature in
PostgreSQL?

What type of performance can be expected as compared with
the normal c locale sorting?

Thanks very much,

Steve.

Re: utf-8 and cultural sensitive sorting

From
Richard Huxton
Date:
sknipe@tucows.com wrote:
> Our product will be storing its character data in utf-8 format (unicode
> encoding).
>
> What is the best way to achive cultural sensitive sorting using the
> utf-8 data?

See below.

> Is it possible have the locale apply to a connection?

A locale applies to a whole database cluster. That's one locale only.

> If so, is the cultural sorting support mature in PostgreSQL?
>
> What type of performance can be expected as compared with the normal c
> locale sorting?

Maturity and performance depend upon the underlying libraries, which
depend upon your O.S. - I recommend testing.

There has been discussion of how to support mixed locales, possibly even
within the same column but I don't know whether anyone is working on
this at the present time. The mailing list archives will contain plenty
of discussion on this.

Does that help?
--
   Richard Huxton
   Archonet Ltd

Re: utf-8 and cultural sensitive sorting

From
Alex Stapleton
Date:
It depends what language you want to sort. Lots of languages do not
have a sort alphabet. For example, Japanese. It can be quite
difficult to sort unusual languages like this. I am not aware of any
standard technique for sorting Japanese text other than keeping an
arbitrarily sorted dictionary (courtesy of whatever the most popular
Japanese dictionary at the time happens to be perhaps) and then doing
hash lookups in the for indexing values. As you can imagine, this is
not particularly fast. I have not actually tried this, but I expect
PosgreSQL will simply sort in a fairly binary fashion. As in, it gets
sorted in according to the binary value of the characters, or the
UTF-8 offsets, or something like that.

On 12 Jul 2005, at 15:48, <sknipe@tucows.com> <sknipe@tucows.com> wrote:

> Our product will be storing its character data in utf-8 format
> (unicode encoding).
>
> What is the best way to achive cultural sensitive sorting using the
> utf-8 data?
>
> Is it possible have the locale apply to a connection?
>
> If so, is the cultural sorting support mature in PostgreSQL?
>
> What type of performance can be expected as compared with the
> normal c locale sorting?
>
> Thanks very much,
>
> Steve.
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>       subscribe-nomail command to majordomo@postgresql.org so that
> your
>       message can get through to the mailing list cleanly
>


Re: utf-8 and cultural sensitive sorting

From
Tatsuo Ishii
Date:
> It depends what language you want to sort. Lots of languages do not
> have a sort alphabet. For example, Japanese. It can be quite
> difficult to sort unusual languages like this. I am not aware of any
> standard technique for sorting Japanese text other than keeping an
> arbitrarily sorted dictionary (courtesy of whatever the most popular
> Japanese dictionary at the time happens to be perhaps) and then doing
> hash lookups in the for indexing values. As you can imagine, this is
> not particularly fast. I have not actually tried this, but I expect
> PosgreSQL will simply sort in a fairly binary fashion. As in, it gets
> sorted in according to the binary value of the characters, or the
> UTF-8 offsets, or something like that.

Above is almost correct but usually sorting by the JIS code order is
enough for most Japanese applications (I believe same thing can be
said to Chinese). I do not recommend using locale for sorting
Japanese. It quite frequently happens that the locale support for
multibyte encodings is totally broken. See recent posting titled
"[GENERAL] Japanese words not distinguished" for more details.

If you have to live with UTF-8 database, I recommend turning off the
locale support and use CONVERT to sort Japanese. For example,

SELECT * FROM t1 ORDER BY CONVERT(col1 USING utf_8_to_euc_jp);

> On 12 Jul 2005, at 15:48, <sknipe@tucows.com> <sknipe@tucows.com> wrote:
>
> > Our product will be storing its character data in utf-8 format
> > (unicode encoding).
> >
> > What is the best way to achive cultural sensitive sorting using the
> > utf-8 data?
> >
> > Is it possible have the locale apply to a connection?
> >
> > If so, is the cultural sorting support mature in PostgreSQL?
> >
> > What type of performance can be expected as compared with the
> > normal c locale sorting?
> >
> > Thanks very much,
> >
> > Steve.
> >
> > ---------------------------(end of
> > broadcast)---------------------------
> > TIP 1: if posting/reading through Usenet, please send an appropriate
> >       subscribe-nomail command to majordomo@postgresql.org so that
> > your
> >       message can get through to the mailing list cleanly
> >
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>