Re: unicode and sorting(at least) - Mailing list pgsql-general

From Joel Matthew
Subject Re: unicode and sorting(at least)
Msg-id 20040625102921.A4DD.REES@ddcom.co.jp
In response to Re: unicode and sorting(at least)  (Dennis Gearon <gearond@fireserve.net>)
List pgsql-general
On Thu, 24 Jun 2004 09:06:44 -0700
Dennis Gearon <gearond@fireserve.net> wrote

> All of the ISO 8xxx encodings and LATINX encodings can handle two
> languages, English and at least one other. Sometimes they can handle
> several languages besides English, and are actually designed to handle a
> family of languages.

And that's where the confusion with locales started, I think. But it's
not really true. Every encoding can handle the Latin/English based
computer (C) locale plus one other. In one specific case, it's the C
locale plus (almost real) English, even.

> The ONLY encodings that can handle a significant amount of multiple
> languages and character sets are the ISO/UTF/UCS series. (UCS is giving
> way to UTF).

Providing a little emphasis, here:

Unicode is a character set. There is currently one encoding defined for
it, but there are several transformations of that encoding, which we
refer to as UTF-n where n tells us something about the bit width that
the transformation is optimized for.
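
A quick way to see what the "n" means in practice is to encode a few code points under each transformation and count the bytes. This is only a minimal sketch in Python; the sample characters and the widths() helper are my own picks for illustration, not anything from the standard.

```python
# Sample code points: a, e-acute, euro sign, musical G clef (outside the BMP).
samples = ["a", "\u00e9", "\u20ac", "\U0001d11e"]

def widths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32.

    The little-endian variants are used so that no byte order mark
    is prepended and the counts reflect the character alone.
    """
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for ch in samples:
    print(f"U+{ord(ch):04X}", widths(ch))
```

UTF-8 grows one byte at a time, UTF-16 grows two bytes at a time, and UTF-32 always spends four bytes per code point.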

In theory, there could be other encodings of Unicode. (No one expects it
to actually happen, but they did try to leave a door open, just in case.)

> In fact they can handle every human language ever used,

Again, adding some emphasis here, (almost) every language currently
known in our modern society can be handled about as well as or better
with Unicode than we were able to handle English with just ASCII. (I
have a personal interest in some of the dark corners, but that's OT.)

> plus some esoteric ones postulated, and there is room for future
> languages.

(Can't you just feel the chills run up and down your spine? This could
be a wild ride, boys! heh. A little drama for your lunch hour.)

> ...
>
> UTF8/16/32 is built the same way. However, this only applies per
> character, and only works painlessly on UTF32, which has fixed width
> characters.

Again, a little point of emphasis: UTF32 is fixed width on the code
points, but there are still composite characters, where a base character
is followed by one or more combining marks. (You only thought it
was safe to go back to the beach.)
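
To make the composite character point concrete, here is a small Python sketch. The choice of e-acute as the example is mine; the standard library's unicodedata module does the normalization.

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single precomposed code point
decomposed = "e\u0301"   # plain e followed by COMBINING ACUTE ACCENT

# Both display as the same character, but the code point counts differ.
composed_len = len(composed)
decomposed_len = len(decomposed)

# Even in UTF-32, the decomposed form takes two 4-byte units.
decomposed_utf32_bytes = len(decomposed.encode("utf-32-le"))

# A naive comparison treats them as different strings.
naive_equal = composed == decomposed

# Normalizing to NFC folds the pair into the precomposed form.
normalized_equal = unicodedata.normalize("NFC", decomposed) == composed

print(composed_len, decomposed_len, decomposed_utf32_bytes, naive_equal, normalized_equal)
```

So "one character" and "one fixed-width unit" are not the same thing, even in UTF32, and comparing or sorting without normalizing first can surprise you.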

> UTF8/16 OTOH, have variable length characters (in multiples
> of 8

or 16

> bits). Since SQL_ASCII sorts in a binary fashion, UTF8/16 won't
> sort correctly under SQL_ASCII locale, I believe.

It might almost sort well enough to cause you real pain later, too.
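
Here is the kind of thing I mean, as a minimal Python sketch. The word list and the binary_sort() helper are made up for illustration; sorting on the raw UTF-8 bytes stands in for what a binary comparison would do.

```python
def binary_sort(items):
    """Sort strings by their raw UTF-8 bytes, as a binary comparison would."""
    return sorted(items, key=lambda s: s.encode("utf-8"))

# Plain lowercase ASCII comes out looking perfectly reasonable.
ascii_words = ["zebra", "apple", "mango"]
print(binary_sort(ascii_words))

# Mix in capitals and an accented letter and the order goes wrong:
# every capital sorts before every lowercase letter, and the
# e-acute (lead byte 0xC3 in UTF-8) lands after "z".
mixed_words = ["zebra", "\u00e9clair", "apple", "Zulu"]
print(binary_sort(mixed_words))
```

As long as the test data is plain ASCII, nobody notices. The trouble shows up once real names and accented words get into the table.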

> Tatsuo Ishii wrote:
>
> >>On Wed, 23 Jun 2004, Dennis Gearon wrote:
> >>
> >>
> >>>This is what has to be eventually done:(as sybase, and probably others do it)
> >>>
> >>>    http://www.ianywhere.com/whitepapers/unicode.html
> >>
> >>Actually, what probably has to be eventually done is what's in the SQL
> >>spec.
> >>
> >>Which is AFAICS basically:
> >> Allow multiple encodings
> >> Allow multiple character sets (within an encoding)
> >
> >
> > Could you please explain the above in more detail? In my understanding a
> > character set can have multiple encodings but...

Well, UTF-8 was originally, IIRC, intended to be a _Universal_ly
applicable transformation. (The scheme would apply as easily to the JIS
character sets.) But I don't think that's what they were talking about.

I think they are talking about multiple major locales (or character
subsets) within Unicode. (I could be wrong, of course.)

>>> ...

--
Joel <rees@ddcom.co.jp>

