Re: unicode and sorting(at least) - Mailing list pgsql-general
From | Joel Matthew |
---|---|
Subject | Re: unicode and sorting(at least) |
Date | |
Msg-id | 20040625102921.A4DD.REES@ddcom.co.jp |
In response to | Re: unicode and sorting(at least) (Dennis Gearon <gearond@fireserve.net>) |
List | pgsql-general |
On Thu, 24 Jun 2004 09:06:44 -0700
Dennis Gearon <gearond@fireserve.net> wrote:

> All of the ISO 8xxx encodings and LATINX encodings can handle two
> languages, English and at least one other. Sometimes they can handle
> several languages besides English, and are actually designed to handle a
> family of languages.

And that's where the confusion with locales started, I think. But it's not
really true. Every encoding can handle the Latin/English-based computer (C)
locale plus one other. In one specific case, it's the C locale plus (almost
real) English, even.

> The ONLY encodings that can handle a significant amount of multiple
> languages and character sets are the ISO/UTF/UCS series. (UCS is giving
> way to UTF).

Providing a little emphasis, here:

Unicode is a character set. There is currently one encoding defined for it,
but there are several transformations of that encoding, which we refer to as
UTF-n, where n tells us something about the bit width that the
transformation is optimized for.

In theory, there could be other encodings of Unicode. (No one expects it to
actually happen, but they did try to leave a door open, just in case.)

> In fact they can handle every human language ever used,

Again, adding some emphasis here: (almost) every language currently known in
our modern society can be handled about as well as, or better, with Unicode
than we were able to handle English with just ASCII. (I have a personal
interest in some of the dark corners, but that's OT.)

> plus some esoteric ones postulated, and there is room for future
> languages.

(Can't you just feel the chills run up and down your spine? This could be a
wild ride, boys! heh. A little drama for your lunch hour.)

> ...
>
> UTF8/16/32 is built the same way. However, this only applies per
> character, and only works painlessly on UTF32, which has fixed width
> characters.

Again, a little point of emphasis: UTF32 is fixed width on the code points,
but there are still composite characters.

(You only thought it was safe to go back to the beach.)

> UTF8/16 OTOH, have variable length characters (in multiples
> of 8 or 16 bits). Since SQL_ASCII sorts in a binary fashion, UTF8/16 won't
> sort correctly under SQL_ASCII locale, I believe.

It might almost sort well enough to cause you real pain later, too.

> Tatsuo Ishii wrote:
>
> >>On Wed, 23 Jun 2004, Dennis Gearon wrote:
> >>
> >>
> >>>This is what has to be eventually done:(as sybase, and probably others do it)
> >>>
> >>> http://www.ianywhere.com/whitepapers/unicode.html
> >>
> >>Actually, what probably has to be eventually done is what's in the SQL
> >>spec.
> >>
> >>Which is AFAICS basically:
> >> Allow multiple encodings
> >> Allow multiple character sets (within an encoding)
> >
> >
> > Could Please explain more details for above. In my understanding a
> > character set can have multiple encodings but...

Well, UTF-8 was originally, IIRC, intended to be a _Universal_ly applicable
transformation. (The scheme would apply as easily to the JIS character
sets.)

But I don't think that's what they were talking about. I think they are
talking about multiple major locales (or character subsets) within Unicode.

(I could be wrong, of course.)

>>> ...

--
Joel <rees@ddcom.co.jp>
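
A minimal Python sketch of three points made above: UTF-32 is fixed width per code point yet a composite character can still span several code points, UTF-8/16 are variable width, and a purely binary sort gives an order that looks almost right. The sample strings and the helper name `code_units` are illustrative assumptions, and nothing here is PostgreSQL-specific.

```python
import unicodedata

# Two spellings of the same visible character:
precomposed = "\u00e9"   # one code point, LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points, "e" + COMBINING ACUTE ACCENT

# Size of one code unit, in bytes, for each transformation format.
UNIT_BYTES = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

def code_units(text, encoding):
    """Count the code units `text` occupies in `encoding` (hypothetical helper)."""
    return len(text.encode(encoding)) // UNIT_BYTES[encoding]

# Even in UTF-32 the decomposed spelling takes two units: fixed width
# applies to code points, not to what a reader sees as one character.
for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, {enc: code_units(text, enc) for enc in UNIT_BYTES})

# Normalizing to NFC makes the two spellings compare equal.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Binary sort on UTF-8 bytes: uppercase ASCII sorts before lowercase,
# and every accented letter sorts after "z".
words = ["zebra", "Zebra", "apple", "\u00e9clair"]
binary_sorted = sorted(words, key=lambda w: w.encode("utf-8"))
print(ascii(binary_sorted))

# UTF-8 byte order follows code point order, but UTF-16 byte order does
# not: surrogate pairs (0xD800...) sort below BMP characters at U+E000
# and above.
bmp_char = "\uff5e"            # U+FF5E, inside the BMP
supplementary = "\U00010000"   # U+10000, needs a surrogate pair in UTF-16
utf8_order = sorted([bmp_char, supplementary], key=lambda s: s.encode("utf-8"))
utf16_order = sorted([bmp_char, supplementary], key=lambda s: s.encode("utf-16-be"))
```

The binary order puts "Zebra" ahead of "apple" and the accented word last, which is the "almost sorts well enough" trap: ASCII-only test data tends to hide it until accented or non-Latin data arrives.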