Re: unicode and sorting(at least) - Mailing list pgsql-general
From | Joel |
---|---|
Subject | Re: unicode and sorting(at least) |
Date | |
Msg-id | 20040625112035.A4E0.REES@ddcom.co.jp |
In response to | Re: unicode and sorting(at least) (Tatsuo Ishii <t-ishii@sra.co.jp>) |
List | pgsql-general |
On Fri, 25 Jun 2004 10:19:05 +0900 (JST)
Tatsuo Ishii <t-ishii@sra.co.jp> wrote

> > All of the ISO 8xxx encodings and LATINX encodings can handle two
> > languages, English and at least one other. Sometimes they can handle
> > several languages besides English, and are actually designed to
> > handle a family of languages.
>
> ISO 8xxx series are not encodings but character sets. For example,
> ISO-8859-1 can be expressed in 8-bit encoding form, it also can be
> expressed in 7-bit encoding form. This is called ISO-2022. I know that
> PostgreSQL treats ISO-8859-1 as an encoding but it's just a short hand
> for "8-bit encoded ISO-8859-1".
>
> Also, let's not mix together "languages" and "character
> sets". Languages are defined by humans, not by computers, while
> character sets are perfectly definable by computers. The more important
> thing is that a language can be expressed in several character
> sets. For example, the Japanese language can be expressed in EUC-JP, of
> course. It can also be expressed in ASCII by using ROMAJI script.

(Which isn't to say that everyone will find romanized Japanese easy to
read for meaning.)

But we should point out that there are several variations on the
romanization of Japanese (some of which are anything but regular).

> What I want to say here is talking about "languages" is almost
> useless and we have to talk about character sets and encodings.
>
> > The ONLY encodings that can handle a significant amount of multiple
> > languages and character sets are the ISO/UTF/UCS series. (UCS is
> > giving way to UTF). In fact they can handle every human language
> > ever used, plus some esoteric ones postulated, and there is room for
> > future languages.
> >
> > So, for a column to handle multiple languages/character sets, the
> > languages/character sets have to be in the family that the
> > database's encoding was defined for (in postgres currently, choosing
> > encoding down to the column level is available on several databases
> > and is the SQL spec), OR, the encoding for the database has to be
> > UTF8 (since we don't have UTF16 or UTF32 available)
> >
> > Right now, the SORTING algorithm and functionality is fixed for the
> > database cluster, which contains databases of any kind of encodings.
> > It really does not do much good to have a different locale than the
> > encoding, except for UTF8, which as an encoding is
> > language/character set neutral, or SQL_ASCII and an ISO8xxx or
> > LatinX encoding. Since a running instance of Postgres can only be
> > connected to one cluster, a database engine has FIXED sorting, no
> > matter what language/character set encoding is chosen for the
> > database.
>
> The sorting order problem is not necessarily limited to the "cluster
> vs. locale" one. My example about ROMAJI above raises another question:
> "How to sort ROMAJI Japanese?" If we regard it as just ASCII strings, we
> could sort it in alphabetical order. But if we regard it as Japanese,
> probably sorting in alphabetical order is not appropriate.

I think we should say that, while there are some contexts in which
ordinary alphabetic order would be okay, there are some, for instance,
in which we'd want to mirror the kana order as much as possible.

(Not exactly a straightforward map-this-code-point-to-this-collation-value
exercise, but should be doable.)

> This
> example shows that the sorting order should be defined by users or
> applications, not by systems or DBMSs. This is why the SQL standard
> has "COLLATION" concept IMO.
> ...

--
Joel <rees@ddcom.co.jp>
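
For what mirroring the kana order might look like in practice, here is a
minimal sketch in Python. It assumes a simplified Hepburn-style
romanization limited to the basic unvoiced syllables; long vowels, voiced
sounds, small kana, and the competing romanization schemes mentioned above
are all ignored. The names GOJUON, INDEX, and kana_key are hypothetical,
chosen only for this illustration.

```python
# Hypothetical, simplified table: basic unvoiced syllables in gojuon (kana) order.
GOJUON = (
    "a i u e o "
    "ka ki ku ke ko "
    "sa shi su se so "
    "ta chi tsu te to "
    "na ni nu ne no "
    "ha hi fu he ho "
    "ma mi mu me mo "
    "ya yu yo "
    "ra ri ru re ro "
    "wa wo n"
).split()

# Map each romanized syllable to its position in kana order.
INDEX = {syllable: pos for pos, syllable in enumerate(GOJUON)}


def kana_key(word):
    """Return a sort key: the kana-order positions of the word's syllables."""
    s = word.lower()
    key = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, so "shi" and "tsu" are matched whole
        # and "ni" is preferred over a bare "n".
        for size in (3, 2, 1):
            chunk = s[i:i + size]
            if chunk in INDEX:
                key.append(INDEX[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"cannot split {word!r} into known syllables at {s[i:]!r}")
    return key


words = ["tsuki", "sakura", "chikara", "inu", "ame", "umi", "kawa", "shima", "te"]

alphabetical = sorted(words)             # plain ASCII order
kana_order = sorted(words, key=kana_key) # order mirroring the kana table

print(alphabetical)
# ['ame', 'chikara', 'inu', 'kawa', 'sakura', 'shima', 'te', 'tsuki', 'umi']
print(kana_order)
# ['ame', 'inu', 'umi', 'kawa', 'sakura', 'shima', 'chikara', 'tsuki', 'te']
```

The two results differ because the key is built per syllable rather than
per code point: "chikara" sorts after "shima" and "umi" moves ahead of
"kawa", which no single letter-to-weight mapping over ASCII would produce.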