Re: unicode and sorting(at least) - Mailing list pgsql-general

From Joel
Subject Re: unicode and sorting(at least)
Msg-id 20040625112035.A4E0.REES@ddcom.co.jp
In response to Re: unicode and sorting(at least)  (Tatsuo Ishii <t-ishii@sra.co.jp>)
List pgsql-general
On Fri, 25 Jun 2004 10:19:05 +0900 (JST)
Tatsuo Ishii <t-ishii@sra.co.jp> wrote

> > All of the ISO 8xxx encodings and LATINX encodings can handle two
> > languages, English and at least one other. Sometimes they can handle
> > several languages besides English, and are actually designed to handle
> > a family of languages.
>
> ISO 8xxx series are not encodings but character sets. For example,
> ISO-8859-1 can be expressed in 8-bit encoding form, it also can be
> expressed in 7-bit encoding form. This is called ISO-2022. I know that
> PostgreSQL treats ISO-8859-1 as an encoding but it's just a shorthand
> for "8-bit encoded ISO-8859-1".
>
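
A rough Python sketch of that character set vs. encoding distinction. It
uses JIS X 0208 rather than ISO-8859-1, only because Python's standard
codecs happen to ship both an 8-bit form (euc_jp) and a 7-bit ISO-2022
form (iso2022_jp) of that set. The sample text and variable names are
mine, purely for illustration:

```python
text = "日本語"  # three kanji, all in the JIS X 0208 character set

eight_bit = text.encode("euc_jp")      # 8-bit form: every byte has the high bit set
seven_bit = text.encode("iso2022_jp")  # 7-bit form: escape sequences switch sets

# Drop the two 3-byte escape sequences: the one that designates
# JIS X 0208 at the start and the one that returns to ASCII at the end.
payload = seven_bit[3:-3]

# Same positions in the character set; only the high bit differs.
same_characters = bytes(b | 0x80 for b in payload) == eight_bit
print(eight_bit, seven_bit, same_characters)
```

So one character set, two encodings, and the bytes stored differ even
though the characters do not.
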
> Also, let's not mix together "languages" and "character
> sets". Languages are defined by humans, not by computers, while
> character sets are perfectly definable by computers. More important,
> a language can be expressed in several character
> sets. For example, Japanese can be expressed in EUC-JP, of
> course. It can also be expressed in ASCII by using ROMAJI script.

(Which isn't to say that everyone will find romanized Japanese easy to
read for meaning.)

But we should point out that there are several variations on the
romanization of Japanese (some of which are anything but regular).
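
To make that concrete, here is a small Python sketch, assuming the word
しんぶん ("newspaper") as a sample and showing just two of the common
romanization systems. The dictionary and variable names are hypothetical,
chosen only for this illustration:

```python
kana = "しんぶん"  # "newspaper", written in hiragana

# Two common romanizations of the same word (there are others).
romanizations = {"Hepburn": "shinbun", "Kunrei-shiki": "sinbun"}

# One spelling in kana, hence one byte string in EUC-JP
# (two bytes per hiragana character).
euc_bytes = kana.encode("euc_jp")

# Several spellings in romaji, hence several different ASCII byte strings.
ascii_bytes = {name: r.encode("ascii") for name, r in romanizations.items()}

print(len(euc_bytes), ascii_bytes)
```

Any comparison or sort done on the ASCII form therefore depends on which
romanization the data happened to use.
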

> What I want to say here is talking about "languages" is almost
> useless and we have to talk about character sets and encodings.
>
> > The ONLY encodings that can handle a significant number of languages
> > and character sets are the ISO/UTF/UCS series. (UCS is giving way to
> > UTF.) In fact they can handle every human language ever used, plus some
> > esoteric ones postulated, and there is room for future languages.
> >
> > So, for a column to handle multiple languages/character sets, the
> > languages/character sets have to be in the family that the database's
> > encoding was defined for (in postgres currently; choosing encoding down
> > to the column level is available on several databases and is the SQL
> > spec), OR the encoding for the database has to be UTF8 (since we don't
> > have UTF16 or UTF32 available).
> >
> > Right now, the SORTING algorithm and functionality is fixed for the
> > database cluster, which contains databases of any kind of encodings. It
> > really does not do much good to have a different locale than the
> > encoding, except for UTF8, which as an encoding is language/character
> > set neutral, or SQL_ASCII and an ISO8xxx or LatinX encoding. Since a
> > running instance of Postgres can only be connected to one cluster, a
> > database engine has FIXED sorting, no matter what language/character set
> > encoding is chosen for the database.
>
> The sorting order problem is not necessarily limited to the "cluster
> vs. locale" one. My example about ROMAJI above raises another question:
> "How to sort ROMAJI Japanese?" If we regard it as just ASCII strings, we
> could sort it in alphabetical order. But if we regard it as Japanese,
> sorting in alphabetical order is probably not appropriate.

I think we should say that, while there are some contexts in which
ordinary alphabetic order would be okay, there are some, for instance,
in which we'd want to mirror the kana order as much as possible. (Not
exactly a straightforward map-this-code-point-to-this-collation-value
exercise, but should be doable.)
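
As a minimal sketch of what mirroring the kana order might look like, here
is some Python, assuming Hepburn-style romaji and only the basic unvoiced
syllables. The table GOJUON, the function kana_key and the word list are
all hypothetical names of mine, not anything PostgreSQL provides:

```python
# Basic syllables in gojuon (kana table) order, Hepburn spelling.
GOJUON = (
    "a i u e o ka ki ku ke ko sa shi su se so ta chi tsu te to "
    "na ni nu ne no ha hi fu he ho ma mi mu me mo ya yu yo "
    "ra ri ru re ro wa wo n"
).split()
RANK = {syllable: position for position, syllable in enumerate(GOJUON)}


def kana_key(romaji):
    """Turn a romaji word into a list of kana-table positions."""
    key = []
    i = 0
    while i < len(romaji):
        # Greedy longest match: try 3 letters ("shi"), then 2, then 1.
        for size in (3, 2, 1):
            chunk = romaji[i:i + size]
            if chunk in RANK:
                key.append(RANK[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"unsupported syllable at {romaji[i:]!r}")
    return key


words = ["hana", "inu", "kani", "neko", "sakura", "tora", "ushi"]

alphabetical = sorted(words)            # plain ASCII order
kana_order = sorted(words, key=kana_key)

print(alphabetical)  # ['hana', 'inu', 'kani', 'neko', 'sakura', 'tora', 'ushi']
print(kana_order)    # ['inu', 'ushi', 'kani', 'sakura', 'tora', 'neko', 'hana']
```

The greedy matching is where the "not exactly straightforward" part shows
up: "kani" is read here as ka + ni, although ka + n + i is also a valid
kana spelling, and voiced sounds, long vowels and the other romanization
systems are not handled at all.
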

> This
> example shows that the sorting order should be defined by users or
> applications, not by systems or DBMSs. This is why the SQL standard
> has "COLLATION" concept IMO.
> ...


--
Joel <rees@ddcom.co.jp>

