Re: Thoughts on multiple simultaneous code page support - Mailing list pgsql-hackers

From Randall Parker
Subject Re: Thoughts on multiple simultaneous code page support
Date
Msg-id 01501518836812@mail.nls.net
In response to Thoughts on multiple simultaneous code page support  ("Randall Parker" <randall@nls.net>)
Responses Re: Thoughts on multiple simultaneous code page support
List pgsql-hackers
On Thu, 22 Jun 2000 11:17:14 +1000, Giles Lean wrote:

>
>> 1) Make the entire database Unicode
>> ...
>> It also makes sorting and indexing take more time.
>
>Mentioned in my other email, but what collation order were you
>proposing to use?  Binary might be OK for unique keys but that doesn't
>help you for '<', '>' etc.

To use Unicode on a field that can have indexes defined on it does require one single big 
collation order table that determines the relative order of all the characters in Unicode. Surely 
there must be a standard for this that is part of the Unicode spec? Or part of ISO/IEC 10646 
spec? 
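
To make the quoted point about binary order concrete, here is a minimal Python sketch. The sample words and the tiny TOY_BASE mapping are made up for illustration and are not taken from any standard collation table. It shows that comparing by raw code point is fine for uniqueness but gives an ordering no dictionary would use, which is why some collation table is needed for '<' and '>'.

```python
# Sample words are illustrative only, not from any real data set.
words = ["zebra", "Zebra", "apple", "éclair"]

# Python compares str values code point by code point, i.e. "binary" order.
binary_order = sorted(words)

# A crude stand-in for a collation table: fold each character to a base
# letter first, and use the original character only as a tie-breaker.
# TOY_BASE is an assumption for this sketch, not a standard table.
TOY_BASE = {"é": "e"}

def toy_key(word):
    """Return a sort key of (base letter, original character) pairs."""
    return [(TOY_BASE.get(ch.lower(), ch.lower()), ch) for ch in word]

collated_order = sorted(words, key=toy_key)

print(binary_order)    # uppercase first, accented letters last
print(collated_order)  # accents and case sort next to their base letter
```

A real table would have to cover every character in use, which is where the size concern comes from.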

One optimization doable on this would be to allow the user to declare to the RDBMS what 
subset of Unicode he is going to use. So, for instance, someone who is only handling 
European languages might just say he wants to use 8859-1 thru 8859-9. Or a Japanese 
company might throw in some more code pages but still not bring in code pages for 
languages for which they do not create manuals.

That would make the collation table _much_ smaller.
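
A rough way to see how much smaller, sketched in Python under the assumption that "declaring a subset" means listing code pages by name. The helper name repertoire is hypothetical. The codec names are the ones Python ships for ISO 8859-1 through 8859-9.

```python
def repertoire(codec_names):
    """Collect the Unicode characters reachable from the given 8-bit code pages."""
    chars = set()
    for name in codec_names:
        for byte in range(256):
            # Bytes a code page leaves undefined decode to "" and add nothing.
            chars.update(bytes([byte]).decode(name, errors="ignore"))
    return chars

# The subset from the example above: ISO 8859-1 thru 8859-9.
declared_subset = repertoire(f"iso8859_{n}" for n in range(1, 10))

print(len(declared_subset), "characters declared, out of", 0x110000, "code points")
```

A collation table would then need entries only for the characters in declared_subset. Anything outside it could be rejected on input.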

I don't know anything about the collation order of Asian character sets. My guess though is 
that each in toto is either greater or lesser than the various Euro pages. Though the non-
shifted part of Shift-JIS would be equal to its ASCII equivalents.

>My expectation (not the same as I'd like to see, necessarily, and not
>that my opinion counts -- I'm not a developer) would be that each
>database have a locale, and that this locale's collation order be used
>for indexing, LIKE, '<', '>' etc.  

Characters like '<' and '>' already have standard collation orders vis-a-vis the other parts of 
ASCII. I doubt these things vary by locale. But maybe I'm wrong. 

>If you want to store data from
>multiple human languages using a locale that has Unicode for its
>character set would be appropriate/necessary.

So you are saying that the same characters can have a different collation order when they 
appear in different locales even if they have the same encoding in all of them?

If so, then Unicode is really not a locale. It's an encoding, but it is not a locale. 
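
A small Python sketch of that distinction. The two alphabets below are toy assumptions covering only a-z plus "ä", and the sample words are arbitrary. The one real-world fact relied on is that Swedish treats "ä" as a separate letter after "z", while German sorts it next to "a". The strings are identical code points in both cases. Only the ordering rule differs.

```python
# Toy alphabets: assumptions for illustration, not full locale definitions.
GERMAN = "aäbcdefghijklmnopqrstuvwxyz"   # "ä" sorts alongside "a"
SWEDISH = "abcdefghijklmnopqrstuvwxyzä"  # "ä" is a separate letter after "z"

def make_key(alphabet):
    """Build a sort-key function that ranks characters by alphabet position."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    return lambda word: [rank[ch] for ch in word]

words = ["zon", "äng", "bad"]  # same encoding under either rule set

german_order = sorted(words, key=make_key(GERMAN))
swedish_order = sorted(words, key=make_key(SWEDISH))

print(german_order)
print(swedish_order)
```

So one encoding can sit underneath several collation orders, and an index built under one order is not valid under the other.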


>Regards,
>
>Giles
>
