Thread: Thoughts on multiple simultaneous code page support
I've seen a number of RDBMSs that require that an entire database (whatever they call a database) be in a single code page. There are disadvantages to this. With those disadvantages in mind, and with the idea that it is worth examining options other than Unicode, I typed up a list of some of the ways that databases could handle many different national languages and code pages.

Suppose you have to develop a database that will store text strings for a large number of languages cutting across several code pages. Well, you have a few choices:

1) Make the entire database Unicode (provided that your RDBMS supports Unicode). At least this way you know that you can store any text that you have. Whether it is Swedish, Russian, Japanese, German, or English is not a problem. The problem is that all CHAR fields in all tables are now Unicode. That makes all text storage take up more room. It also makes sorting and indexing take more time. It also requires translation between Unicode and the code page that any particular client PC is using.

Aside: DB2 has a feature whereby the client PC can declare itself to the RDBMS as being in a particular code page, and then DB2 does all the back-end to client-end code page translation in both directions.

2) Allow each table to be in a particular code page. This is not too satisfying. How can you do foreign keys between tables when the two tables are in two different code pages? You'd have to ignore the differences and assume that only the first 127 characters are used and that they are the same in both code pages.

3) Allow individual columns to have code pages declared for them. This is okay, but it has downsides: A) For every national language, or at least for every code page, you want to support you have to create a new column. Picture a translation table. You might have 30 languages you translate into. So what do you do? Make 30 columns?
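The DB2 aside above can be sketched in a few lines: the server keeps all text in Unicode, each client declares its code page, and text is transcoded at the boundary in both directions. This is a hypothetical Python sketch of the idea, not DB2's actual mechanism; all names here are invented for illustration.

```python
# Hypothetical sketch: server-side Unicode storage with per-client
# code page translation (modeled loosely on the DB2 feature above).
SERVER_STORE = {}  # token_id -> str: the "database" holds Unicode text

def client_write(token_id, raw_bytes, client_codepage):
    """The client sends bytes in its own code page; the server decodes
    them to Unicode before storing."""
    SERVER_STORE[token_id] = raw_bytes.decode(client_codepage)

def client_read(token_id, client_codepage):
    """On the way out, the server re-encodes into whatever code page
    the reading client declared."""
    return SERVER_STORE[token_id].encode(client_codepage)

# A DOS Latin-1 (CP850) client writes; an ISO 8859-1 client reads back
# the same text as different bytes.
client_write(1, "Grüße".encode("cp850"), "cp850")
print(client_read(1, "latin-1"))
```

The point of the sketch is that the cost Randall mentions is real: every read and write pays a transcoding step, but no client ever sees bytes it cannot interpret.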
You'd have a column that was a TokenID describing what phrase or word the row represents, then all the translation columns. Having a column per language results in more columns than having a column per code page. But a column per code page results in more rows. The reason is that you might put French, German, English, and several other languages in ISO 8859-1; they each need to go on a different row. But if only Japanese goes into the Shift-JIS (e.g. CP932 or CP942) column, then only one of those rows has Japanese in it. Do you put the Japanese on the same row as the English translation, or on its own row? You end up with a sparse matrix appearance if you do one column for each code page, but a lot more columns if you do one column for each language. Then you run into limits on how many columns and how many bytes a particular RDBMS can support.

4) Mix code pages in a single column, so that different rows may have different code pages. I've done this in DB2. One gives up the ability to index that column -- after all, whose code page collation rules do you use? That limitation is okay if the table is a translation table for token IDs that are used in the development of software. Basically, the programmers writing some app use token ID constants in their code to specify what text gets displayed, and the database contains the mappings from those token IDs to the various national languages. In the case where I did this, the database is populated from some outside source of data that comes from translators who don't even have direct access to the database. The table then is just a storehouse that is indexed on the token ID and national language fields. Of course, one could come up with some scheme whereby the RDBMS would somehow know, for each row and field, what its code page is.
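The column-per-language vs. column-per-code-page trade-off above can be made concrete with a toy example -- plain Python dicts standing in for table rows, with invented column names:

```python
# One column per language: a single row per token, but one column for
# every translated language (30+ in the scenario described above).
per_language = {
    "token_id": 100,
    "en": "File", "fr": "Fichier", "de": "Datei", "ja": "ファイル",
    # ... and so on, one column per language
}

# One column per code page: fewer columns, but the Latin-1 languages
# must each take their own row, leaving the Shift-JIS cell empty on
# most rows -- the "sparse matrix appearance".
per_codepage = [
    {"token_id": 100, "lang": "en", "iso_8859_1": "File",    "shift_jis": None},
    {"token_id": 100, "lang": "fr", "iso_8859_1": "Fichier", "shift_jis": None},
    {"token_id": 100, "lang": "de", "iso_8859_1": "Datei",   "shift_jis": None},
    {"token_id": 100, "lang": "ja", "iso_8859_1": None,      "shift_jis": "ファイル"},
]

# Count the empty text cells in the code-page layout:
empty = sum(1 for row in per_codepage
            for col in ("iso_8859_1", "shift_jis") if row[col] is None)
print(empty, "of", 2 * len(per_codepage), "text cells are empty")
```

With only two code pages, half the text cells are already empty; add more code pages and the table gets sparser still.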
One would need a way to declare a field as having a variable code page, and then a way to set a code page value along with the text for a given field. I'm not arguing for this approach; just pointing it out for completeness.

Note that in order to support the mixing of code pages in a column you have to have one of two conditions: A) The column is declared in some code page, but the RDBMS cannot enforce the requirement that all text going into that column be in the set of encodings that are legal in that code page. Some of the code pages that will be put in rows in that column may use encodings that are not legal in the declared code page. B) Have a binary character type that is not in any code page; e.g. see DB2's CLOB, DBCLOB, VARGRAPHIC and other similar types. I forget which one of those I used, but I have used one of them.
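Option 4 can be sketched with SQLite standing in for DB2: the text column is a BLOB (playing the role of the binary character types mentioned above, i.e. condition B), and a separate column records which code page each row's bytes use. Table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE translations (
        token_id INTEGER,
        lang     TEXT,
        codepage TEXT,   -- which encoding this row's bytes are in
        raw_text BLOB,   -- opaque bytes: no single declared code page
        PRIMARY KEY (token_id, lang)
    )
""")

rows = [
    (1, "en", "ascii",     "File".encode("ascii")),
    (1, "fr", "latin-1",   "Fichier".encode("latin-1")),
    (1, "ja", "shift_jis", "ファイル".encode("shift_jis")),
]
conn.executemany("INSERT INTO translations VALUES (?, ?, ?, ?)", rows)

# The database can still index token_id and lang; decoding the text
# happens client-side, using the per-row code page tag. Indexing or
# collating raw_text itself would be meaningless -- whose rules?
for token_id, lang, cp, raw in conn.execute(
        "SELECT * FROM translations ORDER BY lang"):
    print(token_id, lang, raw.decode(cp))
```

Note the usable indexes are exactly the ones Randall describes: token ID and national language, never the mixed-encoding text itself.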
> 1) Make the entire database Unicode
> ...
> It also makes sorting and indexing take more time.

Mentioned in my other email, but what collation order were you proposing to use? Binary might be OK for unique keys, but that doesn't help you for '<', '>' etc.

My expectation (not the same as I'd like to see, necessarily, and not that my opinion counts -- I'm not a developer) would be that each database have a locale, and that this locale's collation order be used for indexing, LIKE, '<', '>' etc. If you want to store data from multiple human languages, using a locale that has Unicode for its character set would be appropriate/necessary.

Regards,

Giles
On Thu, 22 Jun 2000 11:17:14 +1000, Giles Lean wrote:

>> 1) Make the entire database Unicode
>> ...
>> It also makes sorting and indexing take more time.
>
> Mentioned in my other email, but what collation order were you
> proposing to use? Binary might be OK for unique keys but that doesn't
> help you for '<', '>' etc.

To use Unicode on a field that can have indexes defined on it does require one single big collation order table that determines the relative order of all the characters in Unicode. Surely there must be a standard for this that is part of the Unicode spec? Or part of the ISO/IEC 10646 spec?

One optimization doable on this would be to allow the user to declare to the RDBMS what subset of Unicode he is going to use. So, for instance, someone who is only handling European languages might just say he wants to use 8859-1 thru 8859-9. Or a Japanese company might throw in some more code pages, but still not bring in code pages for languages for which they do not create manuals. That would make the collation table _much_ smaller.

I don't know anything about the collation order of Asian character sets. My guess, though, is that each in toto is either greater or lesser than the various Euro pages, though the non-shifted part of Shift-JIS would be equal to its ASCII equivalents.

> My expectation (not the same as I'd like to see, necessarily, and not
> that my opinion counts -- I'm not a developer) would be that each
> database have a locale, and that this locale's collation order be used
> for indexing, LIKE, '<', '>' etc.

Characters like '<' and '>' already have standard collation orders vis-a-vis the other parts of ASCII. I doubt these things vary by locale. But maybe I'm wrong.

> If you want to store data from
> multiple human languages using a locale that has Unicode for its
> character set would be appropriate/necessary.
So you are saying that the same characters can have a different collation order when they appear in different locales, even if they have the same encoding in all of them? If so, then Unicode is really not a locale. It's an encoding, but it is not a locale.

> Regards,
>
> Giles
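The collation question is easy to demonstrate: sorting Unicode strings by raw code point gives an order no human language expects. (To answer the question raised above: there is a standard for this, the Unicode Collation Algorithm, Unicode Technical Standard #10, with a default collation table and locale-specific tailorings of it.)

```python
# 'é' is U+00E9, numerically greater than 'z' (U+007A), so a binary /
# code-point sort puts "éclair" after "zebra". That is fine for unique
# keys, but wrong for any human-facing ordering.
words = ["zebra", "éclair", "apple"]
print(sorted(words))  # code-point order: éclair sorts last

# French collation would give ["apple", "éclair", "zebra"]; getting that
# needs a locale-aware collation (e.g. locale.strxfrm with a French
# locale installed, or an ICU binding) rather than plain sorted().
```

This is exactly the split in the thread: Unicode settles the encoding, while a locale settles the ordering, and the two are independent choices.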
Guys, can I ask a question?

What is "code page"? Is it sort of M$'s terminology?
--
Tatsuo Ishii
Code Page is terminology that predates MS. I certainly first learned of it from IBM documents.

Shift-JIS is code page 932 or 942 (942 contains about a half dozen more characters than 932). The original US IBM PC used Code Page 437. In Europe it used Code Page 850, which is a Latin-1 code page. MS invented Code Page 1252, which was their Latin-1 code page. ISO 8859-1 is just another Latin-1 code page looked at from that perspective.

On Thu, 22 Jun 2000 12:51:02 +0900, Tatsuo Ishii wrote:

> Guys, can I ask a question?
>
> What is "code page"? Is it sort of M$'s terminology?
> --
> Tatsuo Ishii
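The difference between these code pages is easy to see: the same byte maps to a different character in each. A quick check using Python's codec names for the code pages mentioned above (cp437, cp1252, latin-1):

```python
b = b"\xe9"  # one byte, several meanings

print(b.decode("cp437"))    # 'Θ' on the original US IBM PC code page
print(b.decode("latin-1"))  # 'é' in ISO 8859-1
print(b.decode("cp1252"))   # 'é' in MS's Latin-1 (agrees with 8859-1 here)
```

All three are "the character with code 0xE9"; which character that is depends entirely on which code page you read the byte under.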
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Guys, can I ask a question?
> What is "code page"? Is it sort of M$'s terminology?

Randall already pointed out that the term predates M$, but he didn't quite answer the first question. "Code page" means "character set", no more nor less AFAICT. For example, ISO 8859-1 would be a specific code page. The term seems to imply a set of no more than about 256 symbols --- I doubt anyone ever called Unicode a code page, for instance ...

regards, tom lane
> Randall already pointed out that the term predates M$, but he didn't
> quite answer the first question. "Code page" means "character set",
> no more nor less AFAICT.

Oh, I see.

Randall, why do we need to use the term then? Shouldn't we use the more standard term "character set" or "charset" instead?
--
Tatsuo Ishii
Well, to me at least the term "character set" does not define a mapping or encoding. It just specifies a list of characters; their numeric representations or mappings are not included. To say "character set mapping" or "character set encoding" might be more complete. Though I tend to use the term "code page" because that's what I've heard the most down thru the years.

If someone here wants to suggest a particular terminology to use, I'd be happy to adopt it in this list.

On Fri, 23 Jun 2000 00:14:43 +0900, Tatsuo Ishii wrote:

> Oh, I see.
>
> Randall, why do we need to use the term then? Shouldn't we use more
> standard term "character set" or "charset" instead?
On Thu, 22 Jun 2000, Randall Parker wrote:

> Well, to me at least the term "character set" does not define a mapping
> or encoding. It just specifies a list of characters; their numeric
> representations or mappings are not included.
>
> To say "character set mapping" or "character set encoding" might be
> more complete. Though I tend to use the term "code page" because
> that's what I've heard the most down thru the years.
>
> If someone here wants to suggest a particular terminology to use I'd
> be happy to adopt it in this list.

Code pages are used by the samba folks also, if this helps any ... I never knew what it meant before, but now that I do, it makes perfect sense ... it's like using tuples/fields vs rows/columns :) one is right, the other is lazy :)
> > Well, to me at least the term "character set" does not define a mapping
> > or encoding. It just specifies a list of characters; their numeric
> > representations or mappings are not included.
> >
> > To say "character set mapping" or "character set encoding" might be
> > more complete. Though I tend to use the term "code page" because
> > that's what I've heard the most down thru the years.

I think the problem with "code page" is that it only covers character sets recognized by M$. For example, one of the KANJI character sets, called "JIS X 0212", is in the standard ISO 2022, but has no code page.

> > If someone here wants to suggest a particular terminology to use I'd
> > be happy to adopt it in this list.

The term "character set" defined in ISO 2022 definitely does not define a mapping or encoding, as Randall said. But in SQL9x, it includes "a list of characters" (called a "repertoire") and an encoding (called a "form of use"). I guess we could agree that we discuss how to implement SQL9x in this list. If so, it would be more natural to use the term "character set" as defined in SQL9x, rather than "code page", no?

> codepages are used by the samba folks also, if this helps any ... I never
> knew what it meant before, but now that I do, makes perfect sense ...

That's because samba only handles character sets defined by M$.
--
Tatsuo Ishii
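The SQL9x split between "repertoire" (which characters exist) and "form of use" (how they become bytes) shows up directly in Python, where a str holds abstract characters and each encoding is a separate form of use for the same repertoire:

```python
ch = "é"  # one character from the repertoire: U+00E9

# Three different "forms of use" for the same character:
print(ch.encode("latin-1"))    # b'\xe9'      -- one byte
print(ch.encode("utf-8"))      # b'\xc3\xa9'  -- two bytes
print(ch.encode("utf-16-be"))  # b'\x00\xe9'  -- two bytes, different ones
```

Decoding any of the three byte strings with its matching encoding yields the same character back, which is exactly why "character set" alone underspecifies what is on the wire.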