Re: [PATCHES] Postgres-6.3.2 locale patch - Mailing list pgsql-hackers

From t-ishii@sra.co.jp
Subject Re: [PATCHES] Postgres-6.3.2 locale patch
Date
Msg-id 199806040523.OAA06173@srapc451.sra.co.jp
In response to Re: [PATCHES] Postgres-6.3.2 locale patch  ("Thomas G. Lockhart" <lockhart@alumni.caltech.edu>)
Responses Re: [PATCHES] Postgres-6.3.2 locale patch
List pgsql-hackers
>Hi. I'm looking for non-English-using Postgres hackers to participate in
>implementing NCHAR() and alternate character sets in Postgres. I think
>I've worked out how to do the implementation (not the details, just a
>strategy) so that multiple character sets will be allowed in a single
>database, additional character sets can be loaded at run-time, and so
>that everything will behave transparently.

Sounds like an interesting idea... But before going into the
discussion, let me clarify what "character set" means. A character
set consists of some characters. One of the most famous character sets
is ISO 646 (almost the same as ASCII). In western Europe, the ISO 8859
series of character sets is widely used. For example, ISO 8859-1
covers English, French, German etc. and ISO 8859-2 covers Albanian,
Romanian etc. These are "single byte", and there is a one-to-many
correspondence between a character set and languages.

Example1:
ISO 8859-1 <------> English, French, German

On the other hand, some Asian languages such as Japanese, Chinese, and
Korean do not correspond to a single character set, but rather to
multiple character sets.

Example2:
ASCII, JIS X0208, JIS X0201, JIS X0212 <-------> Japanese
(ASCII, JIS X0208, JIS X0201, JIS X0212 are individual character sets)

An "encoding" is a way to represent a set of character sets in
computers. The above set of character sets is encoded in the EUC_JP
encoding.
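
To illustrate how one encoding carries several character sets, here is
a minimal Python sketch that looks at the lead byte of each character
in an EUC_JP byte string and reports which character set it comes
from. The function name classify_euc_jp and the label strings are
made up for this illustration, and the lead-byte ranges are the
commonly documented EUC_JP layout, not something taken from the
Postgres sources.

```python
def classify_euc_jp(data: bytes):
    """Split EUC_JP bytes into (character set label, raw bytes) pairs.

    Hypothetical helper for illustration only. It inspects lead bytes
    and does not validate trail bytes.
    """
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            name, width = "ASCII", 1                 # one byte, high bit clear
        elif lead == 0x8E:
            name, width = "JIS X0201 (katakana)", 2  # single-shift 2 prefix
        elif lead == 0x8F:
            name, width = "JIS X0212", 3             # single-shift 3 prefix
        elif 0xA1 <= lead <= 0xFE:
            name, width = "JIS X0208", 2             # two bytes, both high bit set
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02x} at offset {i}")
        if i + width > len(data):
            raise ValueError(f"truncated character at offset {i}")
        result.append((name, data[i:i + width]))
        i += width
    return result


# 'A' (ASCII), hiragana A (JIS X0208), half-width katakana A (JIS X0201)
sample = "A\u3042\uff71".encode("euc_jp")
for name, raw in classify_euc_jp(sample):
    print(name, raw.hex())
```

So a single EUC_JP string can mix characters from all four character
sets, and the lead byte alone is enough to tell them apart.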

I think SQL92 uses the term "character set" to mean encoding.

>So, the initial questions:
>
>1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for
>non-English applications? Do other databases use this SQL92 convention,
>or does it have difficulties?

As far as I know, there is no commercial RDBMS that supports the
NCHAR/NVARCHAR/CHARACTER SET syntax. Oracle supports multiple
encodings. The encoding for a database is defined when the database is
created and cannot be changed at runtime. Clients can use a different
encoding as long as it is a "subset" of the database's encoding. For
example, an Oracle client can use ASCII if the database encoding is
EUC_JP.
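
The "subset" relation between ASCII and EUC_JP can be checked with a
small Python sketch. It only shows that ASCII text has the same byte
representation in both encodings. It says nothing about how Oracle
actually negotiates client encodings, and the helper name
same_bytes_in_both is hypothetical.

```python
def same_bytes_in_both(text: str, client_encoding: str, db_encoding: str) -> bool:
    """Return True if text encodes to identical bytes in both encodings.

    Hypothetical helper for illustration. Text that cannot be encoded
    in one of the encodings counts as not compatible.
    """
    try:
        return text.encode(client_encoding) == text.encode(db_encoding)
    except UnicodeEncodeError:
        return False


# Pure ASCII text is stored byte-for-byte the same way in EUC_JP.
print(same_bytes_in_both("SELECT name FROM users", "ascii", "euc_jp"))

# Japanese text cannot be expressed in ASCII at all, so the check fails.
print(same_bytes_in_both("\u65e5\u672c\u8a9e", "ascii", "euc_jp"))
```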

I think the idea of the "default" encoding for a database being
defined at database creation time is nice.

create database with encoding EUC_JP;

If the NCHAR/NVARCHAR/CHARACTER SET syntax were supported, a user
could use an encoding other than EUC_JP. Sounds very nice too.

>2) Would anyone be interested in helping to define the character sets
>and helping to test? I don't know the correct collation sequences and
>don't think they would display properly on my screen...

I would be able to help you with the Japanese part. For Chinese and
Korean, I'm going to find volunteers on the local PostgreSQL mailing
list I'm running, if necessary.

>3) I'd like to implement the existing Cyrillic and EUC-jp character
>sets, and also some European languages (French and ??) which use the
>Latin-1 alphabet but might have different collation sequences. Any
>suggestions for candidates??

Collation sequences for EUC_JP? How nice it would be! One problem
with collation sequences for multi-byte encodings is that the sequence
might become huge. It seems you have a solution for that. Please let
me know more details.
--
Tatsuo Ishii
t-ishii@sra.co.jp
