Thread: Proposal: Adding JIS X 0213 support

Proposal: Adding JIS X 0213 support

From
Tatsuo Ishii
Date:
Hi,

I would like to propose adding a new character set, "JIS X 0213"
(http://en.wikipedia.org/wiki/JIS_X_0213).

JIS X 0213 is a relatively new Japanese government standard (defined in
2000, revised in 2004) and is becoming important for Japanese
users. Moreover, some commercial OSs, including Windows Vista, support
JIS X 0213 (some open source OSs support it too, of course). So I believe
supporting JIS X 0213 in the upcoming 8.3 will be useful for Japanese
users and will help spread PostgreSQL further.

Since JIS X 0213 is a character set, we need to add encodings
supporting it. Here is the list of additional encodings (specifications
are already published by the government).

1) EUC-JIS-2004 

proposed encoding name: EUC_JIS_2004

including the following character sets:

- ASCII
- JIS X 0213 plane 1
- JIS X 0201 "katakana"
- JIS X 0213 plane 2

Note that since the encoding scheme of EUC_JIS_2004 is exactly identical
to that of EUC_JP, we can reuse the existing encoding routines defined in
utils/mb/*.c.
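
To illustrate what "same encoding scheme" means, here is a minimal
sketch in Python (not the PostgreSQL C code; the function names are
hypothetical). It assumes the usual EUC layout: ASCII is one byte, SS2
(0x8E) introduces one katakana byte, SS3 (0x8F) introduces two bytes,
and any other high-bit byte starts a two-byte character. Trailing bytes
are not validated.

```python
from typing import List

SS2 = 0x8E  # single shift 2: prefixes a JIS X 0201 katakana byte
SS3 = 0x8F  # single shift 3: prefixes a two-byte character


def euc_char_len(lead: int) -> int:
    """Return the byte length of an EUC character, given its first byte."""
    if lead < 0x80:
        return 1  # ASCII
    if lead == SS2:
        return 2  # SS2 + one katakana byte
    if lead == SS3:
        return 3  # SS3 + two bytes
    return 2      # ordinary two-byte character


def split_euc(data: bytes) -> List[bytes]:
    """Split an EUC byte string into per-character byte sequences."""
    chars = []
    i = 0
    while i < len(data):
        n = euc_char_len(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars
```

Nothing here depends on which character a byte sequence stands for, so
the same length logic applies to both encodings; only the mapping
tables differ.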

2) Shift-JIS-2004 

proposed encoding name: SHIFT_JIS_2004

including the following character sets (same as EUC-JIS-2004):

- ASCII
- JIS X 0213 plane 1
- JIS X 0201 "katakana"
- JIS X 0213 plane 2

Note that this is a client-only encoding, for the same reason as SJIS.

Note that since the encoding scheme of SHIFT_JIS_2004 is exactly
identical to that of SJIS, we can reuse the existing encoding routines
defined in utils/mb/*.c.
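
The reason is not spelled out above. One well-known property of Shift
JIS, assumed here to be what is meant, is that the second byte of a
two-byte character can fall in the ASCII range (for example 0x5C,
backslash). The following Python sketch (hypothetical function names,
no validation of trail bytes) shows the length rule and how to spot
such bytes.

```python
from typing import List


def sjis_char_len(lead: int) -> int:
    """Return the byte length of a Shift JIS character from its first byte."""
    if lead < 0x80:
        return 1  # ASCII range
    if 0xA1 <= lead <= 0xDF:
        return 1  # single-byte JIS X 0201 katakana
    return 2      # lead byte of a two-byte character


def ascii_range_trail_bytes(data: bytes) -> List[int]:
    """Return offsets of trail bytes whose value is below 0x80."""
    offsets = []
    i = 0
    while i < len(data):
        n = sjis_char_len(data[i])
        if n == 2 and i + 1 < len(data) and data[i + 1] < 0x80:
            offsets.append(i + 1)
        i += n
    return offsets


# 0x83 0x5C is the Shift JIS byte pair for katakana "SO";
# its trail byte has the same value as an ASCII backslash.
print(ascii_range_trail_bytes(b"\x83\x5c"))
```

Code that scans byte by byte for ASCII characters can misread such a
trail byte, which is why the distinction matters.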

3) UTF-8

Actually, this is already supported by recent versions of PostgreSQL,
and no additional work is required.

o About encoding conversion

I will add encoding conversions among EUC_JIS_2004, SHIFT_JIS_2004 and
UTF-8.

Included are patches against CVS head, which should illustrate what
I'm proposing in detail. If there's no objection, I will commit them
along with documentation changes and regression test updates, and bump
the catalog version.

After that I will develop the conversion part (it will take several days).

Comments and suggestions are welcome.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

Re: Proposal: Adding JIS X 0213 support

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
> I would like to propose adding new character set "JIS X
> 0213"(http://en.wikipedia.org/wiki/JIS_X_0213).
> ...
> Note that since encoding scheme of EUC_JIS_2004 is exactly identical
> to EUC_JP, we can reuse existing encoding routines defined in
> utils/mb/*.c.

I'm confused.  If this is exactly the same as EUC_JP, why do we need
any new code at all?  Why not just a documentation addition saying
they are the same thing?  Or maybe rename EUC_JP to reflect the new
standard number (we've certainly renamed encodings before).
        regards, tom lane


Re: Proposal: Adding JIS X 0213 support

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <ishii@postgresql.org> writes:
> > I would like to propose adding new character set "JIS X
> > 0213"(http://en.wikipedia.org/wiki/JIS_X_0213).
> > ...
> > Note that since encoding scheme of EUC_JIS_2004 is exactly identical
> > to EUC_JP, we can reuse existing encoding routines defined in
> > utils/mb/*.c.
> 
> I'm confused.  If this is exactly the same as EUC_JP, why do we need
> any new code at all?  Why not just a documentation addition saying
> they are the same thing?  Or maybe rename EUC_JP to reflect the new
> standard number (we've certainly renamed encodings before).

I said the *encoding scheme* is the same, not that the contents
(character set) are the same. In other words, the characters included
in EUC_JP are not the same as those in EUC_JIS_2004.

Also, EUC_JIS_2004 is *not* a superset of EUC_JP. So we need to let
EUC_JP and EUC_JIS_2004 coexist.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Proposal: Adding JIS X 0213 support

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
>> I'm confused.  If this is exactly the same as EUC_JP, why do we need
>> any new code at all?

> I said *encoding scheme* is same, not the contents (character set) is
> same. In another word, characters included in EUC_JP are not same as
> EUC_JIS_2004.

I'm still confused.  If the set of characters is different, then surely
we need at least a different UTF8<->EUC_JIS_2004 conversion function?
        regards, tom lane


Re: Proposal: Adding JIS X 0213 support

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <ishii@postgresql.org> writes:
> >> I'm confused.  If this is exactly the same as EUC_JP, why do we need
> >> any new code at all?
> 
> > I said *encoding scheme* is same, not the contents (character set) is
> > same. In another word, characters included in EUC_JP are not same as
> > EUC_JIS_2004.
> 
> I'm still confused.  If the set of characters is different, then surely
> we need at least a different UTF8<->EUC_JIS_2004 conversion function?

Yes, exactly. I will come up with new conversions later.

> After that I will develop conversion part(it will take several days).
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Proposal: Adding JIS X 0213 support

From
Tatsuo Ishii
Date:
> > Tatsuo Ishii <ishii@postgresql.org> writes:
> > >> I'm confused.  If this is exactly the same as EUC_JP, why do we need
> > >> any new code at all?
> > 
> > > I said *encoding scheme* is same, not the contents (character set) is
> > > same. In another word, characters included in EUC_JP are not same as
> > > EUC_JIS_2004.
> > 
> > I'm still confused.  If the set of characters is different, then surely
> > we need at least a different UTF8<->EUC_JIS_2004 conversion function?
> 
> Yes, exactly. I will come up with new conversions later.

I have committed changes to add JIS X 0213 along with conversions.

New encodings:

EUC_JIS_2004:    JIS X 0213 encoded in EUC
SHIFT_JIS_2004:    JIS X 0213 encoded in Shift JIS (client-only encoding)

These encodings support the following character sets:

ASCII, JIS X 0201 (single byte "katakana"), JIS X 0213 plane 1, 2

New conversions:

EUC_JIS_2004 --> UTF8: euc_jis_2004_to_utf8
UTF8 --> EUC_JIS_2004: utf8_to_euc_jis_2004
SHIFT_JIS_2004 --> UTF8: shift_jis_2004_to_utf8
UTF8 --> SHIFT_JIS_2004: utf8_to_shift_jis_2004
EUC_JIS_2004 --> SHIFT_JIS_2004: euc_jis_2004_to_shift_jis_2004
SHIFT_JIS_2004 --> EUC_JIS_2004: shift_jis_2004_to_euc_jis_2004

To generate the conversion maps, I have created two Perl scripts,
UCS_to_SHIFT_JIS_2004.pl and UCS_to_EUC_JIS_2004.pl, which use
sjis-0213-2004-std.txt and euc-jis-2004-std.txt as the source of the
conversion specification. These files can be freely obtained from
http://x0213.org.
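
As a rough idea of what such a script has to do, here is a Python
sketch (not the Perl scripts themselves). It assumes each mapping line
has the shape "0x<code> U+<code point>[+<code point>] # comment"; the
regular expression and the function name are hypothetical, and the
sample lines in the usage below are written by hand for illustration,
not copied from the files.

```python
import re
from typing import Optional, Tuple

# Assumed shape: local code in hex, then one or two Unicode code points.
LINE_RE = re.compile(
    r"^0x([0-9A-Fa-f]+)\s+U\+([0-9A-Fa-f]+)(?:\+([0-9A-Fa-f]+))?"
)


def parse_line(line: str) -> Optional[Tuple[int, Tuple[int, ...]]]:
    """Parse one mapping line into (local code, Unicode code points)."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # comment, blank line, or a code with no mapping
    local = int(m.group(1), 16)
    ucs = tuple(int(g, 16) for g in m.groups()[1:] if g is not None)
    return local, ucs


# A simple entry and an entry that maps to two code points.
print(parse_line("0xA4A2\tU+3042\t# HIRAGANA LETTER A"))
print(parse_line("0xA4F7\tU+304B+309A\t# combined"))
```

Entries with one code point would go into the ordinary map, and
entries with two into a separate map for combined characters.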

Conversions to UTF-8 from EUC_JIS_2004 and SHIFT_JIS_2004
require supporting UTF-8 "combined characters", i.e. a logical
character that consists of two UTF-8 characters. To implement this, I
have modified LocalToUtf() and UtfToLocal() by adding a new parameter:
the "combined character map".

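The idea of a separate combined character map can be sketched in
Python as follows. This is an illustration only: it does not mirror the
actual signatures of LocalToUtf() or UtfToLocal(), the function and
table names are hypothetical, and the tables are tiny hand-written
samples that work on code points rather than UTF-8 bytes.

```python
from typing import Dict, List, Tuple

# Toy tables for illustration (EUC codes on the left).
LOCAL_TO_UCS: Dict[int, int] = {0xA4A2: 0x3042, 0xA4AB: 0x304B}
LOCAL_TO_UCS_COMBINED: Dict[int, Tuple[int, int]] = {
    0xA4F7: (0x304B, 0x309A),  # one local character, two code points
}

# Reverse tables, derived from the forward ones.
UCS_TO_LOCAL = {u: l for l, u in LOCAL_TO_UCS.items()}
UCS_TO_LOCAL_COMBINED = {p: l for l, p in LOCAL_TO_UCS_COMBINED.items()}


def local_to_ucs(codes: List[int]) -> List[int]:
    """Convert local codes to code points, expanding combined entries."""
    out: List[int] = []
    for code in codes:
        if code in LOCAL_TO_UCS_COMBINED:
            out.extend(LOCAL_TO_UCS_COMBINED[code])
        else:
            out.append(LOCAL_TO_UCS[code])  # KeyError if unmapped
    return out


def ucs_to_local(points: List[int]) -> List[int]:
    """Convert code points to local codes, trying two-point pairs first."""
    out: List[int] = []
    i = 0
    while i < len(points):
        pair = tuple(points[i:i + 2])
        if pair in UCS_TO_LOCAL_COMBINED:
            out.append(UCS_TO_LOCAL_COMBINED[pair])
            i += 2
        else:
            out.append(UCS_TO_LOCAL[points[i]])  # KeyError if unmapped
            i += 1
    return out
```

The reverse direction needs one code point of lookahead, because the
first code point of a pair may also be a valid character on its own.
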
Documentation changes and regression test changes are committed too.

Beware that I have updated the catalog version. Please do initdb.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Proposal: Adding JIS X 0213 support

From
Josh Berkus
Date:
Tatsuo,

Related to this, when are we going to get the Japanese po files in the 
core distribution?

--Josh


Re: Proposal: Adding JIS X 0213 support

From
Tatsuo Ishii
Date:
> Tatsuo,
> 
> Related to this, when are we going to get the Japanese po files in the 
> core distribution?

No idea. In my understanding, the current message translation system
has a serious problem if the wrong locale and encoding are provided (has
this issue been solved in 8.3?). AFAIK Hiroki Kataoka, chairman of JPUG,
has the same impression. The Japanese po files are managed by JPUG, and
it would be better to ask him or someone from JPUG who is responsible
for the Japanese po files.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Proposal: Adding JIS X 0213 support

From
Tom Lane
Date:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
>> Related to this, when are we going to get the Japanese po files in the 
>> core distribution?

> No idea. In my understanding, current message translating system has
> serious problem if wrong locale and encoding is provided(has this
> issue been solved in 8.3?).

That's certainly true, and it's not solved.  But how does keeping the
Japanese po files out of the distribution improve the matter?
        regards, tom lane


Re: Proposal: Adding JIS X 0213 support

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> >> Related to this, when are we going to get the Japanese po files in the 
> >> core distribution?
> 
> > No idea. In my understanding, current message translating system has
> > serious problem if wrong locale and encoding is provided(has this
> > issue been solved in 8.3?).
> 
> That's certainly true, and it's not solved.  But how does keeping the
> Japanese po files out of the distribution improve the matter?

Keeping the po files out until the problem is solved is just my opinion.

If JPUG (or the Japanese po file maintainers/volunteers) decide to
include them in the PostgreSQL distribution, I have no right to prevent
it.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Proposal: Adding JIS X 0213 support

From
"Hiroshi Saito"
Date:
Hi.

----- Original Message ----- 
From: "Tom Lane" <tgl@sss.pgh.pa.us>


> Tatsuo Ishii <ishii@sraoss.co.jp> writes:
>>> Related to this, when are we going to get the Japanese po files in the 
>>> core distribution?
> 
>> No idea. In my understanding, current message translating system has
>> serious problem if wrong locale and encoding is provided(has this
>> issue been solved in 8.3?).
> 
> That's certainly true, and it's not solved.  But how does keeping the
> Japanese po files out of the distribution improve the matter?

We provide the support, including handling the trouble. JPUG was thought
to be the preferable place for the po files because the problems were
too peculiar to Japan. The support arrangement led by Honda-san, the
representative of the document team, has worked well enough up to now.
However, that does not mean we refuse distribution with the main body.
The document team should discuss it again, because the argument for
matching the release schedule of the main body is becoming stronger.

Anyway, please wait a while for the response from Honda-san.

Regards,
Hiroshi Saito



Re: Proposal: Adding JIS X 0213 support

From
Josh Berkus
Date:
Hiroshi,

> We are doing the support including the trouble. It was thought that the
> place of JPUG was preferable for the reasons why they were problems too
> peculiar to Japan.

Well, some of PostgreSQL's commercial distributors have been pretty 
surprised when they package PostgreSQL and find out that the main 
distribution has no Japanese support (I know because I get the confused 
emails).  I've an open offer from the Sun i18N people to help with this, 
if they can coordinate with you.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Proposal: Adding JIS X 0213 support

From
"Hiroshi Saito"
Date:
Hi Josh-san.

From: "Josh Berkus" <josh@agliodbs.com>


> Hiroshi,
> 
>> We are doing the support including the trouble. It was thought that the
>> place of JPUG was preferable for the reasons why they were problems too
>> peculiar to Japan.
> 
> Well, some of PostgreSQL's commercial distributors have been pretty 
> surprised when they package PostgreSQL and find out that the main 
> distribution has no Japanese support (I know because I get the confused 
> emails).  I've an open offer from the Sun i18N people to help with this, 
> if they can coordinate with you.

Ahh yes, certainly, an offer from Sun Japan. :-)
At that time, the support was done as volunteer work with Honda-san. It
seemed wonderful that Solaris promoted the spread of PostgreSQL!
The resources are offered in the place that JPUG has opened to the public.
I think Sun Japan obtained from there the know-how to avoid the
problem. We make the utmost effort to support the resources, even though
we are volunteers. The satisfactory results prove it!
Maybe the problem is the release speed... However, it might be the same
even if they were put in the official place...

Regards,
Hiroshi Saito


Re: Proposal: Adding JIS X 0213 support

From
Hiroki Kataoka
Date:
Tatsuo Ishii wrote:
>>>> Related to this, when are we going to get the Japanese po files in the 
>>>> core distribution?
>>> No idea. In my understanding, current message translating system has
>>> serious problem if wrong locale and encoding is provided(has this
>>> issue been solved in 8.3?).
>> That's certainly true, and it's not solved.  But how does keeping the
>> Japanese po files out of the distribution improve the matter?
> 
> Keeping out po files until the problem is solved is just my opinion.

Regrettably, I am of the same opinion.  Including the Japanese po files
without some improvement would cause unnecessary trouble.

-- 
Hiroki Kataoka <kataoka@interwiz.jp>