Thread: Re: [PATCHES] Postgres-6.3.2 locale patch (fwd)

Re: [PATCHES] Postgres-6.3.2 locale patch (fwd)

From: Peter Mount
Date:
On Thu, 4 Jun 1998, Thomas G. Lockhart wrote:

> Hi. I'm looking for non-English-using Postgres hackers to participate in
> implementing NCHAR() and alternate character sets in Postgres. I think
> I've worked out how to do the implementation (not the details, just a
> strategy) so that multiple character sets will be allowed in a single
> database, additional character sets can be loaded at run-time, and so
> that everything will behave transparently.

Ok, I'm English, but I'll keep a close eye on this topic as the JDBC
driver has two methods that handle Unicode strings.

Currently, they simply call the Ascii/Binary methods. But they could
(when the column's type is NCHAR/NVARCHAR/CHARACTER SET) handle the
translation between the character set and Unicode.

> I would propose to do this for v6.4 as user-defined packages (with
> compile-time parser support) on top of the existing USE_LOCALE and MB
> patches so that the existing compile-time options are not changed or
> damaged.

In the same vein, to get JDBC up to speed with this, we may need a
function on the backend that handles the translation between the
encoding and Unicode. This would allow the JDBC driver to handle a new
character set automatically, without having to write a class for each
package.

--
Peter Mount, peter@maidstone.gov.uk
Postgres email to peter@taer.maidstone.gov.uk & peter@retep.org.uk
Remember, this is my work email, so please CC my home address, as I may
not always have time to reply from work.



Re: [HACKERS] Re: [PATCHES] Postgres-6.3.2 locale patch (fwd)

From: t-ishii@sra.co.jp
Date:
>In the same vein, to get JDBC up to speed with this, we may need a
>function on the backend that handles the translation between the
>encoding and Unicode. This would allow the JDBC driver to handle a new
>character set automatically, without having to write a class for each
>package.

I already have a patch to handle the translation on the backend
between the encoding and SJIS (yet another encoding for Japanese).
Translations to other encodings such as Big5 (Chinese) and Unicode are
planned.

The biggest problem with Unicode is that the translation is not
symmetrical. Going from an encoding to Unicode is fine. However, Unicode
to an encoding is one-to-many. The reason for that is "Unification": a
code point in Unicode might correspond to Chinese, Japanese or Korean.
To resolve that, we need additional information about what language we
are using. Too bad. Any ideas?
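The asymmetry is easy to see with a few lines of Python, using its codec
tables as a stand-in for the backend's translation tables (an
illustration only, not Postgres or JDBC code):

```python
# Encoding -> Unicode is unambiguous: a valid Shift JIS byte string has
# exactly one Unicode reading.
sjis = "直".encode("shift_jis")
assert sjis.decode("shift_jis") == "直"

# Unicode -> encoding needs to know the target language. Because of Han
# unification the single code point U+76F4 serves Japanese and Chinese
# text alike, and each candidate encoding gives it different bytes:
assert "直".encode("shift_jis") != "直".encode("big5")

# ...and some code points have no image at all in a given encoding
# (one-to-none): Hangul is absent from JIS X 0208 / Shift JIS.
try:
    "한".encode("shift_jis")
except UnicodeEncodeError:
    print("U+D55C has no Shift JIS mapping")
```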
---
Tatsuo Ishii
t-ishii@sra.co.jp

Re: [HACKERS] Re: [PATCHES] Postgres-6.3.2 locale patch (fwd)

From: Peter Mount
Date:
On Thu, 4 Jun 1998 t-ishii@sra.co.jp wrote:

> >In the same vein, to get JDBC up to speed with this, we may need a
> >function on the backend that handles the translation between the
> >encoding and Unicode. This would allow the JDBC driver to handle a new
> >character set automatically, without having to write a class for each
> >package.
>
> I already have a patch to handle the translation on the backend
> between the encoding and SJIS (yet another encoding for Japanese).
> Translations to other encodings such as Big5 (Chinese) and Unicode are
> planned.
>
> The biggest problem with Unicode is that the translation is not
> symmetrical. Going from an encoding to Unicode is fine. However, Unicode
> to an encoding is one-to-many. The reason for that is "Unification": a
> code point in Unicode might correspond to Chinese, Japanese or Korean.
> To resolve that, we need additional information about what language we
> are using. Too bad. Any ideas?

I'm not sure. I brought this up as it's something that I feel should be
done somewhere in the backend, rather than in the clients, and should be
thought about at this stage.

I was thinking along the lines of a function that handles the translation
between any two given encodings (i.e. it is told the initial and final
encodings) and returns the translated string (be it single- or
multi-byte). It could then throw an error if the translation between the
two encodings is not possible, or (optionally) when part of the
translation would fail.

Also, having this in the backend would give all the interfaces access to
international encodings without too much work. Adding a new encoding could
then be done on the server alone (say by adding a module), without having
to recompile/relink everything else.
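To make the idea concrete, here is a minimal sketch in Python rather than
backend C; the name `recode` and the `lossy` flag are my own invention,
assuming the backend would pivot through Unicode:

```python
def recode(data: bytes, src: str, dst: str, lossy: bool = False) -> bytes:
    """Translate `data` from encoding `src` to encoding `dst`,
    pivoting through Unicode."""
    text = data.decode(src)  # src encoding -> Unicode
    # strict: raise if any character has no mapping in dst;
    # lossy: substitute a replacement character instead
    return text.encode(dst, errors="replace" if lossy else "strict")

# EUC-JP and Shift JIS cover the same repertoire, so this round-trips:
euc = "漢字".encode("euc_jp")
assert recode(euc, "euc_jp", "shift_jis") == "漢字".encode("shift_jis")
```

A strict call raises when the target cannot represent a character (e.g.
Hangul into SJIS), while `lossy=True` substitutes "?" instead, matching
the "optionally, part of the translation may fail" behaviour.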

--
Peter Mount, peter@maidstone.gov.uk
Postgres email to peter@taer.maidstone.gov.uk & peter@retep.org.uk
Remember, this is my work email, so please CC my home address, as I may
not always have time to reply from work.



Re: [HACKERS] Re: [PATCHES] Postgres-6.3.2 locale patch (fwd)

From: dg@illustra.com (David Gould)
Date:
Someone whose headers I am too lazy to retrieve wrote:
> On Thu, 4 Jun 1998, Thomas G. Lockhart wrote:
>
> > Hi. I'm looking for non-English-using Postgres hackers to participate in
> > implementing NCHAR() and alternate character sets in Postgres. I think
...
> Currently, they simply call the Ascii/Binary methods. But they could
> (when the column's type is NCHAR/NVARCHAR/CHARACTER SET) handle the
> translation between the character set and Unicode.
>
> > I would propose to do this for v6.4 as user-defined packages (with
> > compile-time parser support) on top of the existing USE_LOCALE and MB
> > patches so that the existing compile-time options are not changed or
> > damaged.
>
> In the same vein, to get JDBC up to speed with this, we may need a
> function on the backend that handles the translation between the
> encoding and Unicode. This would allow the JDBC driver to handle a new
> character set automatically, without having to write a class for each
> package.

Just an observation or two on the topic of internationalization:

Illustra went to Unicode internally. This allowed things like kanji table
names etc. It worked, but it was very costly in terms of work, bugs, and
especially performance, although we eventually got most of the
performance back.

Then we created encodings (character set, sort order, error messages etc.)
for a bunch of languages. Then we made 8-bit chars convert to Unicode and
assumed 7-bit chars were 7-bit ASCII.

This worked and was in some sense "the right thing to do".
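The 7-bit/8-bit rule described above might have looked like this (a
hypothetical reconstruction of the heuristic as described, not Illustra's
actual code; Latin-1 stands in for whatever encoding a site installed):

```python
def to_unicode(raw: bytes, installed_encoding: str = "latin-1") -> str:
    # Pure 7-bit data is assumed to be plain ASCII and passes through;
    # anything containing an 8-bit byte is interpreted via the site's
    # installed encoding and converted to Unicode.
    if all(b < 0x80 for b in raw):
        return raw.decode("ascii")
    return raw.decode(installed_encoding)
```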

But the European customers hated it. Before, when we were "plain ole
Amuricans, don't hold with this furrin stuff", we ignored 8-bit vs 7-bit
issues, and the Europeans were free to stick in any characters they
wanted and get them out unchanged, and it was just as fast as anything
else.

When we changed to Unicode and 7-vs-8-bit sensitivity, it forced everyone
to install an encoding and store their data in Unicode. Needless to say,
customers in e.g. Germany did not want to double their disk space and give
up performance to do something only a little better than what they could
do already.

Ultimately, we backed it out and allowed 8-bit chars again. You could still
get Unicode, but outside of Asian sites it was not widely used, and even in
Asia it was not universally popular.

Bottom line: I am not opposed to internationalization. But it is even
harder than it looks, and some of the "correct" technical solutions turn
out to be pretty annoying in the real world.

So, having it as an add-on is fine. Providing support in the core is fine
too. An incremental approach of perhaps adding sort orders for 8-bit
character sets today and something else next release might be OK. But be
very, very careful: do not assume that the "popular" solutions are usable,
and do not try to solve the "whole" problem in one grand effort.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"And there _is_ a real world. In fact, some of you
 are in it right now."         -- Gene Spafford

Re: [HACKERS] Re: [PATCHES] Postgres-6.3.2 locale patch (fwd)

From: Satoshi Kinoshita
Date:
> The biggest problem with Unicode is that the translation is not
> symmetrical. Going from an encoding to Unicode is fine. However, Unicode
> to an encoding is one-to-many. The reason for that is "Unification": a
> code point in Unicode might correspond to Chinese, Japanese or Korean.
> To resolve that, we need additional information about what language we
> are using. Too bad. Any ideas?

It does not seem that bad for the translation from Unicode to Japanese
EUC (or SJIS or Big5), because Japanese EUC (or SJIS) has only Japanese
characters and Big5 has only Chinese characters (considering only CJK).
Right?
It would be virtually one-to-one, or one-to-none, when translating from
Unicode to these monolingual encodings.
It would not be that simple, however, to translate from Unicode to
another multilingual encoding (like the ISO-2022-based Mule encoding?).

Kinoshita

Re: [HACKERS] Re: [PATCHES] Postgres-6.3.2 locale patch (fwd)

From: t-ishii@sra.co.jp
Date:
>> The biggest problem with Unicode is that the translation is not
>> symmetrical. Going from an encoding to Unicode is fine. However,
>> Unicode to an encoding is one-to-many. The reason for that is
>> "Unification": a code point in Unicode might correspond to Chinese,
>> Japanese or Korean. To resolve that, we need additional information
>> about what language we are using. Too bad. Any ideas?
>
>It does not seem that bad for the translation from Unicode to Japanese
>EUC (or SJIS or Big5), because Japanese EUC (or SJIS) has only Japanese
>characters and Big5 has only Chinese characters (considering only CJK).
>Right?
>It would be virtually one-to-one, or one-to-none, when translating from
>Unicode to these monolingual encodings.

Oh, I was wrong. We already have the information about "what language
we are using" when we try to make a translation between Unicode and
Japanese EUC :-)

>It would not be that simple, however, to translate from Unicode to
>another multilingual encoding (like the ISO-2022-based Mule encoding?).

Correct.
--
Tatsuo Ishii
t-ishii@sra.co.jp