Re: [HACKERS] Re: [PATCHES] Postgres-6.3.2 locale patch (fwd) - Mailing list pgsql-hackers

From dg@illustra.com (David Gould)
Subject Re: [HACKERS] Re: [PATCHES] Postgres-6.3.2 locale patch (fwd)
Date
Msg-id 9806050540.AA04709@hawk.illustra.com
In response to Re: [PATCHES] Postgres-6.3.2 locale patch (fwd)  (Peter Mount <peter@taer.maidstone.gov.uk>)
List pgsql-hackers
Someone whose headers I am too lazy to retrieve wrote:
> On Thu, 4 Jun 1998, Thomas G. Lockhart wrote:
>
> > Hi. I'm looking for non-English-using Postgres hackers to participate in
> > implementing NCHAR() and alternate character sets in Postgres. I think
...
> Currently, they simply call the Ascii/Binary methods. But they could (when
> NCHAR/NVARCHAR/CHARACTER SET is the columns type) handle the translation
> between the character set and Unicode.
>
> > I would propose to do this for v6.4 as user-defined packages (with
> > compile-time parser support) on top of the existing USE_LOCALE and MB
> > patches so that the existing compile-time options are not changed or
> > damaged.
>
> In a same vein, for getting JDBC up to speed with this, we may need to
> have a function on the backend that will handle the translation between
> the encoding and Unicode. This would allow the JDBC driver to
> automatically handle a new character set without having to write a class
> for each package.

Just an observation or two on the topic of internationalization:

Illustra went to Unicode internally. This allowed things like kanji table
names etc. It worked, but it was very costly in terms of work, bugs, and
especially performance, although we eventually got most of that back.

Then we created encodings (char set, sort order, error messages etc) for
a bunch of languages. Then we made 8-bit chars convert to Unicode and
assumed 7-bit chars were in 7-bit ASCII.
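
A minimal sketch of that 7-bit/8-bit rule, for illustration only. The
function name to_unicode and the use of Python's codec names are
assumptions, not anything from Illustra; the point is just that a byte
with the high bit set cannot be interpreted without an installed encoding.

```python
def to_unicode(raw: bytes, encoding: str = "latin-1") -> str:
    """Hypothetical helper: 7-bit bytes are taken as ASCII, 8-bit bytes
    are converted through the installed encoding."""
    out = []
    for b in raw:
        if b < 0x80:
            # 7-bit: assumed to be plain ASCII, no encoding needed
            out.append(chr(b))
        else:
            # 8-bit: meaning depends entirely on the installed encoding
            out.append(bytes([b]).decode(encoding))
    return "".join(out)


# The same stored byte 0xA3 yields different characters per encoding.
pound = to_unicode(b"\xa3", "latin-1")
l_stroke = to_unicode(b"\xa3", "iso8859_2")
```

With no encoding installed there is no right answer for 0xA3, which is why
this scheme makes an encoding mandatory for anyone using 8-bit data.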

This worked and was in some sense "the right thing to do".

But the European customers hated it. Before, when we were "plain ole
Amuricans, don't hold with this furrin stuff", we ignored 8 vs 7 bit
issues and the Europeans were free to stick any characters they wanted
in and get them out unchanged, and it was just as fast as anything else.

When we changed to Unicode and 7 vs 8 bit sensitivity, it forced everyone
to install an encoding and store their data in Unicode. Needless to say,
customers in e.g. Germany did not want to double their disk space and give
up performance to do something only a little better than they could do
already.
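
To make the disk space point concrete, here is a rough sketch. It assumes
the internal Unicode form was a fixed 16 bits per character (UCS-2 style),
which is not stated above, and uses Python's "utf-16-le" codec as a
stand-in for it. The helper name storage_bytes is made up for the example.

```python
def storage_bytes(text: str) -> tuple[int, int]:
    """Return (bytes as 8-bit Latin-1, bytes as 16-bit Unicode).

    Assumption: "utf-16-le" stands in for a fixed two-byte-per-character
    internal form; this holds for the Latin-1 range used here.
    """
    legacy = text.encode("latin-1")     # one byte per character
    wide = text.encode("utf-16-le")     # two bytes per character here
    return len(legacy), len(wide)


legacy_size, wide_size = storage_bytes("Straße in München")
```

For text that already fits an 8-bit character set, the wide form is exactly
twice the size and buys nothing the customer could not already store.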

Ultimately, we backed it out and allowed 8-bit chars again. You could still
get Unicode, but except for Asian sites it was not widely used, and even in
Asia it was not universally popular.

Bottom line, I am not opposed to internationalization. But, it is harder
even than it looks. And some of the "correct" technical solutions turn
out to be pretty annoying in the real world.

So, having it as an add-on is fine. Providing support in the core is fine
too. An incremental approach of perhaps adding sort orders for 8-bit char
sets today and something else next release might be ok. But be very, very
careful: do not assume that the "popular" solutions are usable, and do not
try to solve the "whole" problem in one grand effort.
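
As one possible reading of "sort orders for 8-bit char sets", the sketch
below keeps the data as raw Latin-1 bytes and only changes how they compare.
The folding table and the name latin1_sort_key are hypothetical and cover
just a few German letters; a real sort order would need a full table per
character set.

```python
# Hypothetical folding table: Latin-1 byte -> byte it should sort with.
FOLD = {
    0xE4: ord("a"), 0xC4: ord("a"),  # a-umlaut, lower and upper
    0xF6: ord("o"), 0xD6: ord("o"),  # o-umlaut
    0xFC: ord("u"), 0xDC: ord("u"),  # u-umlaut
    0xDF: ord("s"),                  # sharp s
}


def latin1_sort_key(raw: bytes) -> tuple[bytes, bytes]:
    """Sort key for Latin-1 bytes; the stored bytes are never modified."""
    # bytes.lower() only folds ASCII letters, so 8-bit bytes go via FOLD.
    folded = bytes(FOLD.get(b, b) for b in raw.lower())
    return folded, raw  # raw bytes break ties deterministically


words = [b"Zebra", b"\xe4rger", b"Apfel"]
bytewise = sorted(words)
collated = sorted(words, key=latin1_sort_key)
```

Plain byte comparison puts the umlaut word after "Zebra"; with the key it
lands next to "Apfel", and storage and pass-through are left untouched.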

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"And there _is_ a real world. In fact, some of you
 are in it right now."         -- Gene Spafford
