Thread: Re: [PATCHES] Postgres-6.3.2 locale patch (fwd)
On Thu, 4 Jun 1998, Thomas G. Lockhart wrote:

> Hi. I'm looking for non-English-using Postgres hackers to participate in
> implementing NCHAR() and alternate character sets in Postgres. I think
> I've worked out how to do the implementation (not the details, just a
> strategy) so that multiple character sets will be allowed in a single
> database, additional character sets can be loaded at run-time, and so
> that everything will behave transparently.

Ok, I'm English, but I'll keep a close eye on this topic, as the JDBC
driver has two methods that handle Unicode strings. Currently, they simply
call the Ascii/Binary methods. But they could (when
NCHAR/NVARCHAR/CHARACTER SET is the column's type) handle the translation
between the character set and Unicode.

> I would propose to do this for v6.4 as user-defined packages (with
> compile-time parser support) on top of the existing USE_LOCALE and MB
> patches so that the existing compile-time options are not changed or
> damaged.

In the same vein, for getting JDBC up to speed with this, we may need to
have a function on the backend that will handle the translation between
the encoding and Unicode. This would allow the JDBC driver to
automatically handle a new character set without having to write a class
for each package.

--
Peter Mount, peter@maidstone.gov.uk
Postgres email to peter@taer.maidstone.gov.uk & peter@retep.org.uk
Remember, this is my work email, so please CC my home address, as I may
not always have time to reply from work.
>In the same vein, for getting JDBC up to speed with this, we may need to
>have a function on the backend that will handle the translation between
>the encoding and Unicode. This would allow the JDBC driver to
>automatically handle a new character set without having to write a class
>for each package.

I already have a patch to handle the translation on the backend between
the encoding and SJIS (yet another encoding for Japanese). Translation for
other encodings such as Big5 (Chinese) and Unicode is in my plan.

The biggest problem for Unicode is that the translation is not
symmetrical. Going from an encoding to Unicode is OK; however, Unicode to
an encoding is one-to-many. The reason for that is "Unification": a code
point of Unicode might correspond to either Chinese, Japanese or Korean.
To determine that, we need additional information about what language we
are using. Too bad. Any ideas?
---
Tatsuo Ishii
t-ishii@sra.co.jp
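To make the asymmetry concrete, here is a minimal, self-contained C
sketch. The names, the enum and the two mapping values are purely
illustrative and are not taken from any of the patches being discussed;
the point is that local-to-Unicode is a plain table lookup, while
Unicode-to-local also has to be told which national character set is
wanted:

/*
 * Illustrative only: the two directions are not symmetric.  A local
 * encoding -> Unicode is a straight table lookup, but Unicode -> local
 * has to know which national character set is the target, because
 * unification folds the CJK repertoires onto shared code points.
 */
#include <stdio.h>

typedef enum { TARGET_EUC_JP, TARGET_BIG5 } TargetEncoding;

/* Toy reverse map for a single code point, U+4E00 ("one"); a real table
 * would cover the whole repertoire. */
static unsigned int
unicode_to_local(unsigned int ucs, TargetEncoding target)
{
    if (ucs == 0x4E00)
        return (target == TARGET_EUC_JP) ? 0xB0EC : 0xA440;
    return 0;                   /* 0 = no mapping in this toy table */
}

int
main(void)
{
    printf("U+4E00 -> EUC_JP 0x%04X, Big5 0x%04X\n",
           unicode_to_local(0x4E00, TARGET_EUC_JP),
           unicode_to_local(0x4E00, TARGET_BIG5));
    return 0;
}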
On Thu, 4 Jun 1998 t-ishii@sra.co.jp wrote:

> >In the same vein, for getting JDBC up to speed with this, we may need to
> >have a function on the backend that will handle the translation between
> >the encoding and Unicode. This would allow the JDBC driver to
> >automatically handle a new character set without having to write a class
> >for each package.
>
> I already have a patch to handle the translation on the backend
> between the encoding and SJIS (yet another encoding for Japanese).
> Translation for other encodings such as Big5 (Chinese) and Unicode is
> in my plan.
>
> The biggest problem for Unicode is that the translation is not
> symmetrical. Going from an encoding to Unicode is OK; however, Unicode
> to an encoding is one-to-many. The reason for that is "Unification": a
> code point of Unicode might correspond to either Chinese, Japanese or
> Korean. To determine that, we need additional information about what
> language we are using. Too bad. Any ideas?

I'm not sure. I brought this up as it's something that I feel should be
done somewhere in the backend, rather than in the clients, and should be
thought about at this stage.

I was thinking along the lines of a function that handles the translation
between any two given encodings (i.e. it's told what the initial and final
encodings are) and returns the translated string (be it single- or
multi-byte). It could then throw an error if the translation between the
two encodings is not possible, or (optionally) if part of the translation
would fail.

Also, having this in the backend would give all the interfaces access to
international encodings without too much work. Adding a new encoding could
then be done just on the server (say, by adding a module), without having
to recompile/relink everything else.

--
Peter Mount, peter@maidstone.gov.uk
Postgres email to peter@taer.maidstone.gov.uk & peter@retep.org.uk
Remember, this is my work email, so please CC my home address, as I may
not always have time to reply from work.
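A rough sketch, under assumed names, of the kind of generic entry point
being described here; it is not an existing backend API. The caller names
both encodings and gets back either a translated copy or a clean failure
that could be turned into an error:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef enum { ENC_SQL_ASCII, ENC_EUC_JP, ENC_SJIS, ENC_BIG5, ENC_UNICODE } Encoding;

/*
 * Convert srclen bytes of 'src' from one named encoding to another.
 * Returns a freshly allocated, NUL-terminated string, or NULL (with
 * *failed set) when no conversion between the two encodings is available.
 */
char *
convert_encoding(const char *src, int srclen, Encoding from, Encoding to,
                 int *failed)
{
    *failed = 0;

    if (from == to)             /* trivial case: copy through unchanged */
    {
        char *copy = malloc(srclen + 1);

        memcpy(copy, src, srclen);
        copy[srclen] = '\0';
        return copy;
    }

    /*
     * A real implementation would dispatch to a per-pair routine
     * (possibly going via Unicode) registered by a loadable conversion
     * module, and would raise an error for unsupported pairs or for
     * characters that cannot be represented in the target encoding.
     * Here the pair is simply reported as unsupported.
     */
    *failed = 1;
    return NULL;
}

int
main(void)
{
    int failed;
    char *out = convert_encoding("hello", 5, ENC_SQL_ASCII, ENC_SQL_ASCII,
                                 &failed);

    printf("%s (failed=%d)\n", out, failed);
    free(out);
    return 0;
}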
Re: [HACKERS] Re: [PATCHES] Postgres-6.3.2 locale patch (fwd)
From: dg@illustra.com (David Gould)
Someone whose headers I am too lazy to retrieve wrote:

> On Thu, 4 Jun 1998, Thomas G. Lockhart wrote:
>
> > Hi. I'm looking for non-English-using Postgres hackers to participate in
> > implementing NCHAR() and alternate character sets in Postgres. I think
...
> Currently, they simply call the Ascii/Binary methods. But they could (when
> NCHAR/NVARCHAR/CHARACTER SET is the column's type) handle the translation
> between the character set and Unicode.
>
> > I would propose to do this for v6.4 as user-defined packages (with
> > compile-time parser support) on top of the existing USE_LOCALE and MB
> > patches so that the existing compile-time options are not changed or
> > damaged.
>
> In the same vein, for getting JDBC up to speed with this, we may need to
> have a function on the backend that will handle the translation between
> the encoding and Unicode. This would allow the JDBC driver to
> automatically handle a new character set without having to write a class
> for each package.

Just an observation or two on the topic of internationalization:

Illustra went to Unicode internally. This allowed things like kanji table
names etc. It worked, but it was very costly in terms of work, bugs, and
especially performance, although we eventually got most of it back. Then
we created encodings (char set, sort order, error messages etc.) for a
bunch of languages. Then we made 8-bit chars convert to Unicode and
assumed 7-bit chars were 7-bit ASCII.

This worked and was in some sense "the right thing to do". But the
European customers hated it. Before, when we were "plain ole Amuricans,
don't hold with this furrin stuff", we ignored 8- vs 7-bit issues, and the
Europeans were free to stick any characters they wanted in and get them
out unchanged, and it was just as fast as anything else. When we changed
to Unicode and 7- vs 8-bit sensitivity, it forced everyone to install an
encoding and store their data in Unicode. Needless to say, customers in
e.g. Germany did not want to double their disk space and give up
performance to do something only a little better than what they could do
already.

Ultimately, we backed it out and allowed 8-bit chars again. You could
still get Unicode, but except for Asian sites it was not widely used, and
even in Asia it was not universally popular.

Bottom line: I am not opposed to internationalization. But it is harder
even than it looks, and some of the "correct" technical solutions turn out
to be pretty annoying in the real world. So having it as an add-on is
fine. Providing support in the core is fine too. An incremental approach
of perhaps adding sort orders for 8-bit char sets today and something else
next release might be OK. But be very, very careful, and do not take it on
faith that the "popular" solutions are usable, or try to solve the "whole"
problem in one grand effort.

-dg

David Gould           dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software (No, really)        300 Lakeside Drive  Oakland, CA 94612
"And there _is_ a real world. In fact, some of you are in it right now."
                                                        -- Gene Spafford
> The biggest problem for Unicode is that the translation is not
> symmetrical. Going from an encoding to Unicode is OK; however, Unicode
> to an encoding is one-to-many. The reason for that is "Unification": a
> code point of Unicode might correspond to either Chinese, Japanese or
> Korean. To determine that, we need additional information about what
> language we are using. Too bad. Any ideas?

It seems not that bad for the translation from Unicode to Japanese EUC
(or SJIS or Big5), because Japanese EUC (or SJIS) has only Japanese
characters and Big5 has only Chinese characters (considering only CJK),
right? It would be virtually one-to-one or one-to-none when translating
from Unicode to those mono-lingual encodings.

It would not be that simple, however, to translate from Unicode to
another multi-lingual encoding (like the iso-2022-based Mule encoding?).

Kinoshita
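A small sketch of that "one-to-one or one-to-none" behaviour for a
mono-lingual target; the function name and the single table entry are
hypothetical, and whether to reject or substitute a replacement character
is left to the caller:

#include <stdio.h>

/*
 * Toy reverse map into Big5: each Unicode code point either has exactly
 * one Big5 code or none at all, so there is no ambiguity to resolve --
 * only the "no equivalent" case to handle.
 */
static int
unicode_to_big5(unsigned int ucs, unsigned int *big5)
{
    switch (ucs)
    {
        case 0x4E00:            /* CJK "one" */
            *big5 = 0xA440;
            return 1;
        default:
            return 0;           /* one-to-none */
    }
}

int
main(void)
{
    unsigned int probe[] = {0x4E00, 0x0419};    /* a CJK and a Cyrillic point */
    unsigned int out;
    int i;

    for (i = 0; i < 2; i++)
    {
        if (unicode_to_big5(probe[i], &out))
            printf("U+%04X -> Big5 0x%04X\n", probe[i], out);
        else
            printf("U+%04X has no Big5 equivalent (reject or substitute)\n",
                   probe[i]);
    }
    return 0;
}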
>> The biggest problem for Unicode is that the translation is not
>> symmetrical. Going from an encoding to Unicode is OK; however, Unicode
>> to an encoding is one-to-many. The reason for that is "Unification": a
>> code point of Unicode might correspond to either Chinese, Japanese or
>> Korean. To determine that, we need additional information about what
>> language we are using. Too bad. Any ideas?
>
>It seems not that bad for the translation from Unicode to Japanese EUC
>(or SJIS or Big5), because Japanese EUC (or SJIS) has only Japanese
>characters and Big5 has only Chinese characters (considering only CJK),
>right? It would be virtually one-to-one or one-to-none when translating
>from Unicode to those mono-lingual encodings.

Oh, I was wrong. We already have the information about "what language we
are using" when we make a translation between Unicode and Japanese EUC :-)

>It would not be that simple, however, to translate from Unicode to
>another multi-lingual encoding (like the iso-2022-based Mule encoding?).

Correct.
--
Tatsuo Ishii
t-ishii@sra.co.jp