Thread: Re: Multibyte support in oracle_compat.c
> I found one bug in the file src/backend/utils/adt/oracle_compat.c, and
> your name was there in connection with the multibyte enhancement, so I am
> writing to you.
>
> Functions upper, lower and initcap don't work with UTF-8 data that is
> not made up of Latin letters. At work I build databases for Russian
> users, and when I tried to use the Unicode encoding for a database with
> the Russian alphabet, these functions didn't work. So I wrote some
> patches, because I don't think the problem lies in some shell variable
> like LANG or LC_CTYPE. As I don't know any languages other than Russian
> and English, I wrote a small test (test.tar.gz) only for them. Run it
> before and after patching and feel the difference :). And by the way, do
> the encodings (and corresponding languages) EUC_JP, EUC_CN, EUC_KR and
> EUC_TW have the logical operations upper, lower and initcap?
>
> regards, Eugene.

For EUC_JP, there is no upper, lower or initcap. I'm not sure about other
languages.

> P.S. It doesn't seem bad to me to use lib unicode instead of functions
> like mbtowc, wctomb from stdlib and towupper, towlower from wctype, but
> maybe somebody will find a solution based on them or on another lib?

I'm not sure. What do you think, Peter or other guys who are familiar
with Unicode?

BTW, I don't like your patches. If there's no unicode.h, configure aborts
with:

configure: error: header file <unicode.h> is required for unicode support

which seems unacceptable to me. I suggest you #ifdef out the Unicode
upper, lower and initcap support if libunicode and/or unicode.h are not
found on the system.
--
Tatsuo Ishii

(I have included patches for review purpose)
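A minimal sketch of the stdlib/wctype route mentioned in the P.S., with the libunicode header made optional as suggested above. It is not the submitted patch. The macro HAVE_UNICODE_H and the function name mb_upper are assumptions made for illustration; the thread names neither. The result depends entirely on the LC_CTYPE locale in effect when the function is called.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * HAVE_UNICODE_H is a hypothetical configure-defined macro. Guarding the
 * include keeps the build working on systems without libunicode.
 */
#ifdef HAVE_UNICODE_H
#include <unicode.h>
#endif

/*
 * Uppercase a multibyte string according to the current LC_CTYPE locale.
 * Returns 0 on success, -1 if src is not valid in the locale's encoding,
 * if memory runs out, or if dst is too small.
 * (mb_upper is an illustrative name, not a backend function.)
 */
int
mb_upper(const char *src, char *dst, size_t dstlen)
{
    size_t      nchars;
    size_t      nbytes;
    size_t      i;
    wchar_t    *wbuf;

    assert(src != NULL && dst != NULL);

    /* A NULL destination makes mbstowcs only count the wide characters. */
    nchars = mbstowcs(NULL, src, 0);
    if (nchars == (size_t) -1)
        return -1;

    wbuf = malloc((nchars + 1) * sizeof(wchar_t));
    if (wbuf == NULL)
        return -1;
    mbstowcs(wbuf, src, nchars + 1);

    /* Case conversion happens on whole characters, not on bytes. */
    for (i = 0; i < nchars; i++)
        wbuf[i] = (wchar_t) towupper((wint_t) wbuf[i]);

    /* Measure the multibyte result before writing it out. */
    nbytes = wcstombs(NULL, wbuf, 0);
    if (nbytes == (size_t) -1 || nbytes + 1 > dstlen)
    {
        free(wbuf);
        return -1;
    }
    wcstombs(dst, wbuf, dstlen);
    free(wbuf);
    return 0;
}
```

A caller would normally run setlocale(LC_CTYPE, ...) with a locale whose encoding matches the data (for example one of the ru_RU names listed later in the thread, if the host provides it) before calling mb_upper. In the default "C" locale only ASCII letters are converted.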
Tatsuo Ishii writes:

> > Functions upper, lower and initcap don't work with UTF-8 data

The backend routines use the host OS locales, so look there. On my
machine I have several Russian locales, which seem to address the issue of
character sets:

ru_RU
ru_RU.koi8r
ru_RU.utf8
ru_UA
russian

This is bogus, because the LC_CTYPE choice is cluster-wide and the
encoding choice is database-specific (in other words: it's broken), but
there's nothing we can do about that right now.

> > P.S. It doesn't seem bad to me to use lib unicode instead of functions
> > like mbtowc, wctomb from stdlib and towupper, towlower from wctype
>
> I'm not sure. What do you think, Peter or other guys who are familiar
> with Unicode?

I don't know what that libunicode is, but that shouldn't prevent us from
possibly evaluating it. :-)

Btw., I just happened to think about this very issue over the last few
days. What I would like to attack for the next release is to implement
character classification and conversion using the Unicode tables so we can
cut the LC_CTYPE system locale out of the picture. Perhaps this is what
the poster was thinking of, too.

--
Peter Eisentraut peter_e@gmx.net
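To illustrate what "conversion using the Unicode tables" could look like without any dependence on LC_CTYPE, here is a rough sketch. Everything in it is an assumption for illustration: the names case_range, upper_map, uc_toupper and utf8_upper are invented, and the table is a tiny hand-written excerpt covering ASCII and basic Cyrillic only, not the real Unicode case-mapping data. The decoder does no overlong-sequence checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A contiguous range of lowercase code points and its distance to uppercase. */
typedef struct
{
    uint32_t    first;
    uint32_t    last;
    uint32_t    delta;          /* subtract to get the uppercase code point */
} case_range;

/* Hand-written excerpt only; NOT the full Unicode tables. */
static const case_range upper_map[] = {
    {0x0061, 0x007A, 0x20},     /* Latin a..z */
    {0x0430, 0x044F, 0x20},     /* Cyrillic U+0430..U+044F */
    {0x0450, 0x045F, 0x50},     /* Cyrillic U+0450..U+045F, includes U+0451 */
};

static uint32_t
uc_toupper(uint32_t cp)
{
    size_t      i;

    for (i = 0; i < sizeof(upper_map) / sizeof(upper_map[0]); i++)
        if (cp >= upper_map[i].first && cp <= upper_map[i].last)
            return cp - upper_map[i].delta;
    return cp;                  /* no mapping known: leave unchanged */
}

/* Decode one UTF-8 sequence; returns bytes consumed, 0 if malformed. */
static size_t
utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t      len;
    size_t      i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;

    for (i = 1; i < len; i++)
    {
        if ((s[i] & 0xC0) != 0x80)  /* also stops at the terminating NUL */
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Encode one code point as UTF-8; returns bytes written (1..4). */
static size_t
utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) { out[0] = (unsigned char) cp; return 1; }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}

/* Uppercase a UTF-8 string without consulting the locale. 0 ok, -1 error. */
int
utf8_upper(const char *src, char *dst, size_t dstlen)
{
    const unsigned char *s = (const unsigned char *) src;
    size_t      used = 0;

    assert(src != NULL && dst != NULL);
    while (*s != '\0')
    {
        uint32_t    cp;
        unsigned char buf[4];
        size_t      n = utf8_decode(s, &cp);
        size_t      m;
        size_t      i;

        if (n == 0)
            return -1;          /* malformed input */
        m = utf8_encode(uc_toupper(cp), buf);
        if (used + m + 1 > dstlen)
            return -1;          /* output buffer too small */
        for (i = 0; i < m; i++)
            dst[used++] = (char) buf[i];
        s += n;
    }
    if (used + 1 > dstlen)
        return -1;
    dst[used] = '\0';
    return 0;
}
```

Because the mapping lives in the program's own table, the result is the same whatever locale the cluster was initialized with; the cost is that the table has to be generated from, and kept in step with, the Unicode data files.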
On Thu, 5 Sep 2002, Peter Eisentraut wrote:

> Date: Thu, 5 Sep 2002 00:46:39 +0200 (CEST)
> From: Peter Eisentraut <peter_e@gmx.net>
> To: Tatsuo Ishii <t-ishii@sra.co.jp>
> Cc: pgsql-hackers@postgresql.org, eutm@yandex.ru
> Subject: Re: [HACKERS] Multibyte support in oracle_compat.c
>
> Tatsuo Ishii writes:
>
> > > Functions upper, lower and initcap don't work with UTF-8 data
>
> The backend routines use the host OS locales, so look there. On my
> machine I have several Russian locales, which seem to address the issue of
> character sets:
>
> ru_RU
> ru_RU.koi8r
> ru_RU.utf8
> ru_UA
> russian

Yeah, our character sets are a major pain for internationalization. And the
above list is not exhaustive. I guess you are right, for the time being
you'll have to bear with it.

-s
> The backend routines use the host OS locales, so look there. On my
> machine I have several Russian locales, which seem to address the issue of
> character sets:
>
> ru_RU
> ru_RU.koi8r
> ru_RU.utf8
> ru_UA
> russian
>
> This is bogus, because the LC_CTYPE choice is cluster-wide and the
> encoding choice is database-specific (in other words: it's broken), but
> there's nothing we can do about that right now.

I thought his idea was to use a UTF-8 locale and a Unicode (UTF-8) encoded
database.

> Btw., I just happened to think about this very issue over the last few
> days. What I would like to attack for the next release is to implement
> character classification and conversion using the Unicode tables so we can
> cut the LC_CTYPE system locale out of the picture. Perhaps this is what
> the poster was thinking of, too.

Interesting idea. If you are saying that you are going to remove the
dependency on the system locale, I agree with your idea.

BTW, nls has the same problem as above, no? I guess nls depends on locale
and it may conflict with the database-specific encoding and/or the
automatic FE/BE encoding conversion.
--
Tatsuo Ishii
Tatsuo Ishii writes:

> BTW, nls has the same problem as above, no? I guess nls depends on locale
> and it may conflict with the database-specific encoding and/or the
> automatic FE/BE encoding conversion.

GNU gettext does its own encoding conversion. It reads the program's
character encoding from the LC_CTYPE locale and converts the material in
the translation catalogs on the fly for output. This is great in general,
really, but for the postmaster it's a problem. If LC_CTYPE is fixed for
the cluster and you later on change your mind about the message language,
then it will be recoded into the character set that LC_CTYPE says. And if
that character set does not match the one that is set as the backend
encoding internally, then who knows what will happen when this stuff is
recoded again as it's sent to the client. Big, big mess.

--
Peter Eisentraut peter_e@gmx.net
> GNU gettext does its own encoding conversion. It reads the program's
> character encoding from the LC_CTYPE locale and converts the material in
> the translation catalogs on the fly for output. This is great in general,
> really, but for the postmaster it's a problem. If LC_CTYPE is fixed for
> the cluster and you later on change your mind about the message language,
> then it will be recoded into the character set that LC_CTYPE says. And if
> that character set does not match the one that is set as the backend
> encoding internally, then who knows what will happen when this stuff is
> recoded again as it's sent to the client. Big, big mess.

Then in other words, it's completely broken. Sigh.
--
Tatsuo Ishii