Thread: Re: Multibyte support in oracle_compat.c

Re: Multibyte support in oracle_compat.c

From: Tatsuo Ishii
Date: 04 September 2002, 08:02:08

> I found a bug in src/backend/utils/adt/oracle_compat.c, and since your
> name appears there in connection with the multibyte enhancement, I am
> writing to you.
>
> The functions upper, lower and initcap don't work with UTF-8 data that
> contains non-Latin letters. At work I build databases for Russian users,
> and when I tried to use the Unicode encoding for a database with the
> Russian alphabet, these functions didn't work. So I wrote some patches,
> because I don't think the problem lies in one shell variable or another,
> such as LANG or LC_CTYPE.
>
> As I don't know any languages other than Russian and English, I wrote a
> small test (test.tar.gz) only for them. Run it before and after patching
> and feel the difference :). By the way, do the encodings (and the
> corresponding languages) EUC_JP, EUC_CN, EUC_KR and EUC_TW have the
> logical operations upper, lower and initcap?
>
> regards, Eugene.

For EUC_JP, there is no upper, lower or initcap. I'm not sure about
other languages.

> P.S. It doesn't seem bad to me to use libunicode instead of functions
> like mbtowc and wctomb from stdlib and towupper and towlower from wctype,
> but maybe somebody will find a solution based on them or on another
> library?

I'm not sure. What do you think, Peter or other guys who are familiar
with Unicode?

BTW, I don't like your patches. If there's no unicode.h, configure
aborts with:

configure: error: header file <unicode.h> is required for unicode support

which does not seem acceptable to me. I suggest you #ifdef out the Unicode
upper, lower and initcap support if libunicode and/or unicode.h are not
found on the system.
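
A minimal sketch of that kind of guard, not the actual patch. HAVE_UNICODE_H
is a hypothetical macro that configure would define only when it finds
<unicode.h>; str_upper, str_upper_fallback and
unicode_case_support_available are illustrative names. The libunicode calls
themselves are left out because the patch's use of that API is not shown in
this thread.

```c
#include <ctype.h>

/* HAVE_UNICODE_H is hypothetical: assume configure defines it only when
 * <unicode.h> was found, instead of aborting when the header is missing. */
#ifdef HAVE_UNICODE_H
#include <unicode.h>
#define UNICODE_CASE_SUPPORT 1
#else
#define UNICODE_CASE_SUPPORT 0
#endif

/* Report whether the Unicode-aware code path was compiled in. */
int unicode_case_support_available(void)
{
    return UNICODE_CASE_SUPPORT;
}

/* Byte-wise, locale-dependent uppercasing: the behavior that remains
 * when the Unicode support is compiled out. */
void str_upper_fallback(char *s)
{
    for (; *s; s++)
        *s = (char) toupper((unsigned char) *s);
}

void str_upper(char *s)
{
#if UNICODE_CASE_SUPPORT
    /* A libunicode-based conversion would be called here. This sketch
     * contains no such call, so both builds use the fallback below. */
#endif
    str_upper_fallback(s);
}
```

With this shape a build without the header still compiles and keeps the
existing single-byte behavior, and only the Unicode-specific path is lost.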
--
Tatsuo Ishii

(I have included the patches for review purposes)

Re: Multibyte support in oracle_compat.c

From: Peter Eisentraut
Date: 04 September 2002, 18:43:50

Tatsuo Ishii writes:

> > The functions upper, lower and initcap don't work with UTF-8 data

The backend routines use the host OS locales, so look there.  On my
machine I have several Russian locales, which seem to address the issue of
character sets:

ru_RU
ru_RU.koi8r
ru_RU.utf8
ru_UA
russian

This is bogus, because the LC_CTYPE choice is cluster-wide and the
encoding choice is database-specific (in other words: it's broken), but
there's nothing we can do about that right now.
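
To make the locale dependency concrete, here is a rough sketch (not the code
in oracle_compat.c) of uppercasing through the standard multibyte and
wide-character functions. The name mb_upper is illustrative. The result
depends entirely on the LC_CTYPE setting of the process: the same bytes are
interpreted differently, or rejected, under a locale whose character set
does not match the data. The sketch relies on the POSIX behavior that
mbstowcs and wcstombs return the required length when the destination is
NULL.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Uppercase a multibyte string according to the current LC_CTYPE locale.
 * Returns a malloc'd string that the caller must free, or NULL if the
 * input is not valid in the current locale or memory runs out. */
char *mb_upper(const char *src)
{
    size_t   nwc;
    size_t   nbytes;
    size_t   i;
    wchar_t *wbuf;
    char    *out;

    /* First pass: count the wide characters needed. */
    nwc = mbstowcs(NULL, src, 0);
    if (nwc == (size_t) -1)
        return NULL;

    wbuf = malloc((nwc + 1) * sizeof(wchar_t));
    if (wbuf == NULL)
        return NULL;
    mbstowcs(wbuf, src, nwc + 1);

    /* Case conversion happens on wide characters, per LC_CTYPE. */
    for (i = 0; i < nwc; i++)
        wbuf[i] = (wchar_t) towupper((wint_t) wbuf[i]);

    /* Convert back to the locale's multibyte encoding. */
    nbytes = wcstombs(NULL, wbuf, 0);
    if (nbytes == (size_t) -1)
    {
        free(wbuf);
        return NULL;
    }
    out = malloc(nbytes + 1);
    if (out != NULL)
        wcstombs(out, wbuf, nbytes + 1);
    free(wbuf);
    return out;
}
```

For UTF-8 Russian text a caller would first need something like
setlocale(LC_CTYPE, "ru_RU.utf8"), assuming that locale is installed on the
host.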

> > P.S. It doesn't seem bad to me to use libunicode instead of functions
> > like mbtowc and wctomb from stdlib and towupper and towlower from wctype
>
> I'm not sure. What do you think, Peter or other guys who are familiar
> with Unicode?

I don't know what libunicode is, but that shouldn't prevent us from
possibly evaluating it. :-)

Btw., I just happened to think about this very issue over the last few
days.  What I would like to attack for the next release is to implement
character classification and conversion using the Unicode tables so we can
cut the LC_CTYPE system locale out of the picture.  Perhaps this is what
the poster was thinking of, too.
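
As a toy illustration of the table-driven idea, the sketch below uppercases
UTF-8 text without consulting any locale. It hard-codes only ASCII and the
basic Cyrillic ranges (U+0430..U+044F and U+0450..U+045F), whereas a real
implementation would presumably generate its mappings from the Unicode
character database. The function names are illustrative and do not come
from any patch in this thread.

```c
#include <stdint.h>

/* Map a code point to uppercase. Covers ASCII and basic Cyrillic only;
 * every other code point is returned unchanged. */
static uint32_t cp_upper(uint32_t cp)
{
    if (cp >= 'a' && cp <= 'z')
        return cp - 0x20;
    if (cp >= 0x0430 && cp <= 0x044F)   /* U+0430..U+044F -> U+0410..U+042F */
        return cp - 0x20;
    if (cp >= 0x0450 && cp <= 0x045F)   /* U+0450..U+045F -> U+0400..U+040F */
        return cp - 0x50;
    return cp;
}

/* Uppercase a NUL-terminated UTF-8 string in place. Only one- and
 * two-byte sequences are mapped; the mappings above never change the
 * encoded length, so rewriting in place is safe. Bytes belonging to
 * longer or malformed sequences are left as they are. */
void utf8_upper_inplace(unsigned char *s)
{
    while (*s)
    {
        if (*s < 0x80)
        {
            *s = (unsigned char) cp_upper(*s);
            s++;
        }
        else if ((*s & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80)
        {
            /* Decode the two-byte sequence, map it, and re-encode it. */
            uint32_t cp = ((uint32_t) (*s & 0x1F) << 6) | (uint32_t) (s[1] & 0x3F);

            cp = cp_upper(cp);
            s[0] = (unsigned char) (0xC0 | (cp >> 6));
            s[1] = (unsigned char) (0x80 | (cp & 0x3F));
            s += 2;
        }
        else
            s++;
    }
}
```

Because nothing here calls setlocale or the wctype functions, the result is
the same whatever LC_CTYPE the cluster was initialized with.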

-- 
Peter Eisentraut   peter_e@gmx.net

Re: Multibyte support in oracle_compat.c

From: "Serguei A. Mokhov"
Date: 04 September 2002, 18:55:01

On Thu, 5 Sep 2002, Peter Eisentraut wrote:

> Date: Thu, 5 Sep 2002 00:46:39 +0200 (CEST)
> From: Peter Eisentraut <peter_e@gmx.net>
> To: Tatsuo Ishii <t-ishii@sra.co.jp>
> Cc: pgsql-hackers@postgresql.org, eutm@yandex.ru
> Subject: Re: [HACKERS] Multibyte support in oracle_compat.c
>
> Tatsuo Ishii writes:
>
> > > The functions upper, lower and initcap don't work with UTF-8 data
>
> The backend routines use the host OS locales, so look there.  On my
> machine I have several Russian locales, which seem to address the issue of
> character sets:
>
> ru_RU
> ru_RU.koi8r
> ru_RU.utf8
> ru_UA
> russian

Yeah, our character sets are a major pain for internationalization. And the
above list is not exhaustive. I guess you are right, for the time being
you'll have to bear with it.

-s

Re: Multibyte support in oracle_compat.c

From: Tatsuo Ishii
Date: 04 September 2002, 21:09:50

> The backend routines use the host OS locales, so look there.  On my
> machine I have several Russian locales, which seem to address the issue of
> character sets:
> 
> ru_RU
> ru_RU.koi8r
> ru_RU.utf8
> ru_UA
> russian
> 
> This is bogus, because the LC_CTYPE choice is cluster-wide and the
> encoding choice is database-specific (in other words: it's broken), but
> there's nothing we can do about that right now.

I thought his idea was to use a UTF-8 locale together with a Unicode
(UTF-8) encoded database.

> Btw., I just happened to think about this very issue over the last few
> days.  What I would like to attack for the next release is to implement
> character classification and conversion using the Unicode tables so we can
> cut the LC_CTYPE system locale out of the picture.  Perhaps this is what
> the poster was thinking of, too.

Interesting idea. If you are saying that you are going to remove the
dependency on the system locale, I agree with your idea.

BTW, nls has the same problem as above, no? I guess nls depends on the
locale and it may conflict with the database-specific encoding and/or the
automatic FE/BE encoding conversion.
--
Tatsuo Ishii

Re: Multibyte support in oracle_compat.c

From: Peter Eisentraut
Date: 05 September 2002, 17:33:48

Tatsuo Ishii writes:

> BTW, nls has the same problem as above, no? I guess nls depends on the
> locale and it may conflict with the database-specific encoding and/or the
> automatic FE/BE encoding conversion.

GNU gettext does its own encoding conversion.  It reads the program's
character encoding from the LC_CTYPE locale and converts the material in
the translation catalogs on the fly for output.  This is great in general,
really, but for the postmaster it's a problem.  If LC_CTYPE is fixed for
the cluster and you later on change your mind about the message language
then it will be recoded into the character set that LC_CTYPE says.  And if
that character set does not match the one that is set as the backend
encoding internally then who knows what will happen when this stuff is
recoded again as it's sent to the client.  Big, big mess.
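
A small sketch of the mismatch check implied here, assuming a POSIX system:
nl_langinfo(CODESET) reports the character set that the current LC_CTYPE
locale implies, and that name can be compared with the encoding the backend
uses internally. The function names and the plain string comparison are
illustrative assumptions; real encoding names have aliases (for example
"UTF-8" and "utf8") that a naive comparison like this one misses.

```c
#include <langinfo.h>
#include <string.h>

/* Name of the character set implied by the current LC_CTYPE locale. */
const char *ctype_codeset(void)
{
    return nl_langinfo(CODESET);
}

/* Return 1 if the locale's character set has exactly the given name,
 * 0 otherwise. backend_encoding stands in for whatever name the server
 * uses for its internal encoding (an assumption of this sketch). */
int codeset_matches(const char *backend_encoding)
{
    return strcmp(ctype_codeset(), backend_encoding) == 0;
}
```

A program would call setlocale(LC_CTYPE, "") first so that the environment's
locale, rather than the default "C" locale, is the one being inspected.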

-- 
Peter Eisentraut   peter_e@gmx.net

Re: Multibyte support in oracle_compat.c

From: Tatsuo Ishii
Date: 05 September 2002, 21:22:18

> GNU gettext does its own encoding conversion.  It reads the program's
> character encoding from the LC_CTYPE locale and converts the material in
> the translation catalogs on the fly for output.  This is great in general,
> really, but for the postmaster it's a problem.  If LC_CTYPE is fixed for
> the cluster and you later on change your mind about the message language
> then it will be recoded into the character set that LC_CTYPE says.  And if
> that character set does not match the one that is set as the backend
> encoding internally then who knows what will happen when this stuff is
> recoded again as it's sent to the client.  Big, big mess.

In other words, it's completely broken. Sigh.
--
Tatsuo Ishii