Thread: Re: [BUGS] Bug #659: lower()/upper() bug on ->multibyte<- DB

Re: [BUGS] Bug #659: lower()/upper() bug on ->multibyte<- DB

From
Tatsuo Ishii
Date:
[Cc:ed to hackers]

(trying select convert(lower(convert('X', 'LATIN1')),'LATIN1','UNICODE');)

> Ok, this is working now (I cann't reproduce why not at the first time).

Good.

> Is it planned to implement it so that I can write lower()/ upper() for multibyte
> according to SQL standard (without convert)?

SQL standard? The SQL standard says nothing about locale. So making
lower() (and others) "locale aware" is far different from the SQL
standard of point of view. Of course this does not mean "locale
support" is should not be a part of PostgreSQL's implementation of
SQL. However, we should be aware the limitation of "locale support"
(as well as multibyte support). They are just the stopgap util CREATE
CHARACTER SET etc. is implemnted IMO.

> I could do it if you tell me where the final tolower()/toupper() happens.
> (but not before middle of June).

For the short term solution making convert() hiding from users might
be a good idea (what I mean here is kind of auto execution of
convert()). The hardest part is there's no idea how we could find a
relationship bewteen particular locale and the encoding. For example,
you know that for de_DE locale using LATIN1 encoding is appropreate,
but PostgreSQL does not.
--
Tatsuo Ishii


Re: [BUGS] Bug #659: lower()/upper() bug on ->multibyte<- DB

From
"Enke, Michael"
Date:
Tatsuo Ishii wrote:
>
> [Cc:ed to hackers]
>
> (trying select convert(lower(convert('X', 'LATIN1')),'LATIN1','UNICODE');)
>
> > Ok, this is working now (I cann't reproduce why not at the first time).
>
> Good.
>
> > Is it planned to implement it so that I can write lower()/ upper() for multibyte
> > according to SQL standard (without convert)?
>
> SQL standard? The SQL standard says nothing about locale. So making
> lower() (and others) "locale aware" is far different from the SQL
> standard of point of view. Of course this does not mean "locale
> support" is should not be a part of PostgreSQL's implementation of
> SQL. However, we should be aware the limitation of "locale support"
> (as well as multibyte support). They are just the stopgap util CREATE
> CHARACTER SET etc. is implemnted IMO.
>
> > I could do it if you tell me where the final tolower()/toupper() happens.
> > (but not before middle of June).
>
> For the short term solution making convert() hiding from users might
> be a good idea (what I mean here is kind of auto execution of
> convert()). The hardest part is there's no idea how we could find a
> relationship bewteen particular locale and the encoding. For example,
> you know that for de_DE locale using LATIN1 encoding is appropreate,
> but PostgreSQL does not.

I think it is really not hard to do this for UTF-8. I don't have to know the
relation between the locale and the encoding. Look at this:
We can use the LC_CTYPE from pg_controldata or alternatively the LC_CTYPE
at server startup. For nearly every locale (de_DE, ja_JP, ...) there exists
also a locale *.utf8 (de_DE.utf8, ja_JP.utf8, ...) at least for the actual Linux glibc.
We don't need to know more than this. If we call
setlocale(LC_CTYPE, <value of LC_CTYPE extended with .utf8 if not already given>)
then glibc is aware of doing all the conversions. I attach a small demo program
which set the locale ja_JP.utf8 and is able to translate german umlaut A (upper) to
german umlaut a (lower).
What I don't know (have to ask a glibc delveloper) is:
Why there exists dozens of locales *.utf8 and what is the difference
between all /usr/lib/locale/*.utf8/LC_CTYPE?
But for all existing locales *.utf8, the conversion of german umlauts is working properly.

Regards,
Michael

PS: I'm not in my office for the next 3 weeks and therefore not able to read my mails.

#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
#include <locale.h>
#define LEN 5

int main() {
  char readInByte[LEN], writeOutByte[LEN];     // holds the character bytes
  const char *readInByteP[] = {readInByte};    // help pointer
  wchar_t readInWC[LEN], writeOutWC[LEN];      // holds the wide characters
  const wchar_t *writeOutWCP[] = {writeOutWC}; // help pointer
  wctrans_t wctransDesc;                       // holds the descriptor for conversion
  int i, ret;
  const char myLocale[] = "ja_JP.utf8";
  char *localeSet;

  readInByte[0] = 0xc3; readInByte[1] = 0x84;  // german umlaut A (upper) in UTF-8
  readInByte[2] = 0xc3; readInByte[3] = 0xa4;  // german umlaut a (lower) in UTF-8
  readInByte[4] = 0;

  // print out the input
  printf("german umlaut A (upper) UTF-8: %hhx %hhx\n", readInByte[0], readInByte[1]);
  printf("german umlaut a (lower) UTF-8: %hhx %hhx\n", readInByte[2], readInByte[3]);

  if((localeSet = setlocale(LC_CTYPE, myLocale)) == NULL) { perror("setlocale"); exit(1); }
  else printf("locale set: %s\n", localeSet);
  ret = mbsrtowcs(readInWC, readInByteP, LEN, NULL); // convert bytes to wide chars
  printf("number of wide chars: %i\n", ret);
  wctransDesc = wctrans("tolower");            // get descriptor for wc operation
  if(wctransDesc == 0) { perror("wctransDesc"); exit(1); }

  // make the transformation according to descriptor
  i=0; while((writeOutWC[i] = towctrans(readInWC[i], wctransDesc)) != L'\0') i++;

  ret = wcsrtombs(writeOutByte, writeOutWCP, LEN, NULL); // convert wide chars to bytes
  printf("number of bytes: %i\n", ret);

  // print out the result
  printf("german umlaut A tolower(): %hhx %hhx\n", writeOutByte[0], writeOutByte[1]);
  printf("german umlaut a tolower(): %hhx %hhx\n", writeOutByte[2], writeOutByte[3]);

  return 0;
}

Re: [BUGS] Bug #659: lower()/upper() bug on

From
Tatsuo Ishii
Date:
> I think it is really not hard to do this for UTF-8. I don't have to know the
> relation between the locale and the encoding. Look at this:
> We can use the LC_CTYPE from pg_controldata or alternatively the LC_CTYPE
> at server startup. For nearly every locale (de_DE, ja_JP, ...) there exists
> also a locale *.utf8 (de_DE.utf8, ja_JP.utf8, ...) at least for the actual Linux glibc.

My Linux box does not have *.utf8 locales at all. Probably not so many
platforms have them up to now, I guess.

> We don't need to know more than this. If we call
> setlocale(LC_CTYPE, <value of LC_CTYPE extended with .utf8 if not already given>)
> then glibc is aware of doing all the conversions. I attach a small demo program
> which set the locale ja_JP.utf8 and is able to translate german umlaut A (upper) to
> german umlaut a (lower).

Interesting idea, but the problem is we have to decide to use exactly
one locale before initdb. In my understanding, users willing to use
Unicode (UTF-8) tend to use multiple languages. This is natural since
Unicode claims it can handle several languages. For example, user
might want to have a table like this in a UTF-8 database:

create table t1(      english text,    -- English message      germany text,    -- Germany message      japanese text
-- Japanese message
 
);

If you have set the local to, say de_DE, then:

select lower(japanese) from t1;

would be executed in de_DE.utf8 locale, and I doubt it produces any
meaningfull results for Japanese.
--
Tatsuo Ishii


Re: [BUGS] Bug #659: lower()/upper() bug on

From
Jean-Michel POURE
Date:
Le Mardi 14 Mai 2002 03:29, Tatsuo Ishii a écrit :
> For example, user
> might want to have a table like this in a UTF-8 database:
>
> create table t1(
>        english text,    -- English message
>        germany text,    -- Germany message
>        japanese text    -- Japanese message
> );

Or just
CREATE table t1(       text_locale varchar,         text_content text
);
which is my case.
Just my 2 cents.
/Jean-Michel POURE


Re: [BUGS] Bug #659: lower()/upper() bug on

From
Hannu Krosing
Date:
On Tue, 2002-05-14 at 03:29, Tatsuo Ishii wrote:
> > I think it is really not hard to do this for UTF-8. I don't have to know the
> > relation between the locale and the encoding. Look at this:
> > We can use the LC_CTYPE from pg_controldata or alternatively the LC_CTYPE
> > at server startup. For nearly every locale (de_DE, ja_JP, ...) there exists
> > also a locale *.utf8 (de_DE.utf8, ja_JP.utf8, ...) at least for the actual Linux glibc.
> 
> My Linux box does not have *.utf8 locales at all. Probably not so many
> platforms have them up to now, I guess.

What linux do you use ?

At least newer Redhat Linuxen have them and I suspect that all newer
glibc's are capable of using them.

> 
> > We don't need to know more than this. If we call
> > setlocale(LC_CTYPE, <value of LC_CTYPE extended with .utf8 if not already given>)
> > then glibc is aware of doing all the conversions. I attach a small demo program
> > which set the locale ja_JP.utf8 and is able to translate german umlaut A (upper) to
> > german umlaut a (lower).
> 
> Interesting idea, but the problem is we have to decide to use exactly
> one locale before initdb. In my understanding, users willing to use
> Unicode (UTF-8) tend to use multiple languages. This is natural since
> Unicode claims it can handle several languages. For example, user
> might want to have a table like this in a UTF-8 database:
> 
> create table t1(
>        english text,    -- English message
>        germany text,    -- Germany message
>        japanese text    -- Japanese message
> );
> 
> If you have set the local to, say de_DE, then:
> 
> select lower(japanese) from t1;
>
> would be executed in de_DE.utf8 locale, and I doubt it produces any
> meaningfull results for Japanese.

IIRC it may, as I think that it will include full UTF8 upper/lower
tables, at least on Linux.

For example en_US will produce right upper/lower results for Estonian,
though collation is off and some chars are missing if using iso-8859-1.

btw, does Japanese language have distinct upper and lower case letters ?

--------------
Hannu




Re: [BUGS] Bug #659: lower()/upper() bug on

From
Tatsuo Ishii
Date:
> > My Linux box does not have *.utf8 locales at all. Probably not so many
> > platforms have them up to now, I guess.
> 
> What linux do you use ?

Kind of variant of RH6.2.

> At least newer Redhat Linuxen have them and I suspect that all newer
> glibc's are capable of using them.

I guess many RH6.2 or RH6.2 based are still surviving...

> > If you have set the local to, say de_DE, then:
> > 
> > select lower(japanese) from t1;
> >
> > would be executed in de_DE.utf8 locale, and I doubt it produces any
> > meaningfull results for Japanese.
> 
> IIRC it may, as I think that it will include full UTF8 upper/lower
> tables, at least on Linux.
> 
> For example en_US will produce right upper/lower results for Estonian,
> though collation is off and some chars are missing if using iso-8859-1.

Are you sure that say, de_DE.utf8 locale produce meaningful results
for any other languages? If so, why are there so many *.utf8 locales?

> btw, does Japanese language have distinct upper and lower case letters ?

There are "full width alphabets" in Japanese. Thoes include not only
ASCII letters but also some European characters.
--
Tatsuo Ishii


Re: [BUGS] Bug #659: lower()/upper() bug on

From
Hannu Krosing
Date:
On Tue, 2002-05-14 at 09:52, Tatsuo Ishii wrote:
> 
> Are you sure that say, de_DE.utf8 locale produce meaningful results
> for any other languages?

there are often subtle differences, but upper() and lower() are much
more likely to produce right results than collation order or date/money
formats.

in fact seem to be only 10 distinct LC_CTYPE files for ~110 locales with
most european-originated languages having the same and only 
tr_TR, zh_??, fr_??,da_DK, de_??, ro_RO, sr_YU, ja_JP and ko_KR having
their own.

> If so, why are there so many *.utf8 locales?

As I understand it, a locale should cover all locale-specific issues
> > btw, does Japanese language have distinct upper and lower case letters ?
> 
> There are "full width alphabets" in Japanese. Thoes include not only
> ASCII letters but also some European characters.

Are these ASCII and European characters uppercased in some
Japanese-specific way ?

--------------
Hannu



Re: [BUGS] Bug #659: lower()/upper() bug on

From
Tatsuo Ishii
Date:
> > Are you sure that say, de_DE.utf8 locale produce meaningful results
> > for any other languages?
> 
> there are often subtle differences, but upper() and lower() are much
> more likely to produce right results than collation order or date/money
> formats.
> 
> in fact seem to be only 10 distinct LC_CTYPE files for ~110 locales with
> most european-originated languages having the same and only 
> tr_TR, zh_??, fr_??,da_DK, de_??, ro_RO, sr_YU, ja_JP and ko_KR having
> their own.

I see. So the remaining problem would be how to detect the existence
of *.utf8 collation at the configure time.

> > If so, why are there so many *.utf8 locales?
> 
> As I understand it, a locale should cover all locale-specific issues
>  
> > > btw, does Japanese language have distinct upper and lower case letters ?
> > 
> > There are "full width alphabets" in Japanese. Thoes include not only
> > ASCII letters but also some European characters.
> 
> Are these ASCII and European characters uppercased in some
> Japanese-specific way ?

Probably not, but I'm not sure since my Linux box does not have *.utf8
locales.
--
Tatsuo Ishii


Re: [BUGS] Bug #659: lower()/upper() bug on

From
"Enke, Michael"
Date:
Tatsuo Ishii wrote:
> 
> > > Are you sure that say, de_DE.utf8 locale produce meaningful results
> > > for any other languages?
> >
> > there are often subtle differences, but upper() and lower() are much
> > more likely to produce right results than collation order or date/money
> > formats.
> >
> > in fact seem to be only 10 distinct LC_CTYPE files for ~110 locales with
> > most european-originated languages having the same and only
> > tr_TR, zh_??, fr_??,da_DK, de_??, ro_RO, sr_YU, ja_JP and ko_KR having
> > their own.
> 
> I see. So the remaining problem would be how to detect the existence
> of *.utf8 collation at the configure time.
> 
> > > If so, why are there so many *.utf8 locales?
> >
> > As I understand it, a locale should cover all locale-specific issues
> >
> > > > btw, does Japanese language have distinct upper and lower case letters ?
> > >
> > > There are "full width alphabets" in Japanese. Thoes include not only
> > > ASCII letters but also some European characters.
> >
> > Are these ASCII and European characters uppercased in some
> > Japanese-specific way ?
> 
> Probably not, but I'm not sure since my Linux box does not have *.utf8
> locales.

Could you give me the UTF-8 bytecode for one japanese upper case char and
for the same char the lower case?
I will check in de_DE locale if this translations works.

Michael


Re: [BUGS] Bug #659: lower()/upper() bug on

From
Tatsuo Ishii
Date:
> > > > There are "full width alphabets" in Japanese. Thoes include not only
> > > > ASCII letters but also some European characters.
> > >
> > > Are these ASCII and European characters uppercased in some
> > > Japanese-specific way ?
> > 
> > Probably not, but I'm not sure since my Linux box does not have *.utf8
> > locales.
> 
> Could you give me the UTF-8 bytecode for one japanese upper case char and
> for the same char the lower case?
> I will check in de_DE locale if this translations works.

Ok, here is the data you requested. The first three bytes (0xefbca1)
represents full-width capital "A", the rest three bytes (0xefbd81)
represents full-width lower case "a".

Re: [BUGS] Bug #659: lower()/upper() bug on

From
"Enke, Michael"
Date:
Tatsuo Ishii wrote:
>
> > > > > There are "full width alphabets" in Japanese. Thoes include not only
> > > > > ASCII letters but also some European characters.
> > > >
> > > > Are these ASCII and European characters uppercased in some
> > > > Japanese-specific way ?
> > >
> > > Probably not, but I'm not sure since my Linux box does not have *.utf8
> > > locales.
> >
> > Could you give me the UTF-8 bytecode for one japanese upper case char and
> > for the same char the lower case?
> > I will check in de_DE locale if this translations works.
>
> Ok, here is the data you requested. The first three bytes (0xefbca1)
> represents full-width capital "A", the rest three bytes (0xefbd81)
> represents full-width lower case "a".

Thank you for the data, it is working in ja_JP.utf8 and in de_DE.utf8
I send you my test program as attachment.

Regards,
Michael

#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
#include <locale.h>
#define LEN 7

int main() {
  char readInByte[LEN], writeOutByte[LEN];     // holds the character bytes
  const char *readInByteP[] = {readInByte};    // help pointer
  wchar_t readInWC[LEN], writeOutWC[LEN];      // holds the wide characters
  const wchar_t *writeOutWCP[] = {writeOutWC}; // help pointer
  wctrans_t wctransDesc;                       // holds the descriptor for conversion
  int i, ret;
  //const char myLocale[] = "ja_JP.utf8";
  const char myLocale[] = "de_DE.utf8";
  char *localeSet;

  readInByte[0] = 0xef; readInByte[1] = 0xbc; readInByte[2] = 0xa1; // full-width A (upper) in UTF-8
  readInByte[3] = 0xef; readInByte[4] = 0xbd; readInByte[5] = 0x81; // full-width a (lower) in UTF-8
  readInByte[6] = 0;

  // print out the input
  printf("full-width A (upper) UTF-8: %hhx %hhx %hhx\n", readInByte[0], readInByte[1], readInByte[2]);
  printf("full-width a (lower) UTF-8: %hhx %hhx %hhx\n", readInByte[3], readInByte[4], readInByte[5]);

  if((localeSet = setlocale(LC_CTYPE, myLocale)) == NULL) { perror("setlocale"); exit(1); }
  else printf("locale set: %s\n", localeSet);
  ret = mbsrtowcs(readInWC, readInByteP, LEN, NULL); // convert bytes to wide chars
  printf("number of wide chars: %i\n", ret);
  wctransDesc = wctrans("tolower");            // get descriptor for wc operation
  if(wctransDesc == 0) { perror("wctransDesc"); exit(1); }

  // make the transformation according to descriptor
  i=0; while((writeOutWC[i] = towctrans(readInWC[i], wctransDesc)) != L'\0') i++;

  ret = wcsrtombs(writeOutByte, writeOutWCP, LEN, NULL); // convert wide chars to bytes
  printf("number of bytes: %i\n", ret);

  // print out the result
  printf("full-width A tolower():     %hhx %hhx %hhx\n", writeOutByte[0], writeOutByte[1], writeOutByte[2]);
  printf("full-width a tolower():     %hhx %hhx %hhx\n", writeOutByte[3], writeOutByte[4], writeOutByte[5]);

  return 0;
}