Thread: Re: [BUGS] Bug #659: lower()/upper() bug on ->multibyte<- DB
[Cc:ed to hackers]

(trying select convert(lower(convert('X', 'LATIN1')), 'LATIN1', 'UNICODE');)

> Ok, this is working now (I can't reproduce why it didn't the first time).

Good.

> Is it planned to implement it so that I can write lower()/upper() for multibyte
> according to the SQL standard (without convert)?

SQL standard? The SQL standard says nothing about locale, so making lower() (and
others) "locale aware" is quite different from the SQL standard's point of view.
Of course this does not mean "locale support" should not be a part of PostgreSQL's
implementation of SQL. However, we should be aware of the limitations of "locale
support" (as well as multibyte support). They are just a stopgap until CREATE
CHARACTER SET etc. is implemented, IMO.

> I could do it if you tell me where the final tolower()/toupper() happens.
> (but not before the middle of June).

For a short-term solution, hiding convert() from users might be a good idea (what
I mean here is a kind of automatic execution of convert()). The hardest part is
that there is no obvious way to find the relationship between a particular locale
and the encoding. For example, you know that the LATIN1 encoding is appropriate
for the de_DE locale, but PostgreSQL does not.
--
Tatsuo Ishii
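To make the locale-to-encoding problem concrete, a stopgap for automatic convert()
would need a hard-coded mapping along the lines of the sketch below. This is purely
illustrative: the function name is made up, and every entry except de_DE/LATIN1
(mentioned above) is an assumption, not something PostgreSQL provides.

#include <stdio.h>
#include <string.h>

/*
 * Illustrative stopgap only: guess the single-byte encoding that an
 * automatic convert() could use for a given LC_CTYPE value.  The table
 * entries other than de_DE -> LATIN1 are assumptions for illustration.
 */
static const struct
{
    const char *locale_prefix;
    const char *encoding;       /* PostgreSQL encoding name */
} locale_encoding_map[] = {
    {"de_DE", "LATIN1"},        /* as noted above */
    {"fr_FR", "LATIN1"},        /* assumption */
    {"ja_JP", "EUC_JP"}         /* assumption */
};

static const char *
guess_encoding(const char *lc_ctype)
{
    size_t i;

    for (i = 0; i < sizeof(locale_encoding_map) / sizeof(locale_encoding_map[0]); i++)
    {
        const char *prefix = locale_encoding_map[i].locale_prefix;

        if (strncmp(lc_ctype, prefix, strlen(prefix)) == 0)
            return locale_encoding_map[i].encoding;
    }
    return NULL;                /* unknown: give up, as PostgreSQL does today */
}

int
main(void)
{
    const char *enc = guess_encoding("de_DE");

    printf("de_DE -> %s\n", enc ? enc : "unknown");
    return 0;
}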
Tatsuo Ishii wrote:
> [Cc:ed to hackers]
>
> (trying select convert(lower(convert('X', 'LATIN1')), 'LATIN1', 'UNICODE');)
>
> > Ok, this is working now (I can't reproduce why it didn't the first time).
>
> Good.
>
> > Is it planned to implement it so that I can write lower()/upper() for multibyte
> > according to the SQL standard (without convert)?
>
> SQL standard? The SQL standard says nothing about locale, so making lower() (and
> others) "locale aware" is quite different from the SQL standard's point of view.
> Of course this does not mean "locale support" should not be a part of PostgreSQL's
> implementation of SQL. However, we should be aware of the limitations of "locale
> support" (as well as multibyte support). They are just a stopgap until CREATE
> CHARACTER SET etc. is implemented, IMO.
>
> > I could do it if you tell me where the final tolower()/toupper() happens.
> > (but not before the middle of June).
>
> For a short-term solution, hiding convert() from users might be a good idea (what
> I mean here is a kind of automatic execution of convert()). The hardest part is
> that there is no obvious way to find the relationship between a particular locale
> and the encoding. For example, you know that the LATIN1 encoding is appropriate
> for the de_DE locale, but PostgreSQL does not.

I think it is really not hard to do this for UTF-8. I don't have to know the
relation between the locale and the encoding. Look at this:

We can use the LC_CTYPE from pg_controldata or, alternatively, the LC_CTYPE at
server startup. For nearly every locale (de_DE, ja_JP, ...) there also exists a
locale *.utf8 (de_DE.utf8, ja_JP.utf8, ...), at least for the current Linux glibc.
We don't need to know more than this. If we call

    setlocale(LC_CTYPE, <value of LC_CTYPE extended with .utf8 if not already given>)

then glibc takes care of all the conversions. I attach a small demo program which
sets the locale ja_JP.utf8 and is able to translate German umlaut A (upper) to
German umlaut a (lower).

What I don't know (I'd have to ask a glibc developer) is why there are dozens of
*.utf8 locales and what the difference is between all the files in
/usr/lib/locale/*.utf8/LC_CTYPE. But for all existing *.utf8 locales, the
conversion of German umlauts works properly.

Regards, Michael

PS: I'm not in my office for the next 3 weeks and therefore not able to read my
mails.
#include <stdio.h>
#include <stdlib.h>   /* for exit() */
#include <wchar.h>
#include <wctype.h>
#include <locale.h>

#define LEN 5

int main()
{
    char readInByte[LEN], writeOutByte[LEN];     // holds the character bytes
    const char *readInByteP[] = {readInByte};    // help pointer
    wchar_t readInWC[LEN], writeOutWC[LEN];      // holds the wide characters
    const wchar_t *writeOutWCP[] = {writeOutWC}; // help pointer
    wctrans_t wctransDesc;                       // holds the descriptor for conversion
    int i, ret;
    const char myLocale[] = "ja_JP.utf8";
    char *localeSet;

    readInByte[0] = 0xc3; readInByte[1] = 0x84;  // german umlaut A (upper) in UTF-8
    readInByte[2] = 0xc3; readInByte[3] = 0xa4;  // german umlaut a (lower) in UTF-8
    readInByte[4] = 0;

    // print out the input
    printf("german umlaut A (upper) UTF-8: %hhx %hhx\n", readInByte[0], readInByte[1]);
    printf("german umlaut a (lower) UTF-8: %hhx %hhx\n", readInByte[2], readInByte[3]);

    if((localeSet = setlocale(LC_CTYPE, myLocale)) == NULL) {
        perror("setlocale");
        exit(1);
    }
    else
        printf("locale set: %s\n", localeSet);

    ret = mbsrtowcs(readInWC, readInByteP, LEN, NULL);  // convert bytes to wide chars
    printf("number of wide chars: %i\n", ret);

    wctransDesc = wctrans("tolower");   // get descriptor for wc operation
    if(wctransDesc == 0) {
        perror("wctransDesc");
        exit(1);
    }

    // make the transformation according to descriptor
    i = 0;
    while((writeOutWC[i] = towctrans(readInWC[i], wctransDesc)) != L'\0')
        i++;

    ret = wcsrtombs(writeOutByte, writeOutWCP, LEN, NULL);  // convert wide chars to bytes
    printf("number of bytes: %i\n", ret);

    // print out the result
    printf("german umlaut A tolower(): %hhx %hhx\n", writeOutByte[0], writeOutByte[1]);
    printf("german umlaut a tolower(): %hhx %hhx\n", writeOutByte[2], writeOutByte[3]);

    return 0;
}
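The "<value of LC_CTYPE extended with .utf8 if not already given>" step from the
message above could be built roughly as follows. This is only a sketch: the helper
name and fixed-size buffer are made up, and an LC_CTYPE that already names a
different codeset (e.g. de_DE.iso88591) would need extra handling.

#include <stdio.h>
#include <string.h>

/*
 * Sketch: extend an LC_CTYPE value with ".utf8" unless a codeset is
 * already given.  "de_DE" -> "de_DE.utf8", "ja_JP.utf8" stays as is.
 * An existing non-UTF-8 codeset suffix is left untouched here, which a
 * real implementation would have to deal with.
 */
static void
utf8_locale_name(const char *lc_ctype, char *out, size_t outlen)
{
    if (strchr(lc_ctype, '.') != NULL)
        snprintf(out, outlen, "%s", lc_ctype);
    else
        snprintf(out, outlen, "%s.utf8", lc_ctype);
}

int
main(void)
{
    char buf[64];

    utf8_locale_name("de_DE", buf, sizeof(buf));
    printf("%s\n", buf);        /* prints de_DE.utf8 */

    utf8_locale_name("ja_JP.utf8", buf, sizeof(buf));
    printf("%s\n", buf);        /* prints ja_JP.utf8 */

    return 0;
}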
> I think it is really not hard to do this for UTF-8. I don't have to know the
> relation between the locale and the encoding. Look at this:
> We can use the LC_CTYPE from pg_controldata or, alternatively, the LC_CTYPE at
> server startup. For nearly every locale (de_DE, ja_JP, ...) there also exists a
> locale *.utf8 (de_DE.utf8, ja_JP.utf8, ...), at least for the current Linux glibc.

My Linux box does not have *.utf8 locales at all. Probably not many platforms
have them up to now, I guess.

> We don't need to know more than this. If we call
> setlocale(LC_CTYPE, <value of LC_CTYPE extended with .utf8 if not already given>)
> then glibc takes care of all the conversions. I attach a small demo program which
> sets the locale ja_JP.utf8 and is able to translate German umlaut A (upper) to
> German umlaut a (lower).

Interesting idea, but the problem is that we have to decide on exactly one locale
before initdb. In my understanding, users who want Unicode (UTF-8) tend to use
multiple languages. This is natural, since Unicode claims it can handle several
languages. For example, a user might want a table like this in a UTF-8 database:

create table t1(
    english  text,  -- English message
    germany  text,  -- German message
    japanese text   -- Japanese message
);

If you have set the locale to, say, de_DE, then:

select lower(japanese) from t1;

would be executed in the de_DE.utf8 locale, and I doubt it produces any
meaningful results for Japanese.
--
Tatsuo Ishii
On Tuesday, May 14, 2002 at 03:29, Tatsuo Ishii wrote:
> For example, a user might want a table like this in a UTF-8 database:
>
> create table t1(
>     english  text,  -- English message
>     germany  text,  -- German message
>     japanese text   -- Japanese message
> );

Or just:

CREATE table t1(
    text_locale  varchar,
    text_content text
);

which is my case. Just my 2 cents.

/Jean-Michel POURE
On Tue, 2002-05-14 at 03:29, Tatsuo Ishii wrote:
> > I think it is really not hard to do this for UTF-8. I don't have to know the
> > relation between the locale and the encoding. Look at this:
> > We can use the LC_CTYPE from pg_controldata or, alternatively, the LC_CTYPE at
> > server startup. For nearly every locale (de_DE, ja_JP, ...) there also exists a
> > locale *.utf8 (de_DE.utf8, ja_JP.utf8, ...), at least for the current Linux glibc.
>
> My Linux box does not have *.utf8 locales at all. Probably not many platforms
> have them up to now, I guess.

What Linux do you use? At least newer Red Hat Linuxes have them, and I suspect
that all newer glibcs are capable of using them.

> > We don't need to know more than this. If we call
> > setlocale(LC_CTYPE, <value of LC_CTYPE extended with .utf8 if not already given>)
> > then glibc takes care of all the conversions. I attach a small demo program which
> > sets the locale ja_JP.utf8 and is able to translate German umlaut A (upper) to
> > German umlaut a (lower).
>
> Interesting idea, but the problem is that we have to decide on exactly one locale
> before initdb. In my understanding, users who want Unicode (UTF-8) tend to use
> multiple languages. This is natural, since Unicode claims it can handle several
> languages. For example, a user might want a table like this in a UTF-8 database:
>
> create table t1(
>     english  text,  -- English message
>     germany  text,  -- German message
>     japanese text   -- Japanese message
> );
>
> If you have set the locale to, say, de_DE, then:
>
> select lower(japanese) from t1;
>
> would be executed in the de_DE.utf8 locale, and I doubt it produces any
> meaningful results for Japanese.

IIRC it may, as I think it will include full UTF-8 upper/lower tables, at least
on Linux. For example, en_US will produce the right upper/lower results for
Estonian, though collation is off and some characters are missing if you use
iso-8859-1.

BTW, does the Japanese language have distinct upper and lower case letters?

--------------
Hannu
> > My Linux box does not have *.utf8 locales at all. Probably not many platforms
> > have them up to now, I guess.
>
> What Linux do you use?

A variant of RH6.2, more or less.

> At least newer Red Hat Linuxes have them, and I suspect that all newer glibcs
> are capable of using them.

I guess many RH6.2 or RH6.2-based systems are still surviving...

> > If you have set the locale to, say, de_DE, then:
> >
> > select lower(japanese) from t1;
> >
> > would be executed in the de_DE.utf8 locale, and I doubt it produces any
> > meaningful results for Japanese.
>
> IIRC it may, as I think it will include full UTF-8 upper/lower tables, at least
> on Linux. For example, en_US will produce the right upper/lower results for
> Estonian, though collation is off and some characters are missing if you use
> iso-8859-1.

Are you sure that, say, the de_DE.utf8 locale produces meaningful results for
any other language? If so, why are there so many *.utf8 locales?

> BTW, does the Japanese language have distinct upper and lower case letters?

There are "full-width alphabets" in Japanese. Those include not only ASCII
letters but also some European characters.
--
Tatsuo Ishii
On Tue, 2002-05-14 at 09:52, Tatsuo Ishii wrote:
> Are you sure that, say, the de_DE.utf8 locale produces meaningful results for
> any other language?

There are often subtle differences, but upper() and lower() are much more likely
to produce the right results than collation order or date/money formats.

In fact there seem to be only 10 distinct LC_CTYPE files for ~110 locales, with
most European-originated languages sharing the same one and only tr_TR, zh_??,
fr_??, da_DK, de_??, ro_RO, sr_YU, ja_JP and ko_KR having their own.

> If so, why are there so many *.utf8 locales?

As I understand it, a locale should cover all locale-specific issues.

> > BTW, does the Japanese language have distinct upper and lower case letters?
>
> There are "full-width alphabets" in Japanese. Those include not only ASCII
> letters but also some European characters.

Are these ASCII and European characters uppercased in some Japanese-specific way?

--------------
Hannu
> > Are you sure that, say, the de_DE.utf8 locale produces meaningful results for
> > any other language?
>
> There are often subtle differences, but upper() and lower() are much more likely
> to produce the right results than collation order or date/money formats.
>
> In fact there seem to be only 10 distinct LC_CTYPE files for ~110 locales, with
> most European-originated languages sharing the same one and only tr_TR, zh_??,
> fr_??, da_DK, de_??, ro_RO, sr_YU, ja_JP and ko_KR having their own.

I see. So the remaining problem would be how to detect the existence of a *.utf8
locale at configure time.

> > If so, why are there so many *.utf8 locales?
>
> As I understand it, a locale should cover all locale-specific issues.
>
> > > BTW, does the Japanese language have distinct upper and lower case letters?
> >
> > There are "full-width alphabets" in Japanese. Those include not only ASCII
> > letters but also some European characters.
>
> Are these ASCII and European characters uppercased in some Japanese-specific way?

Probably not, but I'm not sure, since my Linux box does not have *.utf8 locales.
--
Tatsuo Ishii
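Detecting whether a *.utf8 locale is available could amount to probing setlocale()
with the extended name, roughly as in the sketch below. This only illustrates the
idea discussed here; it is not taken from PostgreSQL's configure, and a real check
would restore the previous locale and handle LC_CTYPE values that already carry a
codeset suffix.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>

int
main(void)
{
    char probe[128];
    const char *current;
    char *saved;

    /* Take LC_CTYPE from the environment, as the server would at startup. */
    current = setlocale(LC_CTYPE, "");
    if (current == NULL)
    {
        fprintf(stderr, "cannot set LC_CTYPE from environment\n");
        return 1;
    }
    saved = strdup(current);

    /* Try the ".utf8" variant of the same locale. */
    snprintf(probe, sizeof(probe), "%s.utf8", saved);

    if (setlocale(LC_CTYPE, probe) != NULL)
        printf("%s is available\n", probe);
    else
        printf("%s is not available; LC_CTYPE stays %s\n", probe, saved);

    free(saved);
    return 0;
}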
Tatsuo Ishii wrote:
> > > Are you sure that, say, the de_DE.utf8 locale produces meaningful results
> > > for any other language?
> >
> > There are often subtle differences, but upper() and lower() are much more
> > likely to produce the right results than collation order or date/money formats.
> >
> > In fact there seem to be only 10 distinct LC_CTYPE files for ~110 locales, with
> > most European-originated languages sharing the same one and only tr_TR, zh_??,
> > fr_??, da_DK, de_??, ro_RO, sr_YU, ja_JP and ko_KR having their own.
>
> I see. So the remaining problem would be how to detect the existence of a *.utf8
> locale at configure time.
>
> > > If so, why are there so many *.utf8 locales?
> >
> > As I understand it, a locale should cover all locale-specific issues.
> >
> > > > BTW, does the Japanese language have distinct upper and lower case letters?
> > >
> > > There are "full-width alphabets" in Japanese. Those include not only ASCII
> > > letters but also some European characters.
> >
> > Are these ASCII and European characters uppercased in some Japanese-specific
> > way?
>
> Probably not, but I'm not sure, since my Linux box does not have *.utf8 locales.

Could you give me the UTF-8 byte sequence for one Japanese upper case character
and for the same character in lower case?
I will check in the de_DE locale whether this translation works.

Michael
> > > > There are "full-width alphabets" in Japanese. Those include not only
> > > > ASCII letters but also some European characters.
> > >
> > > Are these ASCII and European characters uppercased in some
> > > Japanese-specific way?
> >
> > Probably not, but I'm not sure, since my Linux box does not have *.utf8
> > locales.
>
> Could you give me the UTF-8 byte sequence for one Japanese upper case character
> and for the same character in lower case?
> I will check in the de_DE locale whether this translation works.

Ok, here is the data you requested. The first three bytes (0xefbca1) represent
the full-width capital "A"; the other three bytes (0xefbd81) represent the
full-width lower case "a".
Tatsuo Ishii wrote:
> > > > > There are "full-width alphabets" in Japanese. Those include not only
> > > > > ASCII letters but also some European characters.
> > > >
> > > > Are these ASCII and European characters uppercased in some
> > > > Japanese-specific way?
> > >
> > > Probably not, but I'm not sure, since my Linux box does not have *.utf8
> > > locales.
> >
> > Could you give me the UTF-8 byte sequence for one Japanese upper case character
> > and for the same character in lower case?
> > I will check in the de_DE locale whether this translation works.
>
> Ok, here is the data you requested. The first three bytes (0xefbca1) represent
> the full-width capital "A"; the other three bytes (0xefbd81) represent the
> full-width lower case "a".

Thank you for the data. It is working in ja_JP.utf8 and in de_DE.utf8.
I send you my test program as an attachment.

Regards, Michael

#include <stdio.h>
#include <stdlib.h>   /* for exit() */
#include <wchar.h>
#include <wctype.h>
#include <locale.h>

#define LEN 7

int main()
{
    char readInByte[LEN], writeOutByte[LEN];     // holds the character bytes
    const char *readInByteP[] = {readInByte};    // help pointer
    wchar_t readInWC[LEN], writeOutWC[LEN];      // holds the wide characters
    const wchar_t *writeOutWCP[] = {writeOutWC}; // help pointer
    wctrans_t wctransDesc;                       // holds the descriptor for conversion
    int i, ret;

    //const char myLocale[] = "ja_JP.utf8";
    const char myLocale[] = "de_DE.utf8";
    char *localeSet;

    readInByte[0] = 0xef; readInByte[1] = 0xbc; readInByte[2] = 0xa1; // full-width A (upper) in UTF-8
    readInByte[3] = 0xef; readInByte[4] = 0xbd; readInByte[5] = 0x81; // full-width a (lower) in UTF-8
    readInByte[6] = 0;

    // print out the input
    printf("full-width A (upper) UTF-8: %hhx %hhx %hhx\n", readInByte[0], readInByte[1], readInByte[2]);
    printf("full-width a (lower) UTF-8: %hhx %hhx %hhx\n", readInByte[3], readInByte[4], readInByte[5]);

    if((localeSet = setlocale(LC_CTYPE, myLocale)) == NULL) {
        perror("setlocale");
        exit(1);
    }
    else
        printf("locale set: %s\n", localeSet);

    ret = mbsrtowcs(readInWC, readInByteP, LEN, NULL);  // convert bytes to wide chars
    printf("number of wide chars: %i\n", ret);

    wctransDesc = wctrans("tolower");   // get descriptor for wc operation
    if(wctransDesc == 0) {
        perror("wctransDesc");
        exit(1);
    }

    // make the transformation according to descriptor
    i = 0;
    while((writeOutWC[i] = towctrans(readInWC[i], wctransDesc)) != L'\0')
        i++;

    ret = wcsrtombs(writeOutByte, writeOutWCP, LEN, NULL);  // convert wide chars to bytes
    printf("number of bytes: %i\n", ret);

    // print out the result
    printf("full-width A tolower(): %hhx %hhx %hhx\n", writeOutByte[0], writeOutByte[1], writeOutByte[2]);
    printf("full-width a tolower(): %hhx %hhx %hhx\n", writeOutByte[3], writeOutByte[4], writeOutByte[5]);

    return 0;
}