Re: [BUGS] Bug #659: lower()/upper() bug on ->multibyte<- DB - Mailing list pgsql-hackers
From | Enke, Michael |
---|---|
Subject | Re: [BUGS] Bug #659: lower()/upper() bug on ->multibyte<- DB |
Date | |
Msg-id | 3CDF8E01.DC0B2817@wincor-nixdorf.com |
In response to | Re: [BUGS] Bug #659: lower()/upper() bug on ->multibyte<- DB (Tatsuo Ishii <t-ishii@sra.co.jp>) |
Responses |
Re: [BUGS] Bug #659: lower()/upper() bug on
|
List | pgsql-hackers |
Tatsuo Ishii wrote:
>
> [Cc:ed to hackers]
>
> (trying select convert(lower(convert('X', 'LATIN1')),'LATIN1','UNICODE');)
>
> > Ok, this is working now (I can't reproduce why not at the first time).
>
> Good.
>
> > Is it planned to implement it so that I can write lower()/ upper() for multibyte
> > according to SQL standard (without convert)?
>
> SQL standard? The SQL standard says nothing about locale. So making
> lower() (and others) "locale aware" is far different from the SQL
> standard's point of view. Of course this does not mean "locale
> support" should not be a part of PostgreSQL's implementation of
> SQL. However, we should be aware of the limitations of "locale support"
> (as well as multibyte support). They are just a stopgap until CREATE
> CHARACTER SET etc. is implemented IMO.
>
> > I could do it if you tell me where the final tolower()/toupper() happens.
> > (but not before middle of June).
>
> For the short term solution making convert() hiding from users might
> be a good idea (what I mean here is kind of auto execution of
> convert()). The hardest part is there's no idea how we could find a
> relationship between particular locale and the encoding. For example,
> you know that for de_DE locale using LATIN1 encoding is appropriate,
> but PostgreSQL does not.

I think it is really not hard to do this for UTF-8. I don't have to know
the relation between the locale and the encoding. Look at this:

We can use the LC_CTYPE from pg_controldata or alternatively the LC_CTYPE
at server startup. For nearly every locale (de_DE, ja_JP, ...) there also
exists a locale *.utf8 (de_DE.utf8, ja_JP.utf8, ...), at least in the
current Linux glibc. We don't need to know more than this. If we call
setlocale(LC_CTYPE, <value of LC_CTYPE extended with .utf8 if not already given>)
then glibc takes care of all the conversions.

I attach a small demo program which sets the locale ja_JP.utf8 and is able
to translate German umlaut A (upper) to German umlaut a (lower).
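
The "extend LC_CTYPE with .utf8 if not already given" step might be sketched roughly as below. This is only an illustration: utf8_locale_name() is a hypothetical helper, not existing PostgreSQL code, and it assumes that an existing codeset or modifier (as in "de_DE.ISO-8859-1" or "de_DE@euro") can simply be replaced by ".utf8". Whether the resulting locale is actually installed still has to be checked with setlocale().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not PostgreSQL code: derive a UTF-8 locale name
 * from an LC_CTYPE value, e.g. "de_DE" -> "de_DE.utf8".
 * Assumption: any existing codeset or modifier ("de_DE.ISO-8859-1",
 * "de_DE@euro") is dropped and replaced by ".utf8".
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int utf8_locale_name(const char *lc_ctype, char *out, size_t outlen)
{
    size_t baselen;
    int n;

    assert(lc_ctype != NULL && out != NULL);

    /* Length of the language_TERRITORY part, up to the first '.' or '@'. */
    baselen = strcspn(lc_ctype, ".@");

    /* snprintf reports the length it wanted, so truncation is detectable. */
    n = snprintf(out, outlen, "%.*s.utf8", (int) baselen, lc_ctype);
    return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
}
```

The result would then be handed to setlocale(LC_CTYPE, name), with a NULL return treated as "no such locale on this system".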
What I don't know (I have to ask a glibc developer) is: why do dozens of
*.utf8 locales exist, and what is the difference between all the
/usr/lib/locale/*.utf8/LC_CTYPE files? But for all existing *.utf8 locales,
the conversion of German umlauts works properly.

Regards,
Michael

PS: I'm not in my office for the next 3 weeks and therefore not able to read my mails.

    #include <stdio.h>
    #include <stdlib.h>   // exit()
    #include <wchar.h>
    #include <wctype.h>
    #include <locale.h>

    #define LEN 5

    int main()
    {
        char readInByte[LEN], writeOutByte[LEN];      // holds the character bytes
        const char *readInByteP[] = {readInByte};     // help pointer
        wchar_t readInWC[LEN], writeOutWC[LEN];       // holds the wide characters
        const wchar_t *writeOutWCP[] = {writeOutWC};  // help pointer
        wctrans_t wctransDesc;                        // holds the descriptor for conversion
        int i, ret;
        const char myLocale[] = "ja_JP.utf8";
        char *localeSet;

        readInByte[0] = 0xc3; readInByte[1] = 0x84;   // german umlaut A (upper) in UTF-8
        readInByte[2] = 0xc3; readInByte[3] = 0xa4;   // german umlaut a (lower) in UTF-8
        readInByte[4] = 0;

        // print out the input
        printf("german umlaut A (upper) UTF-8: %hhx %hhx\n", readInByte[0], readInByte[1]);
        printf("german umlaut a (lower) UTF-8: %hhx %hhx\n", readInByte[2], readInByte[3]);

        if((localeSet = setlocale(LC_CTYPE, myLocale)) == NULL) {
            perror("setlocale");
            exit(1);
        }
        else
            printf("locale set: %s\n", localeSet);

        // convert bytes to wide chars
        ret = (int) mbsrtowcs(readInWC, readInByteP, LEN, NULL);
        if(ret < 0) {                                 // invalid multibyte sequence
            perror("mbsrtowcs");
            exit(1);
        }
        printf("number of wide chars: %i\n", ret);

        wctransDesc = wctrans("tolower");             // get descriptor for wc operation
        if(wctransDesc == 0) {
            perror("wctransDesc");
            exit(1);
        }

        // make the transformation according to descriptor
        i = 0;
        while((writeOutWC[i] = towctrans(readInWC[i], wctransDesc)) != L'\0')
            i++;

        // convert wide chars to bytes
        ret = (int) wcsrtombs(writeOutByte, writeOutWCP, LEN, NULL);
        if(ret < 0) {                                 // character not representable
            perror("wcsrtombs");
            exit(1);
        }
        printf("number of bytes: %i\n", ret);

        // print out the result
        printf("german umlaut A tolower(): %hhx %hhx\n", writeOutByte[0], writeOutByte[1]);
        printf("german umlaut a tolower(): %hhx %hhx\n", writeOutByte[2], writeOutByte[3]);

        return 0;
    }
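
The demo works on a fixed five-byte buffer. For strings of arbitrary length the same round trip (multibyte to wide chars, case mapping, back to multibyte) could look roughly like the sketch below. mb_lower() is a hypothetical name, not a PostgreSQL function. It uses towlower(), which the C standard defines as equivalent to towctrans(wc, wctrans("tolower")), and it relies on the POSIX behaviour that mbstowcs() and wcstombs() return the required length when the destination is NULL. The result depends on whatever LC_CTYPE locale is currently set.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper, not PostgreSQL code: lower-case a multibyte string
 * according to the current LC_CTYPE locale. Returns a malloc'ed string
 * that the caller must free, or NULL on an invalid byte sequence or when
 * memory runs out.
 */
static char *mb_lower(const char *in)
{
    size_t nwc, nbytes, i;
    wchar_t *wbuf;
    char *out;

    nwc = mbstowcs(NULL, in, 0);            /* count the wide chars first */
    if (nwc == (size_t) -1)
        return NULL;

    wbuf = malloc((nwc + 1) * sizeof(wchar_t));
    if (wbuf == NULL)
        return NULL;
    mbstowcs(wbuf, in, nwc + 1);            /* bytes -> wide chars */

    for (i = 0; i < nwc; i++)               /* per-character case mapping */
        wbuf[i] = (wchar_t) towlower((wint_t) wbuf[i]);

    nbytes = wcstombs(NULL, wbuf, 0);       /* size of the result in bytes */
    if (nbytes == (size_t) -1) {
        free(wbuf);
        return NULL;
    }

    out = malloc(nbytes + 1);
    if (out != NULL)
        wcstombs(out, wbuf, nbytes + 1);    /* wide chars -> bytes */

    free(wbuf);
    return out;
}
```

Sizing the output separately matters because the lower-case form of a character is not guaranteed to occupy the same number of bytes as the upper-case form.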