Re: [BUGS] Bug #659: lower()/upper() bug on ->multibyte<- DB - Mailing list pgsql-hackers

From Enke, Michael
Subject Re: [BUGS] Bug #659: lower()/upper() bug on ->multibyte<- DB
Msg-id 3CDF8E01.DC0B2817@wincor-nixdorf.com
In response to Re: [BUGS] Bug #659: lower()/upper() bug on ->multibyte<- DB  (Tatsuo Ishii <t-ishii@sra.co.jp>)
Responses Re: [BUGS] Bug #659: lower()/upper() bug on
List pgsql-hackers
Tatsuo Ishii wrote:
>
> [Cc:ed to hackers]
>
> (trying select convert(lower(convert('X', 'LATIN1')),'LATIN1','UNICODE');)
>
> > Ok, this is working now (I can't reproduce why it did not work the first time).
>
> Good.
>
> > Is it planned to implement it so that I can write lower()/ upper() for multibyte
> > according to SQL standard (without convert)?
>
> SQL standard? The SQL standard says nothing about locale. So making
> lower() (and others) "locale aware" is a quite different matter from the
> SQL standard's point of view. Of course this does not mean "locale
> support" should not be a part of PostgreSQL's implementation of
> SQL. However, we should be aware of the limitations of "locale support"
> (as well as multibyte support). They are just a stopgap until CREATE
> CHARACTER SET etc. is implemented, IMO.
>
> > I could do it if you tell me where the final tolower()/toupper() happens.
> > (but not before middle of June).
>
> For the short-term solution, hiding convert() from users might
> be a good idea (what I mean here is a kind of automatic execution of
> convert()). The hardest part is that we have no way to find the
> relationship between a particular locale and its encoding. For example,
> you know that LATIN1 is the appropriate encoding for the de_DE locale,
> but PostgreSQL does not.

I think this is really not hard to do for UTF-8, and I don't have to know the
relation between the locale and the encoding. Look at this:
We can use the LC_CTYPE from pg_controldata or, alternatively, the LC_CTYPE
at server startup. For nearly every locale (de_DE, ja_JP, ...) there also
exists a *.utf8 variant (de_DE.utf8, ja_JP.utf8, ...), at least in the current Linux glibc.
We don't need to know more than this. If we call
setlocale(LC_CTYPE, <value of LC_CTYPE extended with .utf8 if not already given>)
then glibc can do all the conversions. I attach a small demo program
which sets the locale ja_JP.utf8 and is able to translate German umlaut A (upper) to
German umlaut a (lower).
What I don't know (I have to ask a glibc developer) is:
Why do dozens of *.utf8 locales exist, and what is the difference
between all the /usr/lib/locale/*.utf8/LC_CTYPE files?
But for all the existing *.utf8 locales, the conversion of German umlauts works properly.
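
The setlocale() argument described above (the LC_CTYPE value, extended with
.utf8 if not already given) could be built roughly as in the sketch below.
The function name utf8_locale_name is hypothetical and not part of PostgreSQL
or glibc. As a simplifying assumption, only the suffixes ".utf8" and ".UTF-8"
count as "already given"; names that carry another codeset or an @modifier
are not handled.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if s ends with suffix, 0 otherwise. */
static int ends_with(const char *s, const char *suffix)
{
    size_t ls = strlen(s), lx = strlen(suffix);
    return ls >= lx && strcmp(s + ls - lx, suffix) == 0;
}

/*
 * Hypothetical helper: writes into out the locale name to hand to
 * setlocale(LC_CTYPE, ...).  Assumption: a name is already a UTF-8 locale
 * only if it ends in ".utf8" or ".UTF-8"; anything else gets ".utf8" appended.
 * Returns 0 on success, -1 if out is too small.
 */
int utf8_locale_name(const char *ctype, char *out, size_t outlen)
{
    int n;

    if (ends_with(ctype, ".utf8") || ends_with(ctype, ".UTF-8"))
        n = snprintf(out, outlen, "%s", ctype);       /* keep the name as it is */
    else
        n = snprintf(out, outlen, "%s.utf8", ctype);  /* extend with .utf8 */

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The resulting name may still not be installed on a given system, so the
caller would need to check the return value of setlocale() and fall back
to the unmodified LC_CTYPE if it is NULL.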

Regards,
Michael

PS: I'm out of the office for the next 3 weeks and therefore not able to read my mail.

#include <stdio.h>
#include <stdlib.h>                            // exit()
#include <wchar.h>
#include <wctype.h>
#include <locale.h>
#define LEN 5

int main(void) {
  char readInByte[LEN], writeOutByte[LEN];     // holds the character bytes
  const char *readInByteP[] = {readInByte};    // help pointer, advanced by mbsrtowcs()
  wchar_t readInWC[LEN], writeOutWC[LEN];      // holds the wide characters
  const wchar_t *writeOutWCP[] = {writeOutWC}; // help pointer, advanced by wcsrtombs()
  wctrans_t wctransDesc;                       // holds the descriptor for conversion
  size_t i, ret;
  const char myLocale[] = "ja_JP.utf8";
  char *localeSet;

  readInByte[0] = (char)0xc3; readInByte[1] = (char)0x84;  // german umlaut A (upper) in UTF-8
  readInByte[2] = (char)0xc3; readInByte[3] = (char)0xa4;  // german umlaut a (lower) in UTF-8
  readInByte[4] = 0;

  // print out the input
  printf("german umlaut A (upper) UTF-8: %hhx %hhx\n", readInByte[0], readInByte[1]);
  printf("german umlaut a (lower) UTF-8: %hhx %hhx\n", readInByte[2], readInByte[3]);

  // setlocale() does not set errno, so report the failure directly instead of via perror()
  if((localeSet = setlocale(LC_CTYPE, myLocale)) == NULL) {
    fprintf(stderr, "setlocale: locale %s is not available\n", myLocale); exit(1);
  }
  printf("locale set: %s\n", localeSet);
  ret = mbsrtowcs(readInWC, readInByteP, LEN, NULL); // convert bytes to wide chars
  if(ret == (size_t)-1) { perror("mbsrtowcs"); exit(1); }
  printf("number of wide chars: %zu\n", ret);
  wctransDesc = wctrans("tolower");            // get descriptor for wc operation
  if(wctransDesc == 0) { perror("wctrans"); exit(1); }

  // make the transformation according to descriptor (copies the terminating L'\0' too)
  i=0; while((writeOutWC[i] = towctrans(readInWC[i], wctransDesc)) != L'\0') i++;

  ret = wcsrtombs(writeOutByte, writeOutWCP, LEN, NULL); // convert wide chars to bytes
  if(ret == (size_t)-1) { perror("wcsrtombs"); exit(1); }
  printf("number of bytes: %zu\n", ret);

  // print out the result
  printf("german umlaut A tolower(): %hhx %hhx\n", writeOutByte[0], writeOutByte[1]);
  printf("german umlaut a tolower(): %hhx %hhx\n", writeOutByte[2], writeOutByte[3]);

  return 0;
}
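
To check on a given machine whether a particular *.utf8 locale lowercases
the German umlaut correctly, a small probe like the following sketch could
be used. The function name locale_lowers_a_umlaut is hypothetical. The probe
assumes that wchar_t values are Unicode code points, as they are in glibc;
on platforms where that does not hold, the result is not meaningful.

```c
#include <locale.h>
#include <string.h>
#include <wctype.h>

/*
 * Hypothetical probe: returns 1 if, under the LC_CTYPE locale `name`,
 * towlower() maps U+00C4 (A umlaut) to U+00E4 (a umlaut), 0 if it does not,
 * and -1 if the locale cannot be set.  The previous LC_CTYPE setting is
 * restored before returning.
 * Assumption: wchar_t holds Unicode code points (true for glibc).
 */
int locale_lowers_a_umlaut(const char *name)
{
    char saved[256];
    const char *cur = setlocale(LC_CTYPE, NULL);   /* query only, no change */
    int result;

    if (cur == NULL || strlen(cur) >= sizeof saved)
        return -1;
    strcpy(saved, cur);            /* copy: later setlocale() calls may overwrite it */

    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;                 /* locale not installed or not accepted */

    result = (towlower((wint_t)0xC4) == (wint_t)0xE4) ? 1 : 0;

    setlocale(LC_CTYPE, saved);    /* restore the previous setting */
    return result;
}
```

Calling it for every directory name under /usr/lib/locale that ends in
.utf8 would show which of the installed locales behave as described.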
