Thread: verifying unicode locale support
Hi there,

triggered by the recent questions about sorting, I started digging into my problems with upper('ä')='ä' when using LC_CTYPE and LANG = de_DE.UTF-8.

I have checked with Java (toUpperCase()) and C (see attached program, might help others) that my locale is working, but postgres (initdb and postmaster running with LANG=de_DE.utf8, -E UNICODE) still insists that upper('ä') equals 'ä'.

What else can be wrong?

Mit freundlichem Gruß / With kind regards
Holger Klawitter
--
lists <at> klawitter <dot> de

------snip------
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>   /* declares towupper() */

int main()
{
    /* Take the locale from the environment (LANG, LC_CTYPE, LC_ALL). */
    if (!setlocale(LC_CTYPE, "")) {
        fprintf(stderr, "Can't set the specified locale! "
                        "Check LANG, LC_CTYPE, LC_ALL.\n");
        return 1;
    }
    const wchar_t *text = L"ä";
    /* %x expects unsigned int, so cast the wide-character values. */
    printf("is: towupper(%x) = %x\n",
           (unsigned) text[0], (unsigned) towupper(text[0]));
    return 0;
}
Holger Klawitter <lists@klawitter.de> writes:
> I have checked with Java (toUpperCase()) and C (see attached program, might
> help others) that my locale is working, but postgres (initdb and postmaster
> running with LANG=de_DE.utf8, -E UNICODE) still insists that upper('ä')
> equals 'ä'. What else can be wrong?

What byte string are you really entering here? What's coming through in your email is \344 ... which is not valid UTF8. But I suspect something may have translated it before it got to my inbox.

regards, tom lane
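The point about \344 can be checked mechanically. Byte \344 (0xE4) is 'ä' in Latin-1, but in UTF-8 it is the lead byte of a three-byte sequence, so on its own it is malformed; the UTF-8 form of 'ä' is the two bytes \303\244. Below is a minimal sketch of such a check. The function name is_valid_utf8 is hypothetical (it is not a PostgreSQL or libc function), and it only checks the lead/continuation byte structure, not every rule of the encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PostgreSQL or libc: returns 1 if the
 * n bytes at s are structurally valid UTF-8 (each lead byte is followed
 * by the right number of continuation bytes), 0 otherwise.
 * Simplification: some overlong forms and surrogates are not rejected. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < n) {
        size_t need;                  /* continuation bytes expected */

        if (s[i] < 0x80)
            need = 0;                 /* ASCII */
        else if (s[i] >= 0xC2 && s[i] <= 0xDF)
            need = 1;                 /* two-byte sequence */
        else if (s[i] >= 0xE0 && s[i] <= 0xEF)
            need = 2;                 /* three-byte sequence */
        else if (s[i] >= 0xF0 && s[i] <= 0xF4)
            need = 3;                 /* four-byte sequence */
        else
            return 0;                 /* stray continuation or invalid lead */

        if (n - i - 1 < need)
            return 0;                 /* sequence is cut off */
        for (size_t k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;             /* not a continuation byte */
        i += need + 1;
    }
    return 1;
}
```

With this check, the two bytes "\303\244" are accepted and the single byte "\344" is rejected, which matches the diagnosis in the message.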
Holger Klawitter wrote:
> I have checked with Java (toUpperCase()) and C (see attached program,
> might help others) that my locale is working, but postgres (initdb
> and postmaster running with LANG=de_DE.utf8, -E UNICODE) still
> insists that upper('ä') equals 'ä'. What else can be wrong?

PostgreSQL, case conversion, and Unicode don't work together. Pick any two. :-)
> What byte string are you really entering here? What's coming through in
> your email is \344 ... which is not valid UTF8. But I suspect something
> may have translated it before it got to my inbox.

Damn charsets :-)

The character indeed was \344 aka "ä", but my mailer sends Latin-1, not Unicode.

In order to avoid interaction with gcc, cat, and others, I've written a new program that reads from a file.

gcc -o unicode unicode.c
LC_CTYPE=de_DE.utf8 ./unicode uni.data

should yield (xterm -u8, LC_CTYPE=en_US.utf8 works as well)

uni.out

Mit freundlichem Gruß / With kind regards
Holger Klawitter
--
lists <at> klawitter <dot> de
Holger Klawitter <lists@klawitter.de> writes:
> In order to avoid interaction with gcc, cat and others else I've written a
> new program, reading from a file.

After setting up the test case and duplicating your problem, I realized I was being dense :-( ... this is a well-known issue. Need more caffeine before answering bug reports obviously ...

The problem is that PG's upper() and lower() functions are based on the C library's <ctype.h> functions (toupper() and tolower()), which of course only work for single-byte character sets. So they cannot work on UTF8 data.

There has been some talk of rewriting these functions to use the <wctype.h> API where available, but no one's actually stepped up to the plate and done it. IIRC the main sticking point was figuring out how to get from whatever character encoding the database is using into the wide character set representation the C library wants. There doesn't seem to be a portable way of discovering exactly what the wchar encoding is supposed to be for the current locale setting.

If you're interested in trying to fix this, check the pgsql-hackers archives for the previous discussions. Searching for "wctype" would probably find the relevant threads.

If you just want to get your work done, I'd suggest adopting a single-byte encoding such as Latin1 for the database.

regards, tom lane
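The difference between the two APIs can be sketched in plain C. This is not PostgreSQL's code: the names upper_bytewise and upper_wide and the fixed buffer size are assumptions made for illustration. The wide version also assumes the input is already in the multibyte encoding of the current LC_CTYPE locale, which sidesteps exactly the sticking point described above (a database encoding that differs from the locale's).

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

#define WBUF_LEN 256   /* arbitrary limit for this sketch */

/* <ctype.h> approach: each byte is treated as one character, so the
 * bytes of a multibyte UTF-8 sequence are never seen as a letter. */
void upper_bytewise(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        *s = (char) toupper((unsigned char) *s);
}

/* <wctype.h> approach: decode to wide characters, upper-case each one,
 * encode back. Returns 0 on success, -1 on invalid input or if a
 * buffer is too small. The caller must have set LC_CTYPE beforehand. */
int upper_wide(const char *src, char *dst, size_t dstlen)
{
    wchar_t wbuf[WBUF_LEN];
    size_t n, m;

    assert(src != NULL && dst != NULL);
    n = mbstowcs(wbuf, src, WBUF_LEN);
    if (n == (size_t) -1 || n >= WBUF_LEN)
        return -1;                    /* invalid sequence or too long */
    for (size_t i = 0; i < n; i++)
        wbuf[i] = (wchar_t) towupper((wint_t) wbuf[i]);
    m = wcstombs(dst, wbuf, dstlen);
    if (m == (size_t) -1 || m >= dstlen)
        return -1;                    /* unrepresentable or no room for NUL */
    return 0;
}
```

A caller would run setlocale(LC_CTYPE, "") first, as in the test program earlier in the thread. Under a UTF-8 locale the wide path sees 'ä' as one character and can map it, while the byte-wise path only ever looks at the two bytes \303 and \244 separately.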
On Tue, Apr 13, 2004 at 12:32:17PM -0400, Tom Lane wrote:
> Holger Klawitter <lists@klawitter.de> writes:
> > In order to avoid interaction with gcc, cat and others else I've written a
> > new program, reading from a file.
>
> After setting up the test case and duplicating your problem, I realized
> I was being dense :-( ... this is a well-known issue. Need more
> caffeine before answering bug reports obviously ...
>
> The problem is that PG's upper() and lower() functions are based on
> the C library's <ctype.h> functions (toupper() and tolower()), which of
> course only work for single-byte character sets. So they cannot work on
> UTF8 data.
>
> There has been some talk of rewriting these functions to use the
> <wctype.h> API where available, but no one's actually stepped up to the
> plate and done it. IIRC the main sticking point was figuring out how to
> get from whatever character encoding the database is using into the wide
> character set representation the C library wants. There doesn't seem to
> be a portable way of discovering exactly what the wchar encoding is
> supposed to be for the current locale setting.

There is "libcharset", the portable character set determination library. But maintaining this library, with its large amount of OS-dependent code, is probably not simple. It's used in standard iconv.

http://www.haible.de/bruno/packages-libcharset.html

But I'm not sure whether it resolves anything, because there is no guarantee of any connection between the current locale setting and the string encoding.

SELECT upper( convert('foo', 'X', 'Y') );

IMHO the solution is to add to "struct varlena" a pointer to a pg_encname, which knows how to handle PostgreSQL encoding information, and so make each PostgreSQL string independent and self-describing. Or is there some reason why this would be useless?

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
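The idea of a self-describing string can be illustrated with a toy sketch. Everything here is hypothetical: the type tagged_text, the enum enc_id and the function tagged_upper are invented for this example and do not reproduce PostgreSQL's struct varlena or pg_encname. The case mapping is hand-written for ASCII and the Latin-1 letter range only, just enough to show that upper() can pick the right interpretation from the tag carried by the string, regardless of the process locale.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding tag carried by every string. */
typedef enum { ENC_LATIN1, ENC_UTF8 } enc_id;

/* Hypothetical self-describing string: the data plus its encoding. */
typedef struct {
    enc_id         encoding;
    size_t         len;
    unsigned char *data;
} tagged_text;

/* Upper-case in place. Handles a-z and the letters U+00E0..U+00FE
 * (except U+00F7, the division sign), whose capitals sit 0x20 lower.
 * All other characters are left untouched. */
void tagged_upper(tagged_text *t)
{
    assert(t != NULL && t->data != NULL);
    for (size_t i = 0; i < t->len; i++) {
        unsigned char c = t->data[i];

        if (c < 0x80) {
            if (c >= 'a' && c <= 'z')
                t->data[i] = (unsigned char) (c - 0x20);
        } else if (t->encoding == ENC_LATIN1) {
            /* One byte per character: 0xE4 (ä) becomes 0xC4 (Ä). */
            if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
                t->data[i] = (unsigned char) (c - 0x20);
        } else if (c == 0xC3 && i + 1 < t->len) {
            /* UTF-8: U+00C0..U+00FF are encoded as 0xC3 plus one byte. */
            unsigned char d = t->data[i + 1];

            if (d >= 0xA0 && d <= 0xBE && d != 0xB7)
                t->data[i + 1] = (unsigned char) (d - 0x20);
            i++;                      /* skip the continuation byte */
        }
    }
}
```

The same letter 'ä' is stored as \344 in a Latin-1 string and as \303\244 in a UTF-8 string, and both are upper-cased correctly because the function consults the string's own tag rather than the locale. A real implementation would need full case tables and many more encodings.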