Thread: verifying unicode locale support

verifying unicode locale support

From
Holger Klawitter
Date:
Hi there,

Triggered by the recent questions about sorting, I started digging into my
problem that upper('ä') = 'ä' when LC_CTYPE and LANG are set to de_DE.UTF-8.

I have checked with Java (toUpperCase()) and C (see attached program, might
help others) that my locale is working, but postgres (initdb and postmaster
running with LANG=de_DE.utf8, -E UNICODE) still insists that upper('ä')
equals 'ä'. What else can be wrong?

Mit freundlichem Gruß / With kind regards
    Holger Klawitter
--
lists <at> klawitter <dot> de

------snip------
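#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>   /* declares towupper() */

int main(void)
{
    /* Adopt the locale from the environment (LANG, LC_CTYPE, LC_ALL). */
    if (!setlocale(LC_CTYPE, "")) {
        fprintf(stderr, "Can't set the specified locale! "
                        "Check LANG, LC_CTYPE, LC_ALL.\n");
        return 1;
    }
    const wchar_t *text = L"ä";
    /* Cast for %x: wchar_t and wint_t need not be unsigned int. */
    printf("is: towupper(%x) = %x\n",
           (unsigned)text[0], (unsigned)towupper((wint_t)text[0]));
    return 0;
}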


Re: verifying unicode locale support

From
Tom Lane
Date:
Holger Klawitter <lists@klawitter.de> writes:
> I have checked with Java (toUpperCase()) and C (see attached program, might
> help others) that my locale is working, but postgres (initdb and postmaster
> running with LANG=de_DE.utf8, -E UNICODE) still insists that upper('ä')
> equals 'ä'. What else can be wrong?

What byte string are you really entering here?  What's coming through in
your email is \344 ... which is not valid UTF8.  But I suspect something
may have translated it before it got to my inbox.

            regards, tom lane

Re: verifying unicode locale support

From
Peter Eisentraut
Date:
Holger Klawitter wrote:
> I have checked with Java (toUpperCase()) and C (see attached program,
> might help others) that my locale is working, but postgres (initdb
> and postmaster running with LANG=de_DE.utf8, -E UNICODE) still
> insists that upper('ä') equals 'ä'. What else can be wrong?

PostgreSQL, case conversion, and Unicode don't work together.  Pick any
two. :-)


Re: verifying unicode locale support

From
Holger Klawitter
Date:
> What byte string are you really entering here?  What's coming through in
> your email is \344 ... which is not valid UTF8.  But I suspect something
> may have translated it before it got to my inbox.

Damn charsets :-) The character was indeed \344, aka "ä", but my mailer
sends Latin-1, not Unicode.

In order to avoid interaction with gcc, cat and others, I've written a new
program that reads from a file.
    gcc -o unicode unicode.c
    LC_CTYPE=de_DE.utf8 ./unicode uni.data
should yield (xterm -u8, LC_CTYPE=en_US.utf8 works as well)
    uni.out

Mit freundlichem Gruß / With kind regards
    Holger Klawitter
--
lists <at> klawitter <dot> de



Re: verifying unicode locale support

From
Tom Lane
Date:
Holger Klawitter <lists@klawitter.de> writes:
> In order to avoid interaction with gcc, cat and others, I've written a
> new program that reads from a file.

After setting up the test case and duplicating your problem, I realized
I was being dense :-( ... this is a well-known issue.  Need more
caffeine before answering bug reports obviously ...

The problem is that PG's upper() and lower() functions are based on
the C library's <ctype.h> functions (toupper() and tolower()), which of
course only work for single-byte character sets.  So they cannot work on
UTF8 data.
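
A minimal sketch of why a byte-at-a-time approach falls short. This is not
PostgreSQL's actual code; bytewise_upper() is a hypothetical helper that only
mimics what a <ctype.h>-based upper() does with each byte.

```c
#include <ctype.h>
#include <stddef.h>

/* Uppercase a string one byte at a time, the way a <ctype.h>-based
 * upper() would. Every byte is treated as a complete character. */
void bytewise_upper(char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}
```

In UTF-8, "ä" is the two bytes 0xC3 0xA4. Neither byte is a lowercase letter
on its own, so calling bytewise_upper() on "\xc3\xa4" "bc" uppercases only the
"bc" part. In a Latin-1 locale the same character is the single byte 0xE4,
which toupper() can map to 0xC4.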

There has been some talk of rewriting these functions to use the
<wctype.h> API where available, but no one's actually stepped up to the
plate and done it.  IIRC the main sticking point was figuring out how to
get from whatever character encoding the database is using into the wide
character set representation the C library wants.  There doesn't seem to
be a portable way of discovering exactly what the wchar encoding is
supposed to be for the current locale setting.
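
For illustration, a rough sketch of the <wctype.h> route using only standard
C calls. The helper name upper_mb() is made up, and the sketch simply assumes
the sticking point away: it only works if the input string is encoded in
whatever encoding the current LC_CTYPE locale expects.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Uppercase a multibyte string through the wide-character API.
 * ASSUMPTION: `in` is encoded in the current LC_CTYPE locale's encoding
 * (the caller has already called setlocale(LC_CTYPE, "")).
 * Returns 0 on success, -1 on invalid input, truncation or no memory. */
int upper_mb(const char *in, char *out, size_t outsize)
{
    /* A string never has more wide characters than bytes. */
    size_t cap = strlen(in) + 1;
    wchar_t *w = malloc(cap * sizeof *w);
    if (w == NULL)
        return -1;

    size_t n = mbstowcs(w, in, cap);
    if (n == (size_t)-1) {          /* byte sequence invalid in this locale */
        free(w);
        return -1;
    }

    for (size_t i = 0; i < n; i++)
        w[i] = (wchar_t)towupper((wint_t)w[i]);

    size_t m = wcstombs(out, w, outsize);
    free(w);
    if (m == (size_t)-1 || m >= outsize)
        return -1;                  /* unconvertible, or no room for '\0' */
    return 0;
}
```

If the database encoding and the locale's encoding differ, mbstowcs() either
fails or silently decodes the wrong characters, which is exactly the
conversion problem described above.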

If you're interested in trying to fix this, check the pgsql-hackers
archives for the previous discussions.  Searching for "wctype" would
probably find the relevant threads.

If you just want to get your work done, I'd suggest adopting a
single-byte encoding such as Latin1 for the database.

            regards, tom lane

Re: verifying unicode locale support

From
Karel Zak
Date:
On Tue, Apr 13, 2004 at 12:32:17PM -0400, Tom Lane wrote:
> Holger Klawitter <lists@klawitter.de> writes:
> > In order to avoid interaction with gcc, cat and others, I've written a
> > new program that reads from a file.
>
> After setting up the test case and duplicating your problem, I realized
> I was being dense :-( ... this is a well-known issue.  Need more
> caffeine before answering bug reports obviously ...
>
> The problem is that PG's upper() and lower() functions are based on
> the C library's <ctype.h> functions (toupper() and tolower()), which of
> course only work for single-byte character sets.  So they cannot work on
> UTF8 data.
>
> There has been some talk of rewriting these functions to use the
> <wctype.h> API where available, but no one's actually stepped up to the
> plate and done it.  IIRC the main sticking point was figuring out how to
> get from whatever character encoding the database is using into the wide
> character set representation the C library wants.  There doesn't seem to
> be a portable way of discovering exactly what the wchar encoding is
> supposed to be for the current locale setting.

 There is "libcharset - portable character set determination library".
 But maintaining this library, with its large amount of OS-dependent code,
 is probably not simple. It's used in standard iconv.

 http://www.haible.de/bruno/packages-libcharset.html
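
 As a rough sketch of the kind of query such a library wraps: on POSIX
 systems nl_langinfo(CODESET) reports the codeset name of the current
 locale. This is not libcharset's own API; locale_codeset() and
 codeset_matches() are hypothetical helpers, and codeset names are not
 spelled the same way on every platform.

```c
#define _POSIX_C_SOURCE 200809L
#include <langinfo.h>
#include <strings.h>

/* Name of the character encoding used by the current LC_CTYPE locale.
 * POSIX only; the spelling of the name varies between systems. */
const char *locale_codeset(void)
{
    return nl_langinfo(CODESET);
}

/* Non-zero if the locale's codeset name equals `encoding_name`,
 * ignoring case. ASSUMPTION: both sides use the same naming scheme;
 * aliases such as "UTF-8" versus "utf8" are not resolved here. */
int codeset_matches(const char *encoding_name)
{
    return strcasecmp(locale_codeset(), encoding_name) == 0;
}
```

 A caller would first run setlocale(LC_CTYPE, "") and could then compare
 the result against the encoding it believes a string is in.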

 But I'm not sure it resolves anything, because there is no guarantee of
 any connection between the current locale setting and the string
 encoding.

     SELECT upper( convert('foo', 'X', 'Y') );

 IMHO the solution is to add to "struct varlena" a pointer to pg_encname,
 which knows how to handle PostgreSQL encoding information, and so make
 each PostgreSQL string independent and self-describing. Or is there some
 reason why this would be useless?

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/