On Tue, Apr 13, 2004 at 12:32:17PM -0400, Tom Lane wrote:
> Holger Klawitter <lists@klawitter.de> writes:
> > In order to avoid interaction with gcc, cat and others else I've written a
> > new program, reading from a file.
>
> After setting up the test case and duplicating your problem, I realized
> I was being dense :-( ... this is a well-known issue. Need more
> caffeine before answering bug reports obviously ...
>
> The problem is that PG's upper() and lower() functions are based on
> the C library's <ctype.h> functions (toupper() and tolower()), which of
> course only work for single-byte character sets. So they cannot work on
> UTF8 data.
>
> There has been some talk of rewriting these functions to use the
> <wctype.h> API where available, but no one's actually stepped up to the
> plate and done it. IIRC the main sticking point was figuring out how to
> get from whatever character encoding the database is using into the wide
> character set representation the C library wants. There doesn't seem to
> be a portable way of discovering exactly what the wchar encoding is
> supposed to be for the current locale setting.
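
To make that dependency concrete, here is a rough, untested sketch of what a
<wctype.h> based upper() would have to do. Note that it simply trusts that the
input bytes are in whatever encoding the current LC_CTYPE implies, which is
exactly the assumption PostgreSQL cannot safely make:

    #include <stdlib.h>
    #include <wchar.h>
    #include <wctype.h>

    /*
     * Sketch of upcasing via the wide-character API: convert the
     * multibyte string to wchar_t, apply towupper() per character,
     * convert back.  Assumes the bytes match the libc locale.
     */
    static char *
    upper_via_wchar(const char *src)
    {
        size_t   nchars = mbstowcs(NULL, src, 0);   /* count wide chars */
        wchar_t *wbuf;
        char    *result;
        size_t   i;

        if (nchars == (size_t) -1)
            return NULL;                            /* invalid multibyte data */

        wbuf = malloc((nchars + 1) * sizeof(wchar_t));
        mbstowcs(wbuf, src, nchars + 1);

        for (i = 0; i < nchars; i++)
            wbuf[i] = towupper(wbuf[i]);

        result = malloc(MB_CUR_MAX * nchars + 1);
        wcstombs(result, wbuf, MB_CUR_MAX * nchars + 1);
        free(wbuf);
        return result;
    }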
There is libcharset, the "portable character set determination library",
but maintaining a library with that much OS-dependent code is probably not
simple. It is used by standard iconv.
http://www.haible.de/bruno/packages-libcharset.html
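
For what it's worth, the libcharset API itself is tiny; assuming its
<localcharset.h> header, discovering the charset of the current locale looks
roughly like this:

    #include <stdio.h>
    #include <locale.h>
    #include <localcharset.h>   /* from libcharset */

    int
    main(void)
    {
        /*
         * Take the locale from the environment, then ask libcharset
         * which character set that locale implies.
         */
        setlocale(LC_ALL, "");
        printf("current locale charset: %s\n", locale_charset());
        return 0;
    }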
But I'm not sure it resolves anything, because there is no guarantee of any
connection between the current locale setting and the encoding of a given
string. Consider, for example:

    SELECT upper( convert('foo', 'X', 'Y') );
IMHO the solution is to add to "struct varlena" a pointer to pg_encname,
which knows how to handle PostgreSQL encoding information, and so make each
PostgreSQL string independent and self-described. Or is there some reason
why this would be useless?
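
To make the idea a bit more concrete, I mean roughly something like this
(only a sketch; the names and layout are invented here, not an actual patch):

    #include <stdint.h>

    /* only a sketch; pg_encname is PostgreSQL's encoding-name structure */
    struct pg_encname;

    typedef struct encvarlena
    {
        int32_t             vl_len;     /* total length, as in struct varlena */
        struct pg_encname  *vl_enc;     /* which encoding vl_dat is in */
        char                vl_dat[1];  /* the string bytes themselves */
    } encvarlena;

Every string operation could then look at vl_enc instead of guessing from
the database-wide setting or the locale.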
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/