Thread: lower and upper not UTF-8 safe

lower and upper not UTF-8 safe

From
Julian Satchell
Date:
The implementations of lower and upper in
src/backend/utils/adt/oracle_compat.c use the single byte macros from
ctype.h to alter individual bytes in the text string. 

If the text is UTF-8 encoded this is totally wrong, and will result in
an invalid string that is no longer UTF-8.

The code is basically unchanged in both 7.3.4 and CVS tip.

I can see two options - remove access to these functions if the database
is running UNICODE, or rewrite/extend them so the correct thing happens.
The easiest way to do this is probably to convert the UTF-8 to a fixed
width encoding (say UCS-4),  perform the lower operation to get a new
set of character indices, then convert back to UTF-8. The byte length of
the output might even be different from the input, although I don't know
of an example where this happens. 
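A minimal sketch of the decode / map / re-encode approach described above, written as editorial illustration rather than as code from oracle_compat.c. All function names are hypothetical, and toy_lower() is a deliberately tiny stand-in for a real Unicode case-mapping table. It includes two code points whose lowercase form has a different UTF-8 byte length: U+212A KELVIN SIGN (3 bytes) lowercases to "k" (1 byte), and U+023A (2 bytes) lowercases to U+2C65 (3 bytes).

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 on malformed input.
 * Structural check only: overlong forms and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;                      /* stray continuation or bad lead byte */

    if (n < len) return 0;              /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Encode cp as UTF-8 into out (room for 4 bytes). Returns bytes written. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical toy mapping: ASCII, the Latin-1 Supplement capitals,
 * and two code points whose lowercase has a different UTF-8 length. */
static uint32_t toy_lower(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z') return cp + 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 0x20;
    if (cp == 0x212A) return 'k';       /* KELVIN SIGN: 3 bytes -> 1 byte */
    if (cp == 0x023A) return 0x2C65;    /* 2 bytes -> 3 bytes */
    return cp;
}

/* Lowercase srclen bytes of UTF-8 from src into dst (capacity dstcap).
 * Returns bytes written, or (size_t) -1 on malformed input or if dst
 * is too small. The output length may differ from srclen. */
size_t utf8_lower(const unsigned char *src, size_t srclen,
                  unsigned char *dst, size_t dstcap)
{
    size_t in = 0, out = 0;

    while (in < srclen) {
        uint32_t cp;
        unsigned char buf[4];
        size_t need, i;
        size_t used = utf8_decode(src + in, srclen - in, &cp);

        if (used == 0) return (size_t) -1;
        need = utf8_encode(toy_lower(cp), buf);
        if (dstcap - out < need) return (size_t) -1;
        for (i = 0; i < need; i++) dst[out++] = buf[i];
        in += used;
    }
    return out;
}
```

Because the output can be longer than the input, a caller cannot lowercase in place; allocating four bytes per input code point is a safe upper bound for this sketch.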

At the very least, the documentation for lower and upper in the manual
should warn the user not to use them in a UNICODE database.

-- 
Julian Satchell <j.satchell@eris.qinetiq.com>
QinetiQ



Re: lower and upper not UTF-8 safe

From
Tom Lane
Date:
Julian Satchell <j.satchell@eris.qinetiq.com> writes:
> The implementations of lower and upper in
> src/backend/utils/adt/oracle_compat.c use the single byte macros from
> ctype.h to alter individual bytes in the text string. 

> If the text is UTF-8 encoded this is totally wrong, and will result in
> an invalid string that is no longer UTF-8.

Only if you use a locale that is assuming a character set that is not
UTF8 but does have characters with the high bit set.  I'm not sure that
we can do anything to defend against locale/charset mismatch.
        regards, tom lane
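
The mismatch case described above can be illustrated without depending on which locales are installed. In the sketch below, latin1_tolower() is a hypothetical stand-in for what tolower() does under an ISO-8859-1 locale (in the plain C locale only A-Z are affected, so bytes with the high bit set pass through unchanged). Applied byte by byte to the UTF-8 encoding of U+00C9 (bytes C3 89), it turns the lead byte C3 into E3, which announces a three-byte sequence that is not there. The function names are assumptions for illustration, not PostgreSQL code.

```c
#include <stddef.h>

/* Hypothetical stand-in for tolower() under an ISO-8859-1 locale:
 * A-Z and the capitals 0xC0..0xDE (except 0xD7, the multiplication
 * sign) map to the byte 0x20 higher. */
static unsigned char latin1_tolower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z') return (unsigned char) (c + 0x20);
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return (unsigned char) (c + 0x20);
    return c;
}

/* Byte-wise lowercasing, in the style of a ctype.h loop over the string. */
void bytewise_lower(unsigned char *s, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        s[i] = latin1_tolower(s[i]);
}

/* Structural UTF-8 check: lead bytes must be followed by the right
 * number of continuation bytes. Overlong forms and surrogates are
 * not rejected; this is only enough to show the breakage. */
int utf8_is_wellformed(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        size_t len, k;

        if (s[i] < 0x80) len = 1;
        else if ((s[i] & 0xE0) == 0xC0) len = 2;
        else if ((s[i] & 0xF0) == 0xE0) len = 3;
        else if ((s[i] & 0xF8) == 0xF0) len = 4;
        else return 0;

        if (n - i < len) return 0;
        for (k = 1; k < len; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        i += len;
    }
    return 1;
}
```

Pure ASCII input is unaffected either way, which is why the problem only shows up when the locale's character set has cased characters above 0x7F.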


Re: lower and upper not UTF-8 safe

From
Karel Zak
Date:
On Mon, Aug 04, 2003 at 05:03:02PM -0400, Tom Lane wrote:
> Julian Satchell <j.satchell@eris.qinetiq.com> writes:
> > The implementations of lower and upper in
> > src/backend/utils/adt/oracle_compat.c use the single byte macros from
> > ctype.h to alter individual bytes in the text string. 
> 
> > If the text is UTF-8 encoded this is totally wrong, and will result in
> > an invalid string that is no longer UTF-8.
> 
> Only if you use a locale that is assuming a character set that is not
> UTF8 but does have characters with the high bit set.  I'm not sure that
> we can do anything to defend against locale/charset mismatch.

We can try detect typical locale charset and compare it with actual
charset used in DB and send NOTICE to FE if it's mismatched. The problem
is portability of charset detection code, because there is differences
between OS. The best it's if libc support nl_langinfo(CODESET) call. The
complete code of charset detection you can found in libcharset or glib (I
use simplification of these codes and it's 300 lines :-).
 
   Karel


-- 
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
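
A rough sketch of the nl_langinfo(CODESET) check suggested above, for systems whose libc provides it (it is a POSIX call, not ISO C). The function names are hypothetical. The sketch assumes the caller has already set LC_CTYPE with setlocale(), and that the database encoding is passed in under a name comparable to the locale's codeset name. Real code would likely need a table mapping the database's encoding names onto the codeset spellings each platform reports, which is presumably where much of the portability effort goes.

```c
#include <langinfo.h>
#include <stddef.h>

/* ASCII-only lowercasing, so the comparison itself does not depend
 * on the locale being inspected. */
static char ascii_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
}

/* Compare two charset names, ignoring ASCII case and any '-' or '_',
 * so that "UTF-8" and "utf8" are treated as the same name.
 * Returns 1 if they match, 0 otherwise. */
int charset_names_equal(const char *a, const char *b)
{
    for (;;) {
        while (*a == '-' || *a == '_') a++;
        while (*b == '-' || *b == '_') b++;
        if (ascii_lower(*a) != ascii_lower(*b)) return 0;
        if (*a == '\0') return 1;
        a++;
        b++;
    }
}

/* Returns 1 if the current LC_CTYPE codeset matches db_encoding,
 * 0 if it does not, and -1 if the codeset cannot be determined. */
int locale_matches_encoding(const char *db_encoding)
{
    const char *codeset = nl_langinfo(CODESET);

    if (codeset == NULL || *codeset == '\0')
        return -1;
    return charset_names_equal(codeset, db_encoding);
}
```

A result of 0 is the case where a NOTICE could be sent; a result of -1 should probably be treated as "unknown" rather than as a mismatch.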


Re: lower and upper not UTF-8 safe

From
Tom Lane
Date:
Karel Zak <zakkr@zf.jcu.cz> writes:
> On Mon, Aug 04, 2003 at 05:03:02PM -0400, Tom Lane wrote:
>> Only if you use a locale that is assuming a character set that is not
>> UTF8 but does have characters with the high bit set.  I'm not sure that
>> we can do anything to defend against locale/charset mismatch.

>  We can try detect typical locale charset and compare it with actual
>  charset used in DB and send NOTICE to FE if it's mismatched. The problem 
>  is portability of charset detection code, because there is differences 
>  between OS.

Yeah.  If we had a portable, reliable way of testing for incompatibility,
I'd be in favor of just forbidding creation of databases that have
encoding choices incompatible with the server's LC_COLLATE/LC_CTYPE
settings.  (If we ever allow those settings to be more dynamic than they
are, then the test would have to be made somewhere else, but for now it'd
be sufficient to put it in CREATE DATABASE.)

But I don't see a portable way to find out what charset a locale
supports.  nl_langinfo() isn't in the C standard at all.
        regards, tom lane