Thread: lower and upper not UTF-8 safe
The implementations of lower and upper in src/backend/utils/adt/oracle_compat.c use the single byte macros from ctype.h to alter individual bytes in the text string.

If the text is UTF-8 encoded this is totally wrong, and will result in an invalid string that is no longer UTF-8. The code is basically unchanged in both 7.3.4 and CVS tip.

I can see two options - remove access to these functions if the database is running UNICODE, or rewrite/extend them so the correct thing happens.

The easiest way to do this is probably to convert the UTF-8 to a fixed width encoding (say UCS-4), perform the lower operation to get a new set of character indices, then convert back to UTF-8. The byte length of the output might even be different from the input, although I don't know of an example where this happens.

At the very least, the documentation for lower and upper in the manual should warn the user not to use them in a UNICODE database.

--
Julian Satchell <j.satchell@eris.qinetiq.com>
QinetiQ
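The decode, map, re-encode approach proposed above could look roughly like the following. This is only a sketch under stated assumptions, not the PostgreSQL code: the names utf8_decode, utf8_encode, ucs4_lower and utf8_lower are hypothetical, and the case table is deliberately tiny (ASCII, the Latin-1 Supplement capitals, and U+212A KELVIN SIGN). A real implementation would need a complete case table or the platform's wide-character functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *cp.
 * Returns bytes consumed, or 0 if malformed (overlong forms,
 * surrogates and values above U+10FFFF are rejected). */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;
    if (n < len) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* Encode cp as UTF-8 into out (room for 4 bytes); returns bytes written. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) { out[0] = (unsigned char) cp; return 1; }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical, incomplete case table used only for illustration. */
static uint32_t ucs4_lower(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z') return cp + 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 0x20;
    if (cp == 0x212A) return 'k';              /* KELVIN SIGN */
    return cp;
}

/* Lower-case srclen bytes of UTF-8 from src into dst (capacity dstcap).
 * Returns the output length in bytes, or -1 on malformed input or if
 * dst is too small. The output length may differ from srclen. */
long utf8_lower(const char *src, size_t srclen, char *dst, size_t dstcap)
{
    const unsigned char *in = (const unsigned char *) src;
    size_t pos = 0, out = 0;

    while (pos < srclen) {
        uint32_t cp;
        unsigned char buf[4];
        size_t made, i;
        size_t used = utf8_decode(in + pos, srclen - pos, &cp);

        if (used == 0) return -1;
        made = utf8_encode(ucs4_lower(cp), buf);
        if (out + made > dstcap) return -1;
        for (i = 0; i < made; i++) dst[out++] = (char) buf[i];
        pos += used;
    }
    return (long) out;
}
```

With this table, lower-casing the two bytes C3 89 (U+00C9) yields C3 A9 (U+00E9), the same length. The Kelvin sign is one case in the Unicode simple case mappings where the byte length does change: E2 84 AA (three bytes) becomes the single byte 'k'. That is why the sketch takes an explicit output capacity rather than writing in place.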
Julian Satchell <j.satchell@eris.qinetiq.com> writes:
> The implementations of lower and upper in
> src/backend/utils/adt/oracle_compat.c use the single byte macros from
> ctype.h to alter individual bytes in the text string.
> If the text is UTF-8 encoded this is totally wrong, and will result in
> an invalid string that is no longer UTF-8.

Only if you use a locale that is assuming a character set that is not UTF8 but does have characters with the high bit set. I'm not sure that we can do anything to defend against locale/charset mismatch.

regards, tom lane
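The mismatch case described above can be reproduced without depending on which locales are installed. The sketch below is an illustration, not the code in oracle_compat.c: latin1_tolower is a hypothetical stand-in for what tolower() does under an ISO-8859-1 locale, bytewise_lower is the byte-at-a-time pattern under discussion, and utf8_well_formed is a structural check only (it does not reject overlong forms).

```c
#include <stddef.h>

/* Hypothetical stand-in for tolower() in an ISO-8859-1 locale:
 * A-Z and 0xC0-0xDE (except 0xD7, the multiplication sign) have
 * lower-case forms 0x20 higher. */
static unsigned char latin1_tolower(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 0xC0 && c <= 0xDE && c != 0xD7))
        return (unsigned char) (c + 0x20);
    return c;
}

/* Lower-case a buffer one byte at a time, in place. */
void bytewise_lower(unsigned char *s, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        s[i] = latin1_tolower(s[i]);
}

/* Structural UTF-8 check: every lead byte must be followed by the
 * number of continuation bytes it announces. Returns 1 if so, else 0. */
int utf8_well_formed(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        size_t need, k;

        if (s[i] < 0x80) need = 0;
        else if ((s[i] & 0xE0) == 0xC0) need = 1;
        else if ((s[i] & 0xF0) == 0xE0) need = 2;
        else if ((s[i] & 0xF8) == 0xF0) need = 3;
        else return 0;                      /* stray continuation or bad lead */
        if (n - i - 1 < need) return 0;     /* sequence truncated */
        for (k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        i += need + 1;
    }
    return 1;
}
```

Take the UTF-8 bytes C3 89 (U+00C9). Read as Latin-1, 0xC3 is a capital letter, so the byte-wise pass rewrites it to 0xE3 and leaves 0x89 alone. 0xE3 announces a three-byte sequence, but only one continuation byte follows, so the result is no longer valid UTF-8. Pure ASCII input is unaffected, and so is any input when the locale's tolower() only touches ASCII, which matches the point that the damage needs a locale with high-bit characters.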
On Mon, Aug 04, 2003 at 05:03:02PM -0400, Tom Lane wrote:
> Julian Satchell <j.satchell@eris.qinetiq.com> writes:
> > The implementations of lower and upper in
> > src/backend/utils/adt/oracle_compat.c use the single byte macros from
> > ctype.h to alter individual bytes in the text string.
> >
> > If the text is UTF-8 encoded this is totally wrong, and will result in
> > an invalid string that is no longer UTF-8.
>
> Only if you use a locale that is assuming a character set that is not
> UTF8 but does have characters with the high bit set. I'm not sure that
> we can do anything to defend against locale/charset mismatch.

We can try detect typical locale charset and compare it with actual charset used in DB and send NOTICE to FE if it's mismatched. The problem is portability of charset detection code, because there is differences between OS. The best it's if libc support nl_langinfo(CODESET) call. The complete code of charset detection you can found in libcharset or glib (I use simplification of these codes and it's 300 lines :-).

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
Karel Zak <zakkr@zf.jcu.cz> writes:
> On Mon, Aug 04, 2003 at 05:03:02PM -0400, Tom Lane wrote:
>> Only if you use a locale that is assuming a character set that is not
>> UTF8 but does have characters with the high bit set. I'm not sure that
>> we can do anything to defend against locale/charset mismatch.

> We can try detect typical locale charset and compare it with actual
> charset used in DB and send NOTICE to FE if it's mismatched. The problem
> is portability of charset detection code, because there is differences
> between OS.

Yeah. If we had a portable, reliable way of testing for incompatibility, I'd be in favor of just forbidding creation of databases that have encoding choices incompatible with the server's LC_COLLATE/LC_CTYPE settings. (If we ever allow those settings to be more dynamic than they are, then the test would have to be made somewhere else, but for now it'd be sufficient to put it in CREATE DATABASE.)

But I don't see a portable way to find out what charset a locale supports. nl_langinfo() isn't in the C standard at all.

regards, tom lane
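For platforms that do provide nl_langinfo(CODESET), the check discussed in this thread might be sketched as below. It is an illustration under assumptions, not a proposal for the backend: nl_langinfo() comes from POSIX <langinfo.h> rather than ISO C, which is exactly the portability concern raised above, and the function names charset_names_match and locale_matches_encoding are hypothetical. The name comparison only ignores ASCII case and '-' or '_' separators; it does not know aliases such as UNICODE for UTF-8 or LATIN1 for ISO-8859-1, which a real check would need.

```c
#include <langinfo.h>   /* POSIX, not ISO C */
#include <locale.h>
#include <stddef.h>

/* ASCII-only lower-casing, so the comparison itself does not
 * depend on the locale being examined. */
static char ascii_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
}

/* Compare two charset names, ignoring ASCII case and '-' / '_'
 * separators, so "UTF-8" and "utf8" are treated as the same name.
 * Returns 1 on match, 0 otherwise. */
int charset_names_match(const char *a, const char *b)
{
    for (;;) {
        while (*a == '-' || *a == '_') a++;
        while (*b == '-' || *b == '_') b++;
        if (ascii_lower(*a) != ascii_lower(*b)) return 0;
        if (*a == '\0') return 1;           /* both strings ended together */
        a++;
        b++;
    }
}

/* Ask the environment's LC_CTYPE locale for its codeset and compare it
 * with the encoding name the database would use.
 * Returns 1 if they match, 0 if they differ, -1 if the codeset cannot
 * be determined. Note: this changes the process-wide LC_CTYPE setting. */
int locale_matches_encoding(const char *db_encoding)
{
    const char *codeset;

    if (setlocale(LC_CTYPE, "") == NULL) return -1;
    codeset = nl_langinfo(CODESET);
    if (codeset == NULL || *codeset == '\0') return -1;
    return charset_names_match(codeset, db_encoding);
}
```

A caller could treat a return of 0 as grounds for a NOTICE or for rejecting the database creation, and -1 as "cannot tell on this platform". Where nl_langinfo() is missing, the sketch does not compile at all, so a build-time test for the function would be needed.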