
lower and upper not UTF-8 safe - Mailing list pgsql-hackers

From	Julian Satchell
Subject	lower and upper not UTF-8 safe
Date	August 4, 2003 17:50:03
Msg-id	1060004637.28875.3215.camel@jsatchell.eris.qinetiq.com
Responses	Re: lower and upper not UTF-8 safe
List	pgsql-hackers


The implementations of lower and upper in
src/backend/utils/adt/oracle_compat.c use the single-byte macros from
ctype.h to alter individual bytes in the text string.

If the text is UTF-8 encoded this is wrong: the bytes of a multibyte
character get altered one at a time, which can result in a string that
is no longer valid UTF-8.
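
To make the failure concrete, here is a minimal sketch. It is not the code
from oracle_compat.c. It assumes a single-byte locale that behaves like
ISO-8859-1, simulated by the hypothetical helper latin1_tolower so that the
result does not depend on which locales are installed. bytewise_lower and
is_valid_utf8 are likewise illustrative names, not PostgreSQL functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for tolower() under an ISO-8859-1 locale.
 * This is an assumption; the real result depends on LC_CTYPE. */
static unsigned char latin1_tolower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char) (c + 0x20);
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return (unsigned char) (c + 0x20);
    return c;
}

/* Byte-at-a-time lowercasing: every byte is treated as a character. */
static void bytewise_lower(unsigned char *s, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++)
        s[i] = latin1_tolower(s[i]);
}

/* Structural UTF-8 check: each lead byte must be followed by the right
 * number of continuation bytes. Overlong forms and surrogates are not
 * rejected, so this is looser than a full validator. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;

        if (c < 0x80)
            need = 0;
        else if (c >= 0xC2 && c <= 0xDF)
            need = 1;
        else if (c >= 0xE0 && c <= 0xEF)
            need = 2;
        else if (c >= 0xF0 && c <= 0xF4)
            need = 3;
        else
            return 0;           /* stray continuation or invalid lead byte */

        if (need > len - i - 1)
            return 0;           /* sequence runs past the end of the string */
        for (size_t k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
        i += need + 1;
    }
    return 1;
}
```

Given the two bytes of U+00C9 (0xC3 0x89), bytewise_lower turns the lead
byte into 0xE3, which announces a three-byte sequence that never arrives,
so is_valid_utf8 rejects the result.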

The code is basically unchanged in both 7.3.4 and CVS tip.

I can see two options: remove access to these functions if the database
is running UNICODE, or rewrite/extend them so the correct thing happens.
The easiest way to do the latter is probably to convert the UTF-8 to a
fixed-width encoding (say UCS-4), perform the lower operation to get a
new set of code points, then convert back to UTF-8. The byte length of
the output might even be different from the input, although I don't know
of an example where this happens.
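
The decode/map/encode idea can be sketched roughly as follows. This is only
an illustration under stated assumptions: the input is assumed to be valid
UTF-8, all function names are hypothetical, and toy_lower is a deliberately
tiny mapping (ASCII, the Latin-1 capitals, and U+212A KELVIN SIGN) standing
in for a real Unicode case table. The Kelvin sign is included because its
lowercase is plain 'k', so three input bytes become one output byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence (assumed valid) into *cp; return bytes used. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if (s[0] < 0xE0) {
        *cp = ((uint32_t) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if (s[0] < 0xF0) {
        *cp = ((uint32_t) (s[0] & 0x0F) << 12)
            | ((uint32_t) (s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t) (s[0] & 0x07) << 18) | ((uint32_t) (s[1] & 0x3F) << 12)
        | ((uint32_t) (s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* Encode one code point as UTF-8 into out; return bytes written. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}

/* Toy lowercase mapping (an assumption, far from complete). */
static uint32_t toy_lower(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7)
        return cp + 0x20;
    if (cp == 0x212A)           /* KELVIN SIGN lowercases to 'k' */
        return 'k';
    return cp;
}

/* Lowercase in[0..len) into out and return the output length in bytes.
 * With toy_lower no mapping grows, so out needs at least len bytes; a
 * general mapping would need up to 4 bytes per code point. */
static size_t utf8_lower(const unsigned char *in, size_t len,
                         unsigned char *out)
{
    assert(in != NULL && out != NULL);
    size_t i = 0, o = 0;

    while (i < len) {
        uint32_t cp;

        i += utf8_decode(in + i, &cp);
        o += utf8_encode(toy_lower(cp), out + o);
    }
    return o;
}
```

Because the mapping works on whole code points, the output stays valid
UTF-8 even when its byte length differs from the input, which is why the
caller has to use the returned length rather than assume the input length.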

At the very least, the documentation for lower and upper in the manual
should warn the user not to use them in a UNICODE database.

-- 
Julian Satchell <j.satchell@eris.qinetiq.com>
QinetiQ

