Re: tolower() identifier downcasing versus multibyte encodings - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: tolower() identifier downcasing versus multibyte encodings
Date
Msg-id 201109060218.p862IOZ23903@momjian.us
In response to tolower() identifier downcasing versus multibyte encodings  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers

Did we ever address this?

---------------------------------------------------------------------------

Tom Lane wrote:
> I've been able to reproduce the behavior described here:
> http://archives.postgresql.org/pgsql-general/2011-03/msg00538.php
> It's specific to UTF8 locales on Mac OS X.  I'm not sure if the
> problem can manifest anywhere else; considering that OS X's UTF8
> locales have a general reputation of being broken, it may only
> happen on that platform.
> 
> What is happening is that downcase_truncate_identifier() tries to
> downcase identifiers like this:
> 
>         unsigned char ch = (unsigned char) ident[i];
> 
>         if (ch >= 'A' && ch <= 'Z')
>             ch += 'a' - 'A';
>         else if (IS_HIGHBIT_SET(ch) && isupper(ch))
>             ch = tolower(ch);
>         result[i] = (char) ch;
> 
> This is of course incapable of successfully downcasing any multibyte
> characters, but there's an assumption that isupper() won't return TRUE
> for a character fragment in a multibyte locale.  However, on OS X
> it seems that that's not the case :-(.  For the particular example
> cited by Francisco Figueiredo, I see the byte sequence \303\251
> converted to \343\251, because isupper() returns TRUE for \303 and
> then tolower() returns \343.  The byte \251 is not changed, but the
> damage is already done: we now have an invalidly-encoded string.
> 
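To make the failure concrete, here is a rough standalone sketch of the corruption described above. It deliberately does not call the platform's isupper()/tolower(): fake_isupper() and fake_tolower() are hypothetical stand-ins that assume a Latin-1-style case mapping for high-bit bytes, which is only an assumed model of the OS X behavior, not something confirmed in this thread. downcase_bytewise() and utf8_is_well_formed() are likewise illustrative helpers, not PostgreSQL functions.

```c
#include <stdbool.h>
#include <stddef.h>

#define IS_HIGHBIT_SET(ch) ((unsigned char) (ch) & 0x80)

/*
 * ASSUMPTION: stand-ins for a locale whose isupper()/tolower() treat
 * high-bit bytes as Latin-1-style letters.  Not the real libc functions.
 */
static bool
fake_isupper(unsigned char ch)
{
    return ch >= 0xC0 && ch <= 0xDE && ch != 0xD7;
}

static unsigned char
fake_tolower(unsigned char ch)
{
    return fake_isupper(ch) ? (unsigned char) (ch + 0x20) : ch;
}

/* Byte-at-a-time downcasing, same shape as the quoted loop. */
static void
downcase_bytewise(const char *ident, size_t len, char *result)
{
    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        else if (IS_HIGHBIT_SET(ch) && fake_isupper(ch))
            ch = fake_tolower(ch);
        result[i] = (char) ch;
    }
    result[len] = '\0';
}

/*
 * Minimal structural UTF-8 check: each lead byte must be followed by the
 * number of continuation bytes it announces.  Overlong forms and
 * surrogates are not rejected; this is only enough for the demonstration.
 */
static bool
utf8_is_well_formed(const char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char c = (unsigned char) s[i];
        size_t need;

        if (c < 0x80)
            need = 0;
        else if ((c & 0xE0) == 0xC0)
            need = 1;
        else if ((c & 0xF0) == 0xE0)
            need = 2;
        else if ((c & 0xF8) == 0xF0)
            need = 3;
        else
            return false;       /* stray continuation or invalid lead byte */

        if (i + need >= len)
            return false;       /* sequence runs past the end of the string */
        for (size_t k = 1; k <= need; k++)
            if (((unsigned char) s[i + k] & 0xC0) != 0x80)
                return false;
        i += need + 1;
    }
    return true;
}
```

Feeding the two bytes \303\251 through downcase_bytewise() gives \343\251 under these assumptions. Since \343 announces a three-byte sequence and only one continuation byte follows, utf8_is_well_formed() accepts the input and rejects the output.
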
> It looks like the blame for the subsequent "disappearance" of the bogus
> data lies with fprintf back on the client side; that surprises me a bit
> because I'd only heard of glibc being so cavalier with data it thought
> was invalidly encoded.  But anyway, the origin of the problem is in the
> downcasing transformation.
> 
> We could possibly fix this by not attempting the downcasing
> transformation on high-bit-set characters unless the encoding is
> single-byte.  However, we have the exact same downcasing logic embedded
> in the functions in src/port/pgstrcasecmp.c, and those don't have any
> convenient way of knowing what the prevailing encoding is --- when
> compiled for frontend use, they can't use pg_database_encoding_max_length.
> 
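A minimal sketch of that first option, assuming the caller can say whether the encoding is single-byte. downcase_guarded() and its enc_is_single_byte parameter are hypothetical names used only for illustration; in the backend the flag could plausibly be derived from pg_database_encoding_max_length() == 1, while the frontend copies in src/port/pgstrcasecmp.c have no such source, which is exactly the difficulty raised above.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

#define IS_HIGHBIT_SET(ch) ((unsigned char) (ch) & 0x80)

/*
 * Hypothetical variant of the quoted loop.  ASCII letters are always
 * folded; high-bit bytes are handed to isupper()/tolower() only when the
 * caller asserts the encoding is single-byte (enc_is_single_byte is an
 * assumed parameter, not an existing PostgreSQL API).
 */
static void
downcase_guarded(const char *ident, size_t len,
                 bool enc_is_single_byte, char *result)
{
    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
            ch = (unsigned char) tolower(ch);
        result[i] = (char) ch;
    }
    result[len] = '\0';
}
```

With enc_is_single_byte false, the locale functions are never consulted for high-bit bytes, so multibyte sequences pass through untouched regardless of what the platform's isupper() claims. Always passing false amounts to ASCII-only downcasing.
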
> Or we could bite the bullet and start using str_tolower(), but the
> performance implications of that are unpleasant; not to mention that
> we really don't want to re-introduce the "Turkish problem" with
> unexpected handling of i/I in identifiers.
> 
> Or we could go the other way and stop downcasing non-ASCII letters
> altogether.
> 
> None of these options seem terribly attractive.  Thoughts?
> 
>             regards, tom lane

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

