Re: Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding
Date
Msg-id 19394.1370792358@sss.pgh.pa.us
In response to Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding  (Andrew Dunstan <andrew@dunslane.net>)
Responses Re: Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
Andrew Dunstan <andrew@dunslane.net> writes:
> On 06/09/2013 12:38 AM, Noah Misch wrote:
>> PostgreSQL has lived with this wrong behavior since ... the beginning?  It's a
>> problem, certainly, but a bandage fix brings its own trouble.

I don't see this as particularly bandage-y.  It's a subset of the
spec-required folding behavior, sure, but at least now it's a proper
subset of that behavior.  It preserves all cases in which the previous
coding did the right thing, while removing some cases in which it
didn't.

> If you have a better fix I am all ears. I can recall at least one 
> discussion of this area (concerning Turkish I quite a few years ago) 
> where we failed to come up with anything.

Yeah, Turkish handling of i/I messes up any attempt to use the locale's
case-folding rules straightforwardly.  However, I think we've already
fixed that with the rule that ASCII characters are folded manually.
The resistance to moving this code to use towlower() for non-ASCII
mainly comes from worries about speed, I think; although there was also
something about downcasing conversions that change the string's byte
length being problematic for some callers.
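
A rough illustration of the "ASCII characters are folded manually" rule, written as a Python model rather than taken from the PostgreSQL source. The function name fold_ascii_only is hypothetical. It downcases only the bytes A-Z and leaves every other byte alone, so the result does not depend on the locale (which sidesteps the Turkish i/I problem) and the byte length never changes.

```python
def fold_ascii_only(raw: bytes) -> bytes:
    """Downcase only ASCII A-Z, byte by byte; pass every other byte through.

    Hypothetical model of locale-independent ASCII folding, not the
    actual PostgreSQL implementation.
    """
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in raw)


# ASCII letters fold the same way under any locale, including Turkish.
ascii_result = fold_ascii_only(b"TITLE")

# Non-ASCII bytes of a UTF-8 string are untouched, so the output is
# still valid UTF-8 and has the same byte length as the input.
utf8_input = "ÉCOLE".encode("utf-8")
utf8_result = fold_ascii_only(utf8_input)
```

This is only a subset of full case folding: the É above stays upper case. That is the trade-off being discussed, since a towlower()-based approach would fold it but could cost speed and change the byte length.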

> I have a fairly hard time believing in your "relies on this and somehow 
> works" scenario.

The key point for me is that if tolower() actually does anything in the
previous state of the code, it's more than likely going to produce
invalidly encoded data.  The consequences of that can't be good.
You can argue that there might be people out there for whom the
transformation accidentally produced a validly-encoded string, but how
likely is that really?  It seems much more likely that the only reason
we've not had more complaints is that on most popular platforms, the
code accidentally fails to fire on any UTF8 characters (or any common
ones, anyway).  On those platforms, there will be no change of behavior.
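
To make the invalid-encoding concern concrete, here is a small Python sketch. It assumes a tolower() that treats each byte as a Latin-1 character, which is a hypothetical stand-in: the real result depends on the platform and locale. The names latin1_style_tolower, bytewise_downcase and is_valid_utf8 are illustrative only.

```python
def latin1_style_tolower(b: int) -> int:
    """Hypothetical per-byte tolower() as a Latin-1-like locale might do it."""
    if 0x41 <= b <= 0x5A:                  # ASCII A-Z
        return b + 0x20
    if 0xC0 <= b <= 0xDE and b != 0xD7:    # Latin-1 upper case, except the multiplication sign
        return b + 0x20
    return b


def bytewise_downcase(raw: bytes) -> bytes:
    """Apply the per-byte mapping to every byte, ignoring the encoding."""
    return bytes(latin1_style_tolower(b) for b in raw)


def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


original = "É".encode("utf-8")           # two bytes: 0xC3 0x89
mangled = bytewise_downcase(original)    # lead byte 0xC3 becomes 0xE3
```

Under this assumed mapping the lead byte 0xC3 turns into 0xE3, which announces a three-byte sequence that never arrives, so the result is no longer valid UTF-8. On a platform whose tolower() leaves high-bit bytes alone, the mapping would be the identity for these bytes and nothing would change.
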
        regards, tom lane


