Thread: Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding

Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding

From

Robert Haas

Date:

09 June 2013, 00:09:24

On Sat, Jun 8, 2013 at 10:25 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
> Don't downcase non-ascii identifier chars in multi-byte encodings.
>
> Long-standing code has called tolower() on identifier character bytes
> with the high bit set. This is clearly an error and produces junk output
> when the encoding is multi-byte. This patch therefore restricts this
> activity to cases where there is a character with the high bit set AND
> the encoding is single-byte.
>
> There have been numerous gripes about this, most recently from Martin
> Schäfer.
>
> Backpatch to all live releases.

I'm all for changing this, but back-patching seems like a terrible
idea.  This could easily break queries that are working now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding

From

Noah Misch

Date:

09 June 2013, 02:53:08

On Sat, Jun 08, 2013 at 08:09:15PM -0400, Robert Haas wrote:
> On Sat, Jun 8, 2013 at 10:25 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
> > Don't downcase non-ascii identifier chars in multi-byte encodings.
> >
> > Long-standing code has called tolower() on identifier character bytes
> > with the high bit set. This is clearly an error and produces junk output
> > when the encoding is multi-byte. This patch therefore restricts this
> > activity to cases where there is a character with the high bit set AND
> > the encoding is single-byte.
> >
> > There have been numerous gripes about this, most recently from Martin
> > Sch?fer.
> >
> > Backpatch to all live releases.
> 
> I'm all for changing this, but back-patching seems like a terrible
> idea.  This could easily break queries that are working now.

If more than one encoding covers the characters used in a given application,
that application's semantics should be the same regardless of which of those
encodings is in use.  We certainly don't _guarantee_ that today; PostgreSQL
leaves much to libc, which may not implement the relevant locales compatibly.
However, this change bakes into PostgreSQL itself a departure from that
principle.  If a database contains tables "ä" and "Ä", which of those "SELECT
* FROM Ä" finds will be encoding-dependent.  If we're going to improve the
current (granted, worse) downcase_truncate_identifier() behavior, we should
not adopt another specification bearing such surprises.

Let's return to the drawing board on this one.  I would be inclined to keep
the current bad behavior until we implement the i18n-aware case folding
required by SQL.  If I'm alone in thinking that, perhaps switch to downcasing
only ASCII characters regardless of the encoding.  That at least gives
consistent application behavior.

I apologize for not noticing to comment on this week's thread.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com

Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding

From

Andrew Dunstan

Date:

09 June 2013, 03:51:26

On 06/08/2013 10:52 PM, Noah Misch wrote:
> On Sat, Jun 08, 2013 at 08:09:15PM -0400, Robert Haas wrote:
>> On Sat, Jun 8, 2013 at 10:25 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>> Don't downcase non-ascii identifier chars in multi-byte encodings.
>>>
>>> Long-standing code has called tolower() on identifier character bytes
>>> with the high bit set. This is clearly an error and produces junk output
>>> when the encoding is multi-byte. This patch therefore restricts this
>>> activity to cases where there is a character with the high bit set AND
>>> the encoding is single-byte.
>>>
>>> There have been numerous gripes about this, most recently from Martin
>>> Sch?fer.
>>>
>>> Backpatch to all live releases.
>> I'm all for changing this, but back-patching seems like a terrible
>> idea.  This could easily break queries that are working now.
> If more than one encoding covers the characters used in a given application,
> that application's semantics should be the same regardless of which of those
> encodings is in use.  We certainly don't _guarantee_ that today; PostgreSQL
> leaves much to libc, which may not implement the relevant locales compatibly.
> However, this change bakes into PostgreSQL itself a departure from that
> principle.  If a database contains tables "ä" and "Ä", which of those "SELECT
> * FROM Ä" finds will be encoding-dependent.  If we're going to improve the
> current (granted, worse) downcase_truncate_identifier() behavior, we should
> not adopt another specification bearing such surprises.
>
> Let's return to the drawing board on this one.  I would be inclined to keep
> the current bad behavior until we implement the i18n-aware case folding
> required by SQL.  If I'm alone in thinking that, perhaps switch to downcasing
> only ASCII characters regardless of the encoding.  That at least gives
> consistent application behavior.
>
> I apologize for not noticing to comment on this week's thread.
>

The behaviour which this fixes is an unambiguous bug. Calling tolower()
on the individual bytes of a multi-byte character can't possibly produce
any sort of correct result. A database that contains such corrupted
names, probably not valid in any encoding at all, is almost certainly
not restorable, and I'm not sure if it's dumpable either. It's already
produced several complaints in recent months, so ISTM that returning to
it for any period of time is unthinkable.

cheers

andrew

Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding

From

Noah Misch

Date:

09 June 2013, 04:38:31

On Sat, Jun 08, 2013 at 11:50:53PM -0400, Andrew Dunstan wrote:
> On 06/08/2013 10:52 PM, Noah Misch wrote:

>> Let's return to the drawing board on this one.  I would be inclined to keep
>> the current bad behavior until we implement the i18n-aware case folding
>> required by SQL.  If I'm alone in thinking that, perhaps switch to downcasing
>> only ASCII characters regardless of the encoding.  That at least gives
>> consistent application behavior.
>>
>> I apologize for not noticing to comment on this week's thread.
>>
>
> The behaviour which this fixes is an unambiguous bug. Calling tolower()  
> on the individual bytes of a multi-byte character can't possibly produce  
> any sort of correct result. A database that contains such corrupted  
> names, probably not valid in any encoding at all, is almost certainly  
> not restorable, and I'm not sure if it's dumpable either.

I agree with each of those points.  However, since any change here breaks
compatibility, we should fix it right the first time.  A second compatibility
break would be all the more onerous once this intermediate step helps more
users to start using unquoted, non-ASCII object names.

> It's already  
> produced several complaints in recent months, so ISTM that returning to  
> it for any period of time is unthinkable.

PostgreSQL has lived with this wrong behavior since ... the beginning?  It's a
problem, certainly, but a bandage fix brings its own trouble.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com

Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding

From

Andrew Dunstan

Date:

09 June 2013, 13:51:45

On 06/09/2013 12:38 AM, Noah Misch wrote:
> On Sat, Jun 08, 2013 at 11:50:53PM -0400, Andrew Dunstan wrote:
>> On 06/08/2013 10:52 PM, Noah Misch wrote:
>>> Let's return to the drawing board on this one.  I would be inclined to keep
>>> the current bad behavior until we implement the i18n-aware case folding
>>> required by SQL.  If I'm alone in thinking that, perhaps switch to downcasing
>>> only ASCII characters regardless of the encoding.  That at least gives
>>> consistent application behavior.
>>>
>>> I apologize for not noticing to comment on this week's thread.
>>>
>> The behaviour which this fixes is an unambiguous bug. Calling tolower()
>> on the individual bytes of a multi-byte character can't possibly produce
>> any sort of correct result. A database that contains such corrupted
>> names, probably not valid in any encoding at all, is almost certainly
>> not restorable, and I'm not sure if it's dumpable either.
> I agree with each of those points.  However, since any change here breaks
> compatibility, we should fix it right the first time.  A second compatibility
> break would be all the more onerous once this intermediate step helps more
> users to start using unquoted, non-ASCII object names.
>
>> It's already
>> produced several complaints in recent months, so ISTM that returning to
>> it for any period of time is unthinkable.
> PostgreSQL has lived with this wrong behavior since ... the beginning?  It's a
> problem, certainly, but a bandage fix brings its own trouble.


If you have a better fix I am all ears. I can recall at least one 
discussion of this area (concerning Turkish I quite a few years ago) 
where we failed to come up with anything.

I have a fairly hard time believing in your "relies on this and somehow 
works" scenario.

cheers

andrew

Re: Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding

From

Tom Lane

Date:

09 June 2013, 15:40:08

Andrew Dunstan <andrew@dunslane.net> writes:
> On 06/09/2013 12:38 AM, Noah Misch wrote:
>> PostgreSQL has lived with this wrong behavior since ... the beginning?  It's a
>> problem, certainly, but a bandage fix brings its own trouble.

I don't see this as particularly bandage-y.  It's a subset of the
spec-required folding behavior, sure, but at least now it's a proper
subset of that behavior.  It preserves all cases in which the previous
coding did the right thing, while removing some cases in which it
didn't.

> If you have a better fix I am all ears. I can recall at least one 
> discussion of this area (concerning Turkish I quite a few years ago) 
> where we failed to come up with anything.

Yeah, Turkish handling of i/I messes up any attempt to use the locale's
case-folding rules straightforwardly.  However, I think we've already
fixed that with the rule that ASCII characters are folded manually.
The resistance to moving this code to use towlower() for non-ASCII
mainly comes from worries about speed, I think; although there was also
something about downcasing conversions that change the string's byte
length being problematic for some callers.

> I have a fairly hard time believing in your "relies on this and somehow 
> works" scenario.

The key point for me is that if tolower() actually does anything in the
previous state of the code, it's more than likely going to produce
invalidly encoded data.  The consequences of that can't be good.
You can argue that there might be people out there for whom the
transformation accidentally produced a validly-encoded string, but how
likely is that really?  It seems much more likely that the only reason
we've not had more complaints is that on most popular platforms, the
code accidentally fails to fire on any UTF8 characters (or any common
ones, anyway).  On those platforms, there will be no change of behavior.
        regards, tom lane

Re: Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding

From

Noah Misch

Date:

11 June 2013, 03:15:21

On Sun, Jun 09, 2013 at 11:39:18AM -0400, Tom Lane wrote:
> The key point for me is that if tolower() actually does anything in the
> previous state of the code, it's more than likely going to produce
> invalidly encoded data.  The consequences of that can't be good.
> You can argue that there might be people out there for whom the
> transformation accidentally produced a validly-encoded string, but how
> likely is that really?  It seems much more likely that the only reason
> we've not had more complaints is that on most popular platforms, the
> code accidentally fails to fire on any UTF8 characters (or any common
> ones, anyway).  On those platforms, there will be no change of behavior.

Your hypothesis is that almost all libc tolower() implementations will in
every case either (a) turn a multi-byte character to byte soup not valid in
the server encoding or (b) leave it unchanged?  Quite possible.  If that
hypothesis holds, I agree that the committed change does not break
compatibility.  That carries a certain appeal.

I still anticipate regretting that we have approved and made reliable this
often-sufficed-by-accident behavior, particularly when the SQL standard calls
for something else.  But I think I now understand your reasoning.

> The resistance to moving this code to use towlower() for non-ASCII
> mainly comes from worries about speed, I think; although there was also
> something about downcasing conversions that change the string's byte
> length being problematic for some callers.

Considering that using ASCII-only or quoted identifiers sidesteps the speed
penalty altogether, that seems a poor cause for demur.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com