Thread: Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding
Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding
From
Robert Haas
Date:
On Sat, Jun 8, 2013 at 10:25 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > Don't downcase non-ascii identifier chars in multi-byte encodings. > > Long-standing code has called tolower() on identifier character bytes > with the high bit set. This is clearly an error and produces junk output > when the encoding is multi-byte. This patch therefore restricts this > activity to cases where there is a character with the high bit set AND > the encoding is single-byte. > > There have been numerous gripes about this, most recently from Martin > Schäfer. > > Backpatch to all live releases. I'm all for changing this, but back-patching seems like a terrible idea. This could easily break queries that are working now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding
From
Noah Misch
Date:
On Sat, Jun 08, 2013 at 08:09:15PM -0400, Robert Haas wrote: > On Sat, Jun 8, 2013 at 10:25 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > > Don't downcase non-ascii identifier chars in multi-byte encodings. > > > > Long-standing code has called tolower() on identifier character bytes > > with the high bit set. This is clearly an error and produces junk output > > when the encoding is multi-byte. This patch therefore restricts this > > activity to cases where there is a character with the high bit set AND > > the encoding is single-byte. > > > > There have been numerous gripes about this, most recently from Martin > > Sch?fer. > > > > Backpatch to all live releases. > > I'm all for changing this, but back-patching seems like a terrible > idea. This could easily break queries that are working now. If more than one encoding covers the characters used in a given application, that application's semantics should be the same regardless of which of those encodings is in use. We certainly don't _guarantee_ that today; PostgreSQL leaves much to libc, which may not implement the relevant locales compatibly. However, this change bakes into PostgreSQL itself a departure from that principle. If a database contains tables "ä" and "Ä", which of those "SELECT * FROM Ä" finds will be encoding-dependent. If we're going to improve the current (granted, worse) downcase_truncate_identifier() behavior, we should not adopt another specification bearing such surprises. Let's return to the drawing board on this one. I would be inclined to keep the current bad behavior until we implement the i18n-aware case folding required by SQL. If I'm alone in thinking that, perhaps switch to downcasing only ASCII characters regardless of the encoding. That at least gives consistent application behavior. I apologize for not noticing to comment on this week's thread. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding
From
Andrew Dunstan
Date:
On 06/08/2013 10:52 PM, Noah Misch wrote: > On Sat, Jun 08, 2013 at 08:09:15PM -0400, Robert Haas wrote: >> On Sat, Jun 8, 2013 at 10:25 AM, Andrew Dunstan <andrew@dunslane.net> wrote: >>> Don't downcase non-ascii identifier chars in multi-byte encodings. >>> >>> Long-standing code has called tolower() on identifier character bytes >>> with the high bit set. This is clearly an error and produces junk output >>> when the encoding is multi-byte. This patch therefore restricts this >>> activity to cases where there is a character with the high bit set AND >>> the encoding is single-byte. >>> >>> There have been numerous gripes about this, most recently from Martin >>> Sch?fer. >>> >>> Backpatch to all live releases. >> I'm all for changing this, but back-patching seems like a terrible >> idea. This could easily break queries that are working now. > If more than one encoding covers the characters used in a given application, > that application's semantics should be the same regardless of which of those > encodings is in use. We certainly don't _guarantee_ that today; PostgreSQL > leaves much to libc, which may not implement the relevant locales compatibly. > However, this change bakes into PostgreSQL itself a departure from that > principle. If a database contains tables "ä" and "Ä", which of those "SELECT > * FROM Ä" finds will be encoding-dependent. If we're going to improve the > current (granted, worse) downcase_truncate_identifier() behavior, we should > not adopt another specification bearing such surprises. > > Let's return to the drawing board on this one. I would be inclined to keep > the current bad behavior until we implement the i18n-aware case folding > required by SQL. If I'm alone in thinking that, perhaps switch to downcasing > only ASCII characters regardless of the encoding. That at least gives > consistent application behavior. > > I apologize for not noticing to comment on this week's thread. > The behaviour which this fixes is an unambiguous bug. Calling tolower() on the individual bytes of a multi-byte character can't possibly produce any sort of correct result. A database that contains such corrupted names, probably not valid in any encoding at all, is almost certainly not restorable, and I'm not sure if it's dumpable either. It's already produced several complaints in recent months, so ISTM that returning to it for any period of time is unthinkable. cheers andrew
Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding
From
Noah Misch
Date:
On Sat, Jun 08, 2013 at 11:50:53PM -0400, Andrew Dunstan wrote: > On 06/08/2013 10:52 PM, Noah Misch wrote: >> Let's return to the drawing board on this one. I would be inclined to keep >> the current bad behavior until we implement the i18n-aware case folding >> required by SQL. If I'm alone in thinking that, perhaps switch to downcasing >> only ASCII characters regardless of the encoding. That at least gives >> consistent application behavior. >> >> I apologize for not noticing to comment on this week's thread. >> > > The behaviour which this fixes is an unambiguous bug. Calling tolower() > on the individual bytes of a multi-byte character can't possibly produce > any sort of correct result. A database that contains such corrupted > names, probably not valid in any encoding at all, is almost certainly > not restorable, and I'm not sure if it's dumpable either. I agree with each of those points. However, since any change here breaks compatibility, we should fix it right the first time. A second compatibility break would be all the more onerous once this intermediate step helps more users to start using unquoted, non-ASCII object names. > It's already > produced several complaints in recent months, so ISTM that returning to > it for any period of time is unthinkable. PostgreSQL has lived with this wrong behavior since ... the beginning? It's a problem, certainly, but a bandage fix brings its own trouble. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding
From
Andrew Dunstan
Date:
On 06/09/2013 12:38 AM, Noah Misch wrote: > On Sat, Jun 08, 2013 at 11:50:53PM -0400, Andrew Dunstan wrote: >> On 06/08/2013 10:52 PM, Noah Misch wrote: >>> Let's return to the drawing board on this one. I would be inclined to keep >>> the current bad behavior until we implement the i18n-aware case folding >>> required by SQL. If I'm alone in thinking that, perhaps switch to downcasing >>> only ASCII characters regardless of the encoding. That at least gives >>> consistent application behavior. >>> >>> I apologize for not noticing to comment on this week's thread. >>> >> The behaviour which this fixes is an unambiguous bug. Calling tolower() >> on the individual bytes of a multi-byte character can't possibly produce >> any sort of correct result. A database that contains such corrupted >> names, probably not valid in any encoding at all, is almost certainly >> not restorable, and I'm not sure if it's dumpable either. > I agree with each of those points. However, since any change here breaks > compatibility, we should fix it right the first time. A second compatibility > break would be all the more onerous once this intermediate step helps more > users to start using unquoted, non-ASCII object names. > >> It's already >> produced several complaints in recent months, so ISTM that returning to >> it for any period of time is unthinkable. > PostgreSQL has lived with this wrong behavior since ... the beginning? It's a > problem, certainly, but a bandage fix brings its own trouble. If you have a better fix I am all ears. I can recall at least one discussion of this area (concerning Turkish I quite a few years ago) where we failed to come up with anything. I have a fairly hard time believing in your "relies on this and somehow works" scenario. cheers andrew
Re: Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding
From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes: > On 06/09/2013 12:38 AM, Noah Misch wrote: >> PostgreSQL has lived with this wrong behavior since ... the beginning? It's a >> problem, certainly, but a bandage fix brings its own trouble. I don't see this as particularly bandage-y. It's a subset of the spec-required folding behavior, sure, but at least now it's a proper subset of that behavior. It preserves all cases in which the previous coding did the right thing, while removing some cases in which it didn't. > If you have a better fix I am all ears. I can recall at least one > discussion of this area (concerning Turkish I quite a few years ago) > where we failed to come up with anything. Yeah, Turkish handling of i/I messes up any attempt to use the locale's case-folding rules straightforwardly. However, I think we've already fixed that with the rule that ASCII characters are folded manually. The resistance to moving this code to use towlower() for non-ASCII mainly comes from worries about speed, I think; although there was also something about downcasing conversions that change the string's byte length being problematic for some callers. > I have a fairly hard time believing in your "relies on this and somehow > works" scenario. The key point for me is that if tolower() actually does anything in the previous state of the code, it's more than likely going to produce invalidly encoded data. The consequences of that can't be good. You can argue that there might be people out there for whom the transformation accidentally produced a validly-encoded string, but how likely is that really? It seems much more likely that the only reason we've not had more complaints is that on most popular platforms, the code accidentally fails to fire on any UTF8 characters (or any common ones, anyway). On those platforms, there will be no change of behavior. regards, tom lane
Re: Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding
From
Noah Misch
Date:
On Sun, Jun 09, 2013 at 11:39:18AM -0400, Tom Lane wrote: > The key point for me is that if tolower() actually does anything in the > previous state of the code, it's more than likely going to produce > invalidly encoded data. The consequences of that can't be good. > You can argue that there might be people out there for whom the > transformation accidentally produced a validly-encoded string, but how > likely is that really? It seems much more likely that the only reason > we've not had more complaints is that on most popular platforms, the > code accidentally fails to fire on any UTF8 characters (or any common > ones, anyway). On those platforms, there will be no change of behavior. Your hypothesis is that almost all libc tolower() implementations will in every case either (a) turn a multi-byte character to byte soup not valid in the server encoding or (b) leave it unchanged? Quite possible. If that hypothesis holds, I agree that the committed change does not break compatibility. That carries a certain appeal. I still anticipate regretting that we have approved and made reliable this often-sufficed-by-accident behavior, particularly when the SQL standard calls for something else. But I think I now understand your reasoning. > The resistance to moving this code to use towlower() for non-ASCII > mainly comes from worries about speed, I think; although there was also > something about downcasing conversions that change the string's byte > length being problematic for some callers. Considering that using ASCII-only or quoted identifiers sidesteps the speed penalty altogether, that seems a poor cause for demur. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com