On 06/08/2013 10:52 PM, Noah Misch wrote:
> On Sat, Jun 08, 2013 at 08:09:15PM -0400, Robert Haas wrote:
>> On Sat, Jun 8, 2013 at 10:25 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>> Don't downcase non-ascii identifier chars in multi-byte encodings.
>>>
>>> Long-standing code has called tolower() on identifier character bytes
>>> with the high bit set. This is clearly an error and produces junk output
>>> when the encoding is multi-byte. This patch therefore restricts this
>>> activity to cases where there is a character with the high bit set AND
>>> the encoding is single-byte.
>>>
>>> There have been numerous gripes about this, most recently from Martin
>>> Schäfer.
>>>
>>> Backpatch to all live releases.
>> I'm all for changing this, but back-patching seems like a terrible
>> idea. This could easily break queries that are working now.
> If more than one encoding covers the characters used in a given application,
> that application's semantics should be the same regardless of which of those
> encodings is in use. We certainly don't _guarantee_ that today; PostgreSQL
> leaves much to libc, which may not implement the relevant locales compatibly.
> However, this change bakes into PostgreSQL itself a departure from that
> principle. If a database contains tables "ä" and "Ä", which of those "SELECT
> * FROM Ä" finds will be encoding-dependent. If we're going to improve the
> current (granted, worse) downcase_truncate_identifier() behavior, we should
> not adopt another specification bearing such surprises.
>
> Let's return to the drawing board on this one. I would be inclined to keep
> the current bad behavior until we implement the i18n-aware case folding
> required by SQL. If I'm alone in thinking that, perhaps switch to downcasing
> only ASCII characters regardless of the encoding. That at least gives
> consistent application behavior.
>
> I apologize for not commenting during this week's thread.
>
The behaviour which this fixes is an unambiguous bug. Calling tolower()
on the individual bytes of a multi-byte character can't possibly produce
any sort of correct result. A database that contains such corrupted
names, probably not valid in any encoding at all, is almost certainly
not restorable, and I'm not sure if it's dumpable either. It's already
produced several complaints in recent months, so ISTM that returning to
it for any period of time is unthinkable.
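
To make the failure concrete, here is a rough sketch in Python rather than
the backend's C. It is only an illustration under stated assumptions: the
function names are hypothetical, and libc's tolower() under a Latin-1 style
locale is approximated by lowercasing the Latin-1 code point of each byte.
The first function mimics the old per-byte folding; the second mimics the
rule described in the commit message (always fold ASCII, fold high-bit bytes
only when the encoding is single-byte).

```python
def fold_bytewise_latin1(ident: bytes) -> bytes:
    """Approximate the old behaviour: tolower() applied to every byte,
    assuming a locale that treats each byte as a Latin-1 character."""
    out = bytearray()
    for b in ident:
        lowered = chr(b).lower()
        # Keep the folded value only if it is still a single byte.
        if len(lowered) == 1 and ord(lowered) < 256:
            out.append(ord(lowered))
        else:
            out.append(b)
    return bytes(out)


def fold_patched(ident: bytes, encoding_max_length: int) -> bytes:
    """Approximate the patched rule. encoding_max_length is a hypothetical
    parameter: 1 for a single-byte encoding, more for a multi-byte one."""
    out = bytearray()
    for b in ident:
        if ord("A") <= b <= ord("Z"):
            out.append(b + (ord("a") - ord("A")))   # plain ASCII fold
        elif b >= 0x80 and encoding_max_length == 1:
            out += fold_bytewise_latin1(bytes([b]))  # single-byte only
        else:
            out.append(b)                            # leave multi-byte alone
    return bytes(out)


utf8_name = "Ä".encode("utf-8")            # two bytes: 0xC3 0x84
mangled = fold_bytewise_latin1(utf8_name)   # lead byte changes to 0xE3

try:
    mangled.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False                     # the folded name is not UTF-8

kept = fold_patched("TÄBLE".encode("utf-8"), 4)
```

In this sketch the old folding turns the UTF-8 bytes of "Ä" into a sequence
that no longer decodes as UTF-8, while the patched rule leaves those bytes
untouched and still folds the surrounding ASCII letters.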

cheers

andrew