Re: multibyte-character aware support for function "downcase_truncate_identifier()" - Mailing list pgsql-hackers

From Tom Lane
Subject Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Msg-id 11120.1290532369@sss.pgh.pa.us
In response to Re: multibyte-character aware support for function "downcase_truncate_identifier()"  (Greg Stark <gsstark@mit.edu>)
Responses Re: multibyte-character aware support for function "downcase_truncate_identifier()"
List pgsql-hackers
Greg Stark <gsstark@mit.edu> writes:
> On Mon, Nov 22, 2010 at 12:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, that's why there's been no movement on this since 2004 :-(.  The
>> amount of work needed for a better solution seems far out of proportion
>> to the benefits.

> We could extend the existing logic to handle multibyte characters
> though, couldn't we? It's not going to fix all the problems but at
> least it'll do something sane.

Not easily, cheaply, or portably.  The closest you could get in that
line would be to use towlower(), which doesn't exist everywhere
(though I grant probably most platforms have it by now).  The much much
bigger problem though is that we don't know what character representation
towlower() deals in.  We recently kluged the regex code to assume that
the wchar_t representation for UTF8 locales is the standardized Unicode
code point.  I haven't heard of that breaking, but 9.0 hasn't been out
that long.  In other multibyte encodings we have no idea how to use that
function, short of invoking mbstowcs/wcstombs or local equivalent, which
is expensive and doesn't readily allow a short-circuit for ASCII.

And, after you've hacked your way through all that, you still end up
with case-folding behavior that depends on the prevailing locale.
Which is dangerous for the previously cited reasons, and arguably not
spec-compliant.
        regards, tom lane
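
For illustration only, here is a minimal sketch of the approach the message
argues against: an ASCII short-circuit followed by an mbstowcs/towlower/wcstombs
round trip. The function name downcase_sketch and its return convention are
hypothetical and are not PostgreSQL code. It assumes the input is valid in the
encoding of the current LC_CTYPE locale, and its non-ASCII result depends on
that locale, which is the hazard described above.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper, not part of PostgreSQL: lower-case src into dst.
 * Returns 0 on success, -1 if dst is too small or a conversion fails.
 */
int
downcase_sketch(const char *src, char *dst, size_t dstlen)
{
    size_t      len = strlen(src);
    size_t      i;
    int         all_ascii = 1;

    /* Check whether any byte has the high bit set. */
    for (i = 0; i < len; i++)
    {
        if ((unsigned char) src[i] >= 0x80)
        {
            all_ascii = 0;
            break;
        }
    }

    if (all_ascii)
    {
        /* Fast path: locale-independent folding of A-Z, no conversion. */
        if (len + 1 > dstlen)
            return -1;
        for (i = 0; i < len; i++)
        {
            char        ch = src[i];

            if (ch >= 'A' && ch <= 'Z')
                ch += 'a' - 'A';
            dst[i] = ch;
        }
        dst[len] = '\0';
        return 0;
    }

    /*
     * Slow path: round trip through wchar_t.  The outcome depends on the
     * LC_CTYPE locale in effect, and the wchar_t representation is
     * platform-specific.
     */
    {
        wchar_t    *wbuf;
        size_t      nwide;
        size_t      needed;

        /* A multibyte string of len bytes has at most len characters. */
        wbuf = malloc((len + 1) * sizeof(wchar_t));
        if (wbuf == NULL)
            return -1;

        nwide = mbstowcs(wbuf, src, len + 1);
        if (nwide == (size_t) -1)
        {
            free(wbuf);
            return -1;          /* invalid sequence for this locale */
        }

        for (i = 0; i < nwide; i++)
            wbuf[i] = (wchar_t) towlower((wint_t) wbuf[i]);

        /* Ask for the required byte count before writing into dst. */
        needed = wcstombs(NULL, wbuf, 0);
        if (needed == (size_t) -1 || needed + 1 > dstlen)
        {
            free(wbuf);
            return -1;
        }
        wcstombs(dst, wbuf, dstlen);
        free(wbuf);
        return 0;
    }
}
```

A caller would need to run setlocale(LC_CTYPE, "") first for the slow path to
see anything other than the "C" locale. Even then the same identifier can fold
differently under different locales, so the sketch shows the cost and the
locale dependence rather than a fix for either.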

