Thread: Multi-byte character case-folding

Multi-byte character case-folding

From
Thom Brown
Date:
Hi,

At the moment, only single-byte characters in identifiers are
case-folded, and multi-byte characters are not.

For example, the unquoted identifier abĉDĚF is case-folded to "abĉdĚf".
It can then be referenced unquoted as abĉdĚf or ABĉDĚF, but not as
abĉděf or ABĈDĚF.

downcase_identifier() has the following comment:

        /*
         * SQL99 specifies Unicode-aware case normalization, which we don't yet
         * have the infrastructure for.  Instead we use tolower() to provide a
         * locale-aware translation.  However, there are some locales where this
         * is not right either (eg, Turkish may do strange things with 'i' and
         * 'I').  Our current compromise is to use tolower() for characters with
         * the high bit set, as long as they aren't part of a multi-byte
         * character, and use an ASCII-only downcasing for 7-bit characters.
         */
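
The rule this comment describes can be modelled byte by byte. What follows is a minimal sketch in Python, not the PostgreSQL source: the function reuses the name downcase_identifier only for readability, SINGLE_BYTE_CODECS is an illustrative subset, and str.lower() on a single decoded byte is an assumed stand-in for a locale-aware tolower().

```python
# Illustrative subset of single-byte encodings (Python codec names).
SINGLE_BYTE_CODECS = {"iso8859_1", "iso8859_9", "iso8859_15"}


def downcase_identifier(ident: str, codec: str) -> str:
    """Model of the compromise; not the real PostgreSQL function."""
    single_byte = codec in SINGLE_BYTE_CODECS
    out = bytearray()
    for b in ident.encode(codec):
        if 0x41 <= b <= 0x5A:
            # 7-bit 'A'..'Z': plain ASCII downcasing, independent of locale.
            b += 0x20
        elif single_byte and b >= 0x80:
            # High-bit byte in a single-byte encoding: stand-in for tolower().
            lowered = bytes([b]).decode(codec).lower().encode(codec, "ignore")
            if len(lowered) == 1:
                b = lowered[0]
        # Bytes belonging to multi-byte characters fall through unchanged.
        out.append(b)
    return out.decode(codec)


print(downcase_identifier("abĉDĚF", "utf_8"))         # ĉ and Ě are multi-byte here
print(downcase_identifier("MYÉCLASS", "iso8859_15"))  # É is a single byte here
```

In this model the same character is folded or left alone depending only on how many bytes the encoding needs for it.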

So my question is, do we yet have the infrastructure to make
case-folding consistent across all character widths?

Thanks

Thom



Re: Multi-byte character case-folding

From
Tom Lane
Date:
Thom Brown <thom@linux.com> writes:
> At the moment, only single-byte characters in identifiers are
> case-folded, and multi-byte characters are not.
> ...
> So my question is, do we yet have the infrastructure to make
> case-folding consistent across all character widths?

We still lack any built-in knowledge about this, and would have to rely
on libc, which means the results would likely be platform-dependent
and probably LC_CTYPE-dependent.

More generally, I'd be mighty hesitant to change this behavior after
it's stood for so many years.  I suspect more people would complain
that we broke their application than would be happy about it.

Having said that, we are already relying on towlower() in places,
and could do similarly here if we didn't care about the above issues.

            regards, tom lane



Re: Multi-byte character case-folding

From
Alvaro Herrera
Date:
On 2020-Jul-06, Tom Lane wrote:

> More generally, I'd be mighty hesitant to change this behavior after
> it's stood for so many years.  I suspect more people would complain
> that we broke their application than would be happy about it.
> 
> Having said that, we are already relying on towlower() in places,
> and could do similarly here if we didn't care about the above issues.

I think the fact that identifiers fail to follow language-specific case
folding rules is more a known gotcha than a desired property, but on
principle I tend to agree that Turkish people would not be happy about
the prospect of us changing the downcasing rule in a major release -- it
would mean having to edit any affected application code as part of a
pg_upgrade process, which is not great.

Now you could say that this can be fixed by adding a GUC that preserves
the old behavior, but generally we don't like that too much.

The counter argument is that there are more future users than there are
current users.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Multi-byte character case-folding

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-Jul-06, Tom Lane wrote:
>> More generally, I'd be mighty hesitant to change this behavior after
>> it's stood for so many years.  I suspect more people would complain
>> that we broke their application than would be happy about it.

> I think the fact that identifiers fail to follow language-specific case
> folding rules is more a known gotcha than a desired property, but on
> principle I tend to agree that Turkish people would not be happy about
> the prospect of us changing the downcasing rule in a major release -- it
> would mean having to edit any affected application code as part of a
> pg_upgrade process, which is not great.

It's not just the Turks.  As near as I can tell, we'd likely break *every*
app that's using such identifiers.  For example, supposing I do

test=# create table MYÉCLASS (f1 text);
CREATE TABLE
test=# \dt
          List of relations
 Schema |   Name   | Type  |  Owner   
--------+----------+-------+----------
 public | myÉclass | table | postgres
(1 row)

pg_dump will render this as

CREATE TABLE public."myÉclass" (
    f1 text
);

If we start to case-fold É, then the only way to access this table will
be by double-quoting its name, which the application probably is not
expecting (else it would have double-quoted in the original CREATE TABLE).
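
The breakage described here can be sketched with a toy catalog. This is an illustration under stated assumptions, not PostgreSQL code: fold_current models today's ASCII-only folding in a multi-byte database, fold_unicode is a hypothetical Unicode-aware replacement (str.lower() as a stand-in), and quote_for_dump is a simplified quoting rule that ignores keywords.

```python
import string

ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
SAFE_CHARS = string.ascii_lowercase + string.digits + "_"


def fold_current(ident: str) -> str:
    """ASCII-only folding, as applied in a multi-byte database today."""
    return ident.translate(ASCII_FOLD)


def fold_unicode(ident: str) -> str:
    """Hypothetical Unicode-aware folding."""
    return ident.lower()


def quote_for_dump(name: str) -> str:
    """Quote unless the name is plain lowercase ASCII (keyword checks omitted)."""
    safe = (name != ""
            and name[0] in string.ascii_lowercase + "_"
            and all(c in SAFE_CHARS for c in name))
    return name if safe else '"' + name.replace('"', '""') + '"'


stored = fold_current("MYÉCLASS")    # name recorded at CREATE TABLE time
dumped = quote_for_dump(stored)      # the dump preserves that exact name
catalog_after_upgrade = {stored}

# After a hypothetical folding change, the unquoted spelling no longer matches.
unquoted_lookup = fold_unicode("MYÉCLASS")
print(dumped, unquoted_lookup in catalog_after_upgrade)
```

Only the double-quoted spelling still finds the table in this model, which is the compatibility hazard being described.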

> Now you could say that this can be fixed by adding a GUC that preserves
> the old behavior, but generally we don't like that too much.

Yes, a GUC changing this would be a headache.  It would be just as much of
a compatibility and security hazard as standard_conforming_strings (which
indeed I've been thinking of proposing that we get rid of; it's hung
around long enough).

> The counter argument is that there are more future users than there are
> current users.

Especially if we drive away the current users :-(.  In practice, we've
heard very very few complaints about this, so my gut says to leave
it alone.

            regards, tom lane



Re: Multi-byte character case-folding

From
"Daniel Verite"
Date:
Tom Lane wrote:

> CREATE TABLE public."myÉclass" (
>    f1 text
> );
>
> If we start to case-fold É, then the only way to access this table will
> be by double-quoting its name, which the application probably is not
> expecting (else it would have double-quoted in the original CREATE TABLE).

This problem already exists when migrating from a mono-byte database
to a multi-byte database, since downcase_identifier() does use
tolower() for mono-byte databases.

db9=# show server_encoding ;
 server_encoding
-----------------
 LATIN9
(1 row)

db9=# create table MYÉCLASS (f1 text);
CREATE TABLE

db9=# \d
      List of relations
 Schema |   Name   | Type  |  Owner
--------+----------+-------+----------
 public | myéclass | table | postgres
(1 row)

db9=# select * from MYÉCLASS;
 f1
----
(0 rows)

pg_dump will dump this as

CREATE TABLE public."myéclass" (
    f1 text
);

So far so good. But after importing this into a UTF-8 database,
the same "select * from MYÉCLASS" that used to work now fails:

u8=# show server_encoding ;
 server_encoding
-----------------
 UTF8
(1 row)

u8=# select * from MYÉCLASS;
ERROR:  relation "myÉclass" does not exist
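
The two sessions above can be reproduced with a small model. It is a sketch only: fold() is a hypothetical helper that treats "UTF8" as ASCII-only folding and any other server encoding as fully folded (str.lower() standing in for tolower()), and the catalogs are plain Python sets.

```python
import string

ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def fold(ident: str, server_encoding: str) -> str:
    """Fold an unquoted identifier the way the given database would."""
    if server_encoding == "UTF8":
        return ident.translate(ASCII_FOLD)  # multi-byte: ASCII letters only
    return ident.lower()                    # single-byte: stand-in for tolower()


def relation_exists(catalog: set, ident: str, server_encoding: str) -> bool:
    return fold(ident, server_encoding) in catalog


# CREATE TABLE MYÉCLASS in the LATIN9 database.
db9 = {fold("MYÉCLASS", "LATIN9")}
# Dump and restore carries the quoted name over verbatim.
u8 = set(db9)

print(relation_exists(db9, "MYÉCLASS", "LATIN9"))
print(relation_exists(u8, "MYÉCLASS", "UTF8"))
```

The restored table is still named "myéclass", but the UTF-8 database folds the unquoted query to "myÉclass", so the lookup misses.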


The compromise described in the downcase_identifier() comment does not
convincingly justify this inconsistency, because the case-folding issues
caused by linguistic differences exist in both mono-byte and multi-byte
encodings. For instance, if it's fine to trust the locale to downcase
'İ' in a LATIN5 db, it should be okay in a UTF-8 db too.
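
As a small illustration of that point, the sketch below compares a locale-independent lowercasing with Turkish rules. TURKISH_LOWER is a hand-written table used only for illustration (it is not a libc or PostgreSQL call), and Python's str.lower() stands in for a default, locale-independent mapping.

```python
# Turkish pairs: dotted İ/i and dotless I/ı (hand-written, for illustration).
TURKISH_LOWER = {"I": "ı", "İ": "i"}


def lower_turkish(text: str) -> str:
    """Lowercase using the Turkish dotted/dotless i rules."""
    return "".join(TURKISH_LOWER.get(ch, ch.lower()) for ch in text)


def lower_default(text: str) -> str:
    """Locale-independent lowercasing."""
    return text.lower()


# The character is one byte in LATIN5 (ISO-8859-9) and two bytes in UTF-8,
# but the linguistic question is the same in both encodings.
print(len("İ".encode("iso8859_9")), len("İ".encode("utf_8")))
print(lower_default("TITLE"), lower_turkish("TITLE"))
print([hex(ord(c)) for c in lower_default("İ")])
```

The byte width changes with the encoding, while the disagreement between the two lowercasing rules does not.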


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite



Re: Multi-byte character case-folding

From
Tom Lane
Date:
"Daniel Verite" <daniel@manitou-mail.org> writes:
>     Tom Lane wrote:
>> If we start to case-fold É, then the only way to access this table will
>> be by double-quoting its name, which the application probably is not
>> expecting (else it would have double-quoted in the original CREATE TABLE).

> This problem already exists when migrating from a mono-byte database
> to a multi-byte database, since downcase_identifier()  does use
> tolower() for mono-byte databases.

Sure, but that's a tiny minority of use-cases.  In particular it would
not bite you after a straight upgrade to a new PG version.

[ thinks... ]  Wait, actually the described case would occur if you
migrated *from* UTF8 (no folding) to LATINn (with folding).  That's
gotta be an even tinier minority.  Migration to UTF8 would show
different, though perhaps just as annoying, symptoms.

Anyway, I freely concede that I'm ill-equipped to judge how annoying
this is, since I don't program in any languages where it'd make a
difference.  But we mustn't fool ourselves: changing this would be
just as dangerous as the standard_conforming_strings changeover was.
I'm not really convinced it's worth it.  In particular, I don't find
the "it's required by the standard" argument convincing.  The standard
requires us to fold to upper case, too, but we've long ago decided to
just say no to that.  (Which reminds me: there are extensive threads in
the archives analyzing whether it's practical to support more than one
folding behavior.  Those discussions would likely be relevant here.)

            regards, tom lane



Re: Multi-byte character case-folding

From
Robert Haas
Date:
On Mon, Jul 6, 2020 at 8:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> test=# create table MYÉCLASS (f1 text);
> CREATE TABLE
> test=# \dt
>           List of relations
>  Schema |   Name   | Type  |  Owner
> --------+----------+-------+----------
>  public | myÉclass | table | postgres
> (1 row)
>
> pg_dump will render this as
>
> CREATE TABLE public."myÉclass" (
>     f1 text
> );
>
> If we start to case-fold É, then the only way to access this table will
> be by double-quoting its name, which the application probably is not
> expecting (else it would have double-quoted in the original CREATE TABLE).

While this is true, it's also pretty hard to imagine a user being
satisfied with a table that ends up with this kind of mixed-case name.

That's not to say that I have any good idea what to do about this. I
just disagree with labelling the above case as a success.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multi-byte character case-folding

From
Alvaro Herrera
Date:
On 2020-Jul-08, Robert Haas wrote:

> That's not to say that I have any good idea what to do about this. I
> just disagree with labelling the above case as a success.

Yeah, particularly since it works differently in single-byte encodings.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Multi-byte character case-folding

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> That's not to say that I have any good idea what to do about this. I
> just disagree with labelling the above case as a success.

I can't say that I like it either.  But I'm afraid that changing it now
will create many more problems than it solves.

            regards, tom lane



Re: Multi-byte character case-folding

From
Bruce Momjian
Date:
On Mon, Jul  6, 2020 at 08:32:22PM -0400, Tom Lane wrote:
> Yes, a GUC changing this would be a headache.  It would be just as much of
> a compatibility and security hazard as standard_conforming_strings (which
> indeed I've been thinking of proposing that we get rid of; it's hung
> around long enough).

+1

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee