Thread: Corruption of multibyte identifiers on UTF-8 locale

Corruption of multibyte identifiers on UTF-8 locale

From
Victor Snezhko
Date:
Hello,

Looks like we have a more serious problem with multibyte identifiers.
When I run the following sequence of queries:

CREATE OR REPLACE FUNCTION CreateOrAlterTable()
RETURNS int
AS $$
BEGIN
  if not EXISTS(SELECT relname FROM pg_class WHERE relname ILIKE 'т1' AND relkind = 'r') then
    CREATE TABLE т1 (
           к1 int NOT NULL,
           PRIMARY KEY (к1)
    );
  end if;
  return 0;
END;
$$ LANGUAGE plpgsql;

SELECT CreateOrAlterTable();

CREATE OR REPLACE FUNCTION CreateOrAlterTable()
RETURNS int
AS $$
BEGIN
  if not EXISTS(SELECT relname FROM pg_class WHERE relname ILIKE 'т2' AND relkind = 'r') then
    CREATE TABLE т2 (
           к2 int NOT NULL,
           PRIMARY KEY (к2)
    );
  end if;
  return 0;
END;
$$ LANGUAGE plpgsql;

and then try to create the second table:

  SELECT CreateOrAlterTable();

this gives me the following error (on HEAD as well as patched 8.1.4):

ERROR:  invalid byte sequence for encoding "UTF8": 0xf18231
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT:  SQL statement "SELECT not EXISTS(SELECT relname FROM pg_class WHERE relname ILIKE '?1' AND relkind = 'r')"
PL/pgSQL function "createoraltertable" line 2 at if
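
For illustration, here is why 0xf18231 cannot be valid UTF-8: the lead byte 0xF1 announces a four-byte sequence, but the third byte, 0x31, is not a continuation byte of the form 10xxxxxx. A minimal sketch of such a structural check follows; the function names are made up for this example, and this is not the server's actual validation code.

```c
#include <assert.h>
#include <stddef.h>

/* Expected length of a UTF-8 sequence, judged from its lead byte.
 * Returns 0 if the byte cannot start a sequence. */
static size_t utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                       /* plain ASCII */
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;
    if (lead >= 0xE0 && lead <= 0xEF)
        return 3;
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;
    return 0;                           /* continuation byte or invalid lead */
}

/* Returns 1 if buf starts with a structurally valid UTF-8 character:
 * a lead byte followed by the right number of 10xxxxxx bytes.
 * Simplified on purpose: overlong forms and surrogates are not checked. */
static int utf8_first_char_ok(const unsigned char *buf, size_t len)
{
    size_t need;
    size_t i;

    assert(buf != NULL);
    if (len == 0)
        return 0;
    need = utf8_expected_len(buf[0]);
    if (need == 0 || need > len)
        return 0;
    for (i = 1; i < need; i++)
    {
        if ((buf[i] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
    }
    return 1;
}
```

With this check, the bytes D1 82 31 start with a valid two-byte character, while F1 82 31 are rejected.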

correct utf-8 byte sequence is 0xd18231, so it looks like we call
tolower() somewhere on parts of multibyte characters, and it does the
same as isspace() - it interprets its argument as a wide character and
converts it.
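
To make the suspected mechanism concrete, here is a small sketch. It ASSUMES that tolower() treats a high-bit byte as a code point in the Latin-1 range, so that 0xD1 (U+00D1) maps to 0xF1 (U+00F1). codepoint_style_tolower() is a hypothetical stand-in for that behaviour, not the libc implementation, and downcase_octets() is a made-up name for an octet-wise loop.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL model of the suspected tolower() behaviour: the argument
 * is taken as a code point, and U+00C0..U+00DE (except U+00D7, the
 * multiplication sign) are mapped to lowercase by adding 0x20. */
static int codepoint_style_tolower(int c)
{
    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return c + 0x20;
    return c;
}

/* Octet-wise downcasing: every byte is converted on its own, with no
 * knowledge of which multibyte character it belongs to. */
static void downcase_octets(unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++)
        buf[i] = (unsigned char) codepoint_style_tolower(buf[i]);
}
```

Applied to the bytes D1 82 31, this loop yields F1 82 31, which matches the sequence in the error message above.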

Simple CREATE TABLE statements work, as do CREATE TABLE statements called
inside a procedure without the "IF EXISTS" check.

So, we either don't support utf-8 on BSDs (BTW, this needs to be
checked on less popular BSD flavors) for now, or we need to fix this
somehow. E.g., by calling only wide-character checks, which will
complicate things...

--
WBR, Victor V. Snezhko
E-mail: snezhko@indorsoft.ru

Re: Corruption of multibyte identifiers on UTF-8 locale

From
Victor Snezhko
Date:
Victor Snezhko <snezhko@indorsoft.ru> writes:

> So, we either don't support utf-8 on BSDs

Hmm, tolower'ing octets of a multibyte string is a bug not only on
BSDs but on other architectures as well. But on BSDs it additionally
causes corruption of utf-8 data.

--
WBR, Victor V. Snezhko
E-mail: snezhko@indorsoft.ru

Re: Corruption of multibyte identifiers on UTF-8 locale

From
Tom Lane
Date:
Victor Snezhko <snezhko@indorsoft.ru> writes:
> correct utf-8 byte sequence is 0xd18231, so it looks like we call
> tolower() somewhere on parts of multibyte characters, and it does the
> same as isspace() - it interprets its argument as a wide character and
> converts it.

Indeed, and I am certainly wondering why we should not just say that
you've got a broken locale definition there.  There is absolutely no
doubt that the ctype.h functions are defined to work on char, not wchar.
They have no business mangling high-bit-set bytes in a multibyte
encoding.

            regards, tom lane

Re: Corruption of multibyte identifiers on UTF-8 locale

From
Victor Snezhko
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

>> correct utf-8 byte sequence is 0xd18231, so it looks like we call
>> tolower() somewhere on parts of multibyte characters, and it does the
>> same as isspace() - it interprets its argument as a wide character and
>> converts it.
>
> Indeed, and I am certainly wondering why we should not just say that
> you've got a broken locale definition there.  There is absolutely no
> doubt that the ctype.h functions are defined to work on char, not
> wchar.

Agreed, but such corruption indicates that there is non-multibyte-safe
(octet-wise) case conversion somewhere, at best (with fully working
locale) it will cause case conversion to do nothing instead of actual
conversion.

> They have no business mangling high-bit-set bytes in a multibyte
> encoding.

--
WBR, Victor V. Snezhko
E-mail: snezhko@indorsoft.ru

Re: Corruption of multibyte identifiers on UTF-8 locale

From
Tom Lane
Date:
Victor Snezhko <snezhko@indorsoft.ru> writes:
> Agreed, but such corruption indicates that there is non-multibyte-safe
> (octet-wise) case conversion somewhere, at best (with fully working
> locale) it will cause case conversion to do nothing instead of actual
> conversion.

Yours is the first installation I've heard of that fails to get this
right, and the code in question (downcase_truncate_identifier) has
been like that since PG 7.4.something ...

            regards, tom lane

Re: Corruption of multibyte identifiers on UTF-8 locale

From
Victor Snezhko
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

>> Agreed, but such corruption indicates that there is non-multibyte-safe
>> (octet-wise) case conversion somewhere, at best (with fully working
>> locale) it will cause case conversion to do nothing instead of actual
>> conversion.
>
> Yours is the first installation I've heard of that fails to get this
> right, and the code in question (downcase_truncate_identifier) has
> been like that since PG 7.4.something ...

This code from downcase_truncate_identifier():

    else if (ch >= 0x80 && isupper(ch))
        ch = tolower(ch);

just can't work on multibyte encodings unless tolower can magically
guess what unicode symbol it operates on (having only one octet of
it). On my (ok, broken) locale definition it corrupts multibyte
characters, on working locale defs it must fail to downcase
identifiers. Unless I'm again missing something obvious...

But, from the comment above:
 * SQL99 specifies Unicode-aware case normalization, which we don't yet
 * have the infrastructure for.

OK, a lot of work is required to fix it, I see. Are there any plans to
either switch to wide-char strings or do per-character (rather than
per-octet) processing?
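
As a rough sketch of what per-character processing could look like for UTF-8 only (the names seq_len() and downcase_ascii_per_char() are made up here, and this is not a proposed patch): step through the string one character at a time, downcase ASCII letters, and pass multibyte characters through untouched. It does not downcase non-ASCII letters, so it is not the Unicode-aware normalization the comment mentions; it only avoids corrupting the data.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the UTF-8 sequence starting at this lead byte. A stray
 * continuation byte is treated as a one-byte unit so the loop below
 * always makes progress. */
static size_t seq_len(unsigned char lead)
{
    if (lead >= 0xF0)
        return 4;
    if (lead >= 0xE0)
        return 3;
    if (lead >= 0xC0)
        return 2;
    return 1;
}

/* Walk the buffer character by character instead of octet by octet.
 * Only single-byte (ASCII) uppercase letters are changed. */
static void downcase_ascii_per_char(unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < len)
    {
        size_t n = seq_len(s[i]);

        if (n > len - i)
            n = len - i;                /* truncated sequence at the end */
        if (n == 1 && s[i] >= 'A' && s[i] <= 'Z')
            s[i] = (unsigned char) (s[i] + ('a' - 'A'));
        /* Multibyte characters are left unchanged; a Unicode-aware
         * case mapping would have to hook in here. */
        i += n;
    }
}
```

Under this sketch the bytes D1 82 31 come out unchanged, and in D0 A2 41 31 only the ASCII 'A' is lowered.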

--
WBR, Victor V. Snezhko
E-mail: snezhko@indorsoft.ru