BUG #8105: names are transformed to lowercase incorrectly - Mailing list pgsql-bugs

From pg@kolesar.hu
Subject BUG #8105: names are transformed to lowercase incorrectly
Date
Msg-id E1UUHU1-0000iG-BT@wrigleys.postgresql.org
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      8105
Logged by:          András Kolesár
Email address:      pg@kolesar.hu
PostgreSQL version: 9.1.5
Operating system:   Windows
Description:


If I specify a Unicode field name without quotes, the field name gets
lowercased incorrectly. pgAdmin 1.14.2 on Linux, PostgreSQL server 9.1.5 on
Windows:

SELECT érték FROM (SELECT 1 AS "érték") AS x;

********** Error **********
SQL state: 42703
Character: 8

In the example above I specify a Unicode column name ("érték" means "value"
in Hungarian), then I try to read it. If I use double quotes in the outer
query, it works.

However, the above example works fine if the server runs on Linux:

"PostgreSQL 9.1.9 on i686-pc-linux-gnu, compiled by gcc (Ubuntu/Linaro
4.7.2-2ubuntu1) 4.7.2, 32-bit"

I see the same problem from a PHP client, which gives a more verbose error
message:

ERROR:  column "=EF=BF=BDrt=EF=BF=BDk" does not exist
LINE 1: SELECT =C3=A9rt=C3=A9k FROM (SELECT 1 AS "=C3=A9rt=C3=A9k") AS x
               ^

The "é" character is represented incorrectly in the error message, which
shows where the problem is. This character (U+00E9) is encoded in UTF-8 as
C3 A9. In the error message it has become an invalid UTF-8 sequence: E3 A9.
I think Windows uses the Windows-1250 or Windows-1252 character set, where
C3 lowercases to E3. A9 survives tolower() because it means © (copyright
sign) in these charsets, which has no lowercase counterpart.

I have localized the problem in PostgreSQL source:
src/backend/parser/scansup.c:128

char *
downcase_truncate_identifier(const char *ident, int len, bool warn) {
// ...
for (i = 0; i < len; i++)
// ...
    if (IS_HIGHBIT_SET(ch) && isupper(ch))
        ch = tolower(ch);

This function walks through identifiers byte by byte and lowercases each
byte as if it were an individual character. This is incorrect in multibyte
character sets. It works on Linux with UTF8 system encoding because
isupper() returns false for both C3 and A9.
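
To make the corruption concrete, here is a minimal sketch that reproduces
the byte transformation without depending on the machine's locale. The
helper names cp1252_tolower_model() and bytewise_downcase() are hypothetical
and not part of PostgreSQL; the first is only a simplified model of what
tolower() is assumed to do under a Windows-1252 locale, covering ASCII A-Z
and the 0xC0-0xDE block.

```c
#include <stddef.h>

/*
 * Hypothetical, simplified model of tolower() in a Windows-1252 locale.
 * Assumption: uppercase letters in 0xC0-0xDE map to the byte 0x20 higher,
 * except 0xD7 (multiplication sign), which has no lowercase form.
 */
static unsigned char cp1252_tolower_model(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        return (unsigned char) (ch + ('a' - 'A'));
    if (ch >= 0xC0 && ch <= 0xDE && ch != 0xD7)
        return (unsigned char) (ch + 0x20);
    return ch;
}

/* Downcase a buffer one byte at a time, ignoring multibyte sequences. */
static void bytewise_downcase(unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = cp1252_tolower_model(buf[i]);
}
```

Feeding it the UTF-8 bytes of "érték" (C3 A9 72 74 C3 A9 6B) turns each lead
byte C3 into E3 and leaves A9 alone, so the result is no longer valid UTF-8.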

The same issue is reported below:

Database object names and libpq in UTF-8 locale on Windows
http://permalink.gmane.org/gmane.comp.db.postgresql.sql/29464

Solution 1: tolower() only A-Z.
Solution 2: use a lowercase function that uses client_encoding
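
As a rough illustration of Solution 1, the sketch below lowercases only
ASCII A-Z and copies every other byte unchanged, so multibyte UTF-8
sequences stay intact. The function name ascii_downcase() is hypothetical;
this is not a patch against scansup.c, just the idea in isolation.

```c
#include <stddef.h>

/*
 * Lowercase only ASCII 'A'..'Z' in place. Bytes with the high bit set are
 * left untouched, so no locale-dependent tolower() call is involved.
 */
static void ascii_downcase(char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) buf[i];

        if (ch >= 'A' && ch <= 'Z')
            ch = (unsigned char) (ch + ('a' - 'A'));
        buf[i] = (char) ch;
    }
}
```

The trade-off is that non-ASCII uppercase letters are not folded at all, so
unquoted ÉRTÉK and érték would then be different identifiers.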
