Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS - Mailing list pgsql-hackers

From Jeevan Chalke
Subject Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
Msg-id BANLkTimJWsSxko3HU-qsGnNR4Hk8u5eHvA@mail.gmail.com
List pgsql-hackers
Hi Tom,

The issue occurs on Windows.

As the attached failure.out file shows (produced by running failure.sql), we get the error
'ERROR:  invalid byte sequence for encoding "UTF8": 0xe59aff'. Note that the byte
sequence returned by the database is e5 9a ff, whereas the actual byte sequence for
the wide character '功' is e5 8a 9f.


'功'      ==> Unicode character
e5 8a 9f  ==> original byte sequence for this character
e5 9a ff  ==> result of downcase_truncate_identifier(); this is not valid UTF8, yet it is what gets stored in the pg_catalog table

When the identifier is displayed on the client, we receive this invalid byte sequence, which raises the error. UTF8 defines a valid range for each byte of a character, and the pg_utf8_islegal() function checks these ranges. Here is the relevant code snippet:

==
    a = source[2];
    if (a < 0x80 || a > 0xBF)
        return false;
==
Here source[2] = ff, which falls outside the valid range, so the sequence is rejected as an illegal UTF8 character. The original third byte, 9f, falls within the range.
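To make the range check concrete, here is a minimal sketch that applies the same continuation-byte test to both byte sequences. It is not the PostgreSQL source: the names is_continuation_byte(), three_byte_seq_plausible() and check_examples() are hypothetical, and the check is simplified (it ignores the overlong and surrogate rules that a full validator would also enforce).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: true if b is a UTF8 continuation byte (0x80..0xBF). */
static bool is_continuation_byte(unsigned char b)
{
    return b >= 0x80 && b <= 0xBF;
}

/*
 * Simplified plausibility check for a 3-byte UTF8 sequence:
 * a lead byte in 0xE0..0xEF followed by two continuation bytes.
 * Overlong and surrogate checks are deliberately omitted.
 */
static bool three_byte_seq_plausible(const unsigned char *s)
{
    if (s[0] < 0xE0 || s[0] > 0xEF)
        return false;
    return is_continuation_byte(s[1]) && is_continuation_byte(s[2]);
}

/* Checks the two sequences discussed in this message. */
static void check_examples(void)
{
    const unsigned char original[] = {0xE5, 0x8A, 0x9F}; /* bytes of the original character */
    const unsigned char stored[]   = {0xE5, 0x9A, 0xFF}; /* bytes read back from the catalog */

    assert(three_byte_seq_plausible(original));
    assert(!three_byte_seq_plausible(stored)); /* third byte 0xFF > 0xBF */
}
```

Note that the second byte of the stored sequence, 9a, is still a legal continuation byte; only the third byte, ff, makes the sequence fail.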

Since we fold the identifier to lower case using the downcase_truncate_identifier() function, the solution is to make this function wide-char aware, like the LOWER() function.
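The following sketch models how byte-at-a-time case folding can turn e5 8a 9f into e5 9a ff. It is an illustration under stated assumptions, not the PostgreSQL code: cp1252_like_tolower() is a hypothetical stand-in for tolower() under a single-byte locale that behaves like Windows-1252, where 0x8A maps to 0x9A and 0x9F maps to 0xFF. The message does not say which locale was in effect, so that choice is an assumption. downcase_ascii_only() is shown only for contrast; it is not the wide-char-aware fix proposed above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Assumption: stand-in for tolower() in a Windows-1252-like locale.
 * Only ASCII letters and the two high-bit mappings relevant here are modeled.
 */
static unsigned char cp1252_like_tolower(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        return (unsigned char) (ch + ('a' - 'A'));
    if (ch == 0x8A)             /* upper-case letter in Windows-1252 */
        return 0x9A;
    if (ch == 0x9F)             /* upper-case letter in Windows-1252 */
        return 0xFF;
    return ch;
}

/* Folds every byte independently, with no notion of multibyte characters. */
static void downcase_bytewise(unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        s[i] = cp1252_like_tolower(s[i]);
}

/* Contrast: folds only ASCII A-Z and leaves all high-bit bytes untouched. */
static void downcase_ascii_only(unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] = (unsigned char) (s[i] + ('a' - 'A'));
}
```

Applied to the bytes 'F' 'O' e5 8a 9f, downcase_bytewise() yields 'f' 'o' e5 9a ff, which matches the corrupted sequence reported above, while downcase_ascii_only() yields 'f' 'o' e5 8a 9f and keeps the multibyte character intact.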

I see there was some discussion about making downcase_truncate_identifier() wide-char aware, but it seems to have been dropped somewhere.
(http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php)
This invalid byte sequence issue seems more serious, because it might lead to, e.g., pg_dump failures.

I have tested this on PG9.0 beta4 (one-click installers). BTW, we have
observed the same with earlier versions as well.

Attached are the .sql file and its output (run on PG9.0 beta4).

Any thoughts???

Thanks

--
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company