Thread: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

Hi Tom,

Issue is on Windows:

If you see in attached failure.out file, (after running failure.sql) we are getting "ERROR:  invalid
byte sequence for encoding "UTF8": 0xe59aff" error. Please note that byte
sequence we got from database is e5 9a ff, where as actual byte sequence for
the wide character '功' is e5 8a 9f.


'功'      ==> UNICODE Character
e5 8a 9f  ==> Original Byte Sequence for the given characters
e5 9a ff  ==> downcase_truncate_identifier() result, which is invalid UTF8 representation, stored in pg_catalog table

While displaying on client, we receive this invalid byte sequence which throws an error. Note that UTF8 characters have predefined character ranges for each byte which is checked in pg_utf8_islegal() function. Here is the code snippet:

==
    a = source[2];
    if (a < 0x80 || a > 0xBF)
        return false;
==
Note that source[2] = ff, which does not fall into the valid range which results in illegal UTF8 character sequence. If you carefully see the original one i.e. 9f, which falls within the range.

since we smash the identifier to lower case using downcase_truncate_identifier() function, the solution is to make this function should be wide-char aware, like LOWER() function functionality.

I see some discussion related to downcase_truncate_identifier() and wide-char aware function, but seems like we lost somewhere.
(http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php)
This invalid byte sequence issue seems like a more serious issue, because it might lead e.g to pg_dump failures.

I have tested this on PG9.0 beta4 (one click installers), BTW, we have
observed same with earlier version as well.

Attached is the .sql and its output (run on PG9.0 beta4).

Any thoughts???

Thanks

--
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.
Attachment
2011/6/7 Jeevan Chalke <jeevan.chalke@enterprisedb.com>:
> since we smash the identifier to lower case using
> downcase_truncate_identifier() function, the solution is to make this
> function should be wide-char aware, like LOWER() function functionality.
>
> I see some discussion related to downcase_truncate_identifier() and
> wide-char aware function, but seems like we lost somewhere.
> (http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php)
> This invalid byte sequence issue seems like a more serious issue, because it
> might lead e.g to pg_dump failures.

It's a problem, but without an efficient algorithm for Unicode case
folding, any fix we attempt to implement seems like it'll just be
moving the problem around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




On Wed, Jun 8, 2011 at 6:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
2011/6/7 Jeevan Chalke <jeevan.chalke@enterprisedb.com>:
> since we smash the identifier to lower case using
> downcase_truncate_identifier() function, the solution is to make this
> function should be wide-char aware, like LOWER() function functionality.
>
> I see some discussion related to downcase_truncate_identifier() and
> wide-char aware function, but seems like we lost somewhere.
> (http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php)
> This invalid byte sequence issue seems like a more serious issue, because it
> might lead e.g to pg_dump failures.

It's a problem, but without an efficient algorithm for Unicode case
folding, any fix we attempt to implement seems like it'll just be
moving the problem around.

Agree.

I read on other mail thread that str_tolower() is a  wide-character-aware lower function but it is also a collation-aware and hence might change its behaviour wrt change in locale. However, Tom suggested that we need to have non-locale-dependent case folding algorithm.

But still for same locale on same machine, where we can able to create a table, insert some data, we cannot retrieve it. Don't you think it is more serious and we need a quick solution here? As said earlier it may even lead to pg_dump failures. Given that str_tolower() functionality is locale dependent but still it will resolve this particular issue. Not sure, there might be a performance issue but at-least we are not giving an error.

Please excuse me, if community already had a lot of discussion and kept this behaviour intentionally knowing all these errors and serious issues.

Thanks



--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.
On Thu, Jun 9, 2011 at 12:39 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>> It's a problem, but without an efficient algorithm for Unicode case
>> folding, any fix we attempt to implement seems like it'll just be
>> moving the problem around.
>
> Agree.
>
> I read on other mail thread that str_tolower() is a  wide-character-aware
> lower function but it is also a collation-aware and hence might change its
> behaviour wrt change in locale. However, Tom suggested that we need to have
> non-locale-dependent case folding algorithm.
>
> But still for same locale on same machine, where we can able to create a
> table, insert some data, we cannot retrieve it. Don't you think it is more
> serious and we need a quick solution here? As said earlier it may even lead
> to pg_dump failures. Given that str_tolower() functionality is locale
> dependent but still it will resolve this particular issue. Not sure, there
> might be a performance issue but at-least we are not giving an error.

Well, as I understand it, the problem here is that if someone goes and
changes the locale, then you might massively break the user's
application.  For example, if the user says:

CREATE TABLE FOO (...);
SELECT * FROM FOO;

...that'll work, of course, because whatever you get when you downcase
FOO will be the same both times.  But if the locale now changes, then
the next...

SELECT * FROM FOO;

...might fail, because the new downcasing of FOO might not match the old one.

You could argue that that's better than the current situation, but
it's not clear-cut.

But now that I re-think about it, I guess what I'm confused about is
this code here:
               if (ch >= 'A' && ch <= 'Z')                       ch += 'a' - 'A';               else if
(IS_HIGHBIT_SET(ch)&& isupper(ch))                       ch = tolower(ch);               result[i] = (char) ch; 

It seems to me that we're downcasing the first byte of each wide
character and ignoring the rest... which seems like it can't possibly
be a good idea in a multi-byte encoding.  Perhaps we could keep that
approach for single-byte encodings and just pass through multi-byte
characters untouched?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> But now that I re-think about it, I guess what I'm confused about is
> this code here:

>                 if (ch >= 'A' && ch <= 'Z')
>                         ch += 'a' - 'A';
>                 else if (IS_HIGHBIT_SET(ch) && isupper(ch))
>                         ch = tolower(ch);
>                 result[i] = (char) ch;

The expected behavior there is that case-folding of non-ASCII characters
will occur in single-byte encodings but nothing will happen to
multi-byte characters.  We are relying on isupper() to not return true
when presented with a character fragment in a multibyte locale.
        regards, tom lane


On Thu, Jun 9, 2011 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> But now that I re-think about it, I guess what I'm confused about is
>> this code here:
>
>>                 if (ch >= 'A' && ch <= 'Z')
>>                         ch += 'a' - 'A';
>>                 else if (IS_HIGHBIT_SET(ch) && isupper(ch))
>>                         ch = tolower(ch);
>>                 result[i] = (char) ch;
>
> The expected behavior there is that case-folding of non-ASCII characters
> will occur in single-byte encodings but nothing will happen to
> multi-byte characters.  We are relying on isupper() to not return true
> when presented with a character fragment in a multibyte locale.

Based on Jeevan's original message, it seems like that's not always
the case, at least on Windows.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 9, 2011 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We are relying on isupper() to not return true
>> when presented with a character fragment in a multibyte locale.

> Based on Jeevan's original message, it seems like that's not always
> the case, at least on Windows.

Hmm.  Now that you mention it, I think the same has been said about OSX.

If we need to work around brain-dead isupper() tests, maybe the best
thing is to implement two versions of the loop:
if (encoding is single byte)    ... loop as it stands ...else    ... loop without the "else if" part
        regards, tom lane


On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jun 9, 2011 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> We are relying on isupper() to not return true
>>> when presented with a character fragment in a multibyte locale.
>
>> Based on Jeevan's original message, it seems like that's not always
>> the case, at least on Windows.
>
> Hmm.  Now that you mention it, I think the same has been said about OSX.
>
> If we need to work around brain-dead isupper() tests, maybe the best
> thing is to implement two versions of the loop:
>
>        if (encoding is single byte)
>                ... loop as it stands ...
>        else
>                ... loop without the "else if" part

That seems like a clear improvement.  It's a long way from perfect,
but still worthwhile.

Would we back-patch that?  Just do it in master?  Wait for 9.2?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If we need to work around brain-dead isupper() tests, maybe the best
>> thing is to implement two versions of the loop:
>> 
>>        if (encoding is single byte)
>>                ... loop as it stands ...
>>        else
>>                ... loop without the "else if" part

> That seems like a clear improvement.  It's a long way from perfect,
> but still worthwhile.

> Would we back-patch that?  Just do it in master?  Wait for 9.2?

It looks to me like a portability bug fix, so I'd say back-patch.

I'll take care of it, unless you're hot to do so?
        regards, tom lane


On Thu, Jun 9, 2011 at 10:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> If we need to work around brain-dead isupper() tests, maybe the best
>>> thing is to implement two versions of the loop:
>>>
>>>        if (encoding is single byte)
>>>                ... loop as it stands ...
>>>        else
>>>                ... loop without the "else if" part
>
>> That seems like a clear improvement.  It's a long way from perfect,
>> but still worthwhile.
>
>> Would we back-patch that?  Just do it in master?  Wait for 9.2?
>
> It looks to me like a portability bug fix, so I'd say back-patch.
>
> I'll take care of it, unless you're hot to do so?

Nope, have at it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If we need to work around brain-dead isupper() tests, maybe the best
>> thing is to implement two versions of the loop:
>>        if (encoding is single byte)

> That seems like a clear improvement.  It's a long way from perfect,
> but still worthwhile.

Hmm ... while the above is easy enough to do in the backend, where we
can look at pg_database_encoding_max_length, we have also got instances
of this coding pattern in src/port/pgstrcasecmp.c.  It's a lot less
obvious how to make the test in frontend environments.  Thoughts anyone?
        regards, tom lane


On Thu, Jun 9, 2011 at 11:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> If we need to work around brain-dead isupper() tests, maybe the best
>>> thing is to implement two versions of the loop:
>>>        if (encoding is single byte)
>
>> That seems like a clear improvement.  It's a long way from perfect,
>> but still worthwhile.
>
> Hmm ... while the above is easy enough to do in the backend, where we
> can look at pg_database_encoding_max_length, we have also got instances
> of this coding pattern in src/port/pgstrcasecmp.c.  It's a lot less
> obvious how to make the test in frontend environments.  Thoughts anyone?

I'm not sure if this helps at all, but an awful lot of those tests are
against hard-coded strings that are known to contain only ASCII
characters.  Is there some way we can optimize this for that case?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 9, 2011 at 11:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm ... while the above is easy enough to do in the backend, where we
>> can look at pg_database_encoding_max_length, we have also got instances
>> of this coding pattern in src/port/pgstrcasecmp.c. �It's a lot less
>> obvious how to make the test in frontend environments. �Thoughts anyone?

> I'm not sure if this helps at all, but an awful lot of those tests are
> against hard-coded strings that are known to contain only ASCII
> characters.  Is there some way we can optimize this for that case?

For the places where we're just looking for a match to a fixed all-ASCII
string, an ASCII-only downcasing would be sufficient, and would
eliminate the whole problem.  But I doubt all the callers fall into that
class.

What I'm particularly worried about at the moment is whether we are
assuming anywhere that the frontend side can duplicate the backend's
identifier downcasing behavior.  That seems like a complete morass,
because (1) they might not have the same locale, (2) they might not
have the same encoding, (3) even if they do, the "same" locale is known
to behave differently on different platforms.
        regards, tom lane


On Thu, Jun 9, 2011 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jun 9, 2011 at 11:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Hmm ... while the above is easy enough to do in the backend, where we
>>> can look at pg_database_encoding_max_length, we have also got instances
>>> of this coding pattern in src/port/pgstrcasecmp.c.  It's a lot less
>>> obvious how to make the test in frontend environments.  Thoughts anyone?
>
>> I'm not sure if this helps at all, but an awful lot of those tests are
>> against hard-coded strings that are known to contain only ASCII
>> characters.  Is there some way we can optimize this for that case?
>
> For the places where we're just looking for a match to a fixed all-ASCII
> string, an ASCII-only downcasing would be sufficient, and would
> eliminate the whole problem.  But I doubt all the callers fall into that
> class.
>
> What I'm particularly worried about at the moment is whether we are
> assuming anywhere that the frontend side can duplicate the backend's
> identifier downcasing behavior.  That seems like a complete morass,
> because (1) they might not have the same locale, (2) they might not
> have the same encoding, (3) even if they do, the "same" locale is known
> to behave differently on different platforms.

Right.  Understood.  So let's look at the cases (from git grep
pg_strcasecmp and pg_strncasecmp):

contrib/dict_int: Fixed strings only, and it's all backend code anyway.
contrib/dict_xsyn: Fixed strings only, and it's all backend code anyway.
contrib/hstore: Fixed strings only, and it's all backend code anyway.
contrib/pg_upgrade: Used to compare LC_COLLATE, LC_CTYPE, and encoding names.
contrib/pgbench: Definitely front-end code, but it's all fixed strings.
contrib/pgcrypto: All fixed strings except for one instance in
px_find_digit.  But it's all backend
contrib/spi: One instance, not a fixed string, but it's backend code.
contrib/unaccent: One instance, not a fixed string, but it's backend code.
src/backend/*: Backend code, obviously.
src/bin/initdb: Strings from a constant lookup table
(tsearch_config_languages) only.
src/bin/pg_basebackup: Fixed strings only.
src/bin/pg_ctl: Fixed strings only.
src/bin/pg_dump: Fixed strings only.
src/bin/psql: Fixed strings only.  In a couple of cases they are not
constants - help.c uses strings from to generated file sql_help.h, and
tab-complete.c uses strings from a constant array called
words_after_create[].  But these are constant lookup tables.
src/include: access/reloptions.h uses strncasecmp() as part of a
macro.  That should be OK as long as no one tries to include this in
frontend code, which seems rather impractical.
src/interfaces/ecpg/ecpglib: Fixed strings.
src/interfaces/ecpg/pgtypeslib: Fixed strings, and strings from a
constant lookup table, only.
src/interfaces/ecpg/preproc: This looks a bit worrisome.  It seems we
might be using it on identifiers here.
src/interfaces/libpq: This is attempting to match a wildcard
certificate name against a hostname, in two different places.
src/port/chklocale.c: Fixed strings or ones from a lookup table.
src/timezone/pgtz.c: Matches input strings against filenames read from the OS.

So mostly I think these are OK.  The instance in
src/interfaces/ecpg/preproc looks like the most likely candidate for a
problem spot.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> Right.  Understood.  So let's look at the cases (from git grep
> pg_strcasecmp and pg_strncasecmp):

Also pg_toupper and pg_tolower.  Right offhand, it looks like psql
*does* assume it can lower-case identifiers this way :-(
        regards, tom lane


On Thu, Jun 9, 2011 at 2:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Right.  Understood.  So let's look at the cases (from git grep
>> pg_strcasecmp and pg_strncasecmp):
>
> Also pg_toupper and pg_tolower.  Right offhand, it looks like psql
> *does* assume it can lower-case identifiers this way :-(

Blarg.  I dunno what to do about that - but surely we must have the
encoding available somewhere?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 9, 2011 at 2:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Right. �Understood. �So let's look at the cases (from git grep
>>> pg_strcasecmp and pg_strncasecmp):

>> Also pg_toupper and pg_tolower. �Right offhand, it looks like psql
>> *does* assume it can lower-case identifiers this way :-(

> Blarg.  I dunno what to do about that - but surely we must have the
> encoding available somewhere?

Well, psql does at least, but changing all the callers to pass encoding
info would be a PITA.

The idea I'd been mulling was to keep a static variable in
pgstrcasecmp.c, which would control whether to allow the isupper/tolower
path to be tried.  The static initializer value would be false, meaning
you get ASCII-only case folding by default.  We'd also add a callable
function to set the flag, which could be called during startup by
whatever bit of logic is aware of the encoding we're using.  psql,
pg_dump, and the backend could do that; offhand I suspect it doesn't
matter for anything else.

However, this is all still dependent on the assumption that our idea of
the encoding matches the libc locale setting.  I'm prepared to believe
that we have that locked down pretty well now in the backend, but I
don't think I believe it for either psql or pg_dump.

There's also the meta-problem that psql's locale might not match the
backend's, leading to wrong case folding of identifiers compared to what
the backend does, even if it's entirely correct for psql's locale.  I'm
not sure that's a huge deal for psql, since most likely anybody who's
typing non-ASCII identifiers has taken the trouble to set the locale the
way she wants, and anyway the effects would only be seen in interactive
commands and so are easily recovered from.  However, if pg_dump is
trying to do this, the possible downsides seem a lot worse.
        regards, tom lane