Thread: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Jeevan Chalke
Date:
Hi Tom,
Issue is on Windows:
If you see in attached failure.out file, (after running failure.sql) we are getting "ERROR: invalid
byte sequence for encoding "UTF8": 0xe59aff" error. Please note that byte
sequence we got from database is e5 9a ff, where as actual byte sequence for
the wide character '功' is e5 8a 9f.
'功' ==> UNICODE Character
e5 8a 9f ==> Original Byte Sequence for the given characters
e5 9a ff ==> downcase_truncate_identifier() result, which is invalid UTF8 representation, stored in pg_catalog table
While displaying on client, we receive this invalid byte sequence which throws an error. Note that UTF8 characters have predefined character ranges for each byte which is checked in pg_utf8_islegal() function. Here is the code snippet:
==
a = source[2];
if (a < 0x80 || a > 0xBF)
return false;
==
Note that source[2] = ff, which does not fall into the valid range which results in illegal UTF8 character sequence. If you carefully see the original one i.e. 9f, which falls within the range.
since we smash the identifier to lower case using downcase_truncate_identifier() function, the solution is to make this function should be wide-char aware, like LOWER() function functionality.
I see some discussion related to downcase_truncate_identifier() and wide-char aware function, but seems like we lost somewhere.
(http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php)
This invalid byte sequence issue seems like a more serious issue, because it might lead e.g to pg_dump failures.
I have tested this on PG9.0 beta4 (one click installers), BTW, we have
observed same with earlier version as well.
Attached is the .sql and its output (run on PG9.0 beta4).
Any thoughts???
Thanks
--
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Phone: +91 20 30589500
Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb
This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.
Issue is on Windows:
If you see in attached failure.out file, (after running failure.sql) we are getting "ERROR: invalid
byte sequence for encoding "UTF8": 0xe59aff" error. Please note that byte
sequence we got from database is e5 9a ff, where as actual byte sequence for
the wide character '功' is e5 8a 9f.
'功' ==> UNICODE Character
e5 8a 9f ==> Original Byte Sequence for the given characters
e5 9a ff ==> downcase_truncate_identifier() result, which is invalid UTF8 representation, stored in pg_catalog table
While displaying on client, we receive this invalid byte sequence which throws an error. Note that UTF8 characters have predefined character ranges for each byte which is checked in pg_utf8_islegal() function. Here is the code snippet:
==
a = source[2];
if (a < 0x80 || a > 0xBF)
return false;
==
Note that source[2] = ff, which does not fall into the valid range which results in illegal UTF8 character sequence. If you carefully see the original one i.e. 9f, which falls within the range.
since we smash the identifier to lower case using downcase_truncate_identifier() function, the solution is to make this function should be wide-char aware, like LOWER() function functionality.
I see some discussion related to downcase_truncate_identifier() and wide-char aware function, but seems like we lost somewhere.
(http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php)
This invalid byte sequence issue seems like a more serious issue, because it might lead e.g to pg_dump failures.
I have tested this on PG9.0 beta4 (one click installers), BTW, we have
observed same with earlier version as well.
Attached is the .sql and its output (run on PG9.0 beta4).
Any thoughts???
Thanks
--
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Phone: +91 20 30589500
Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb
This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.
Attachment
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Robert Haas
Date:
2011/6/7 Jeevan Chalke <jeevan.chalke@enterprisedb.com>: > since we smash the identifier to lower case using > downcase_truncate_identifier() function, the solution is to make this > function should be wide-char aware, like LOWER() function functionality. > > I see some discussion related to downcase_truncate_identifier() and > wide-char aware function, but seems like we lost somewhere. > (http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php) > This invalid byte sequence issue seems like a more serious issue, because it > might lead e.g to pg_dump failures. It's a problem, but without an efficient algorithm for Unicode case folding, any fix we attempt to implement seems like it'll just be moving the problem around. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Jeevan Chalke
Date:
On Wed, Jun 8, 2011 at 6:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Agree.
I read on other mail thread that str_tolower() is a wide-character-aware lower function but it is also a collation-aware and hence might change its behaviour wrt change in locale. However, Tom suggested that we need to have non-locale-dependent case folding algorithm.
But still for same locale on same machine, where we can able to create a table, insert some data, we cannot retrieve it. Don't you think it is more serious and we need a quick solution here? As said earlier it may even lead to pg_dump failures. Given that str_tolower() functionality is locale dependent but still it will resolve this particular issue. Not sure, there might be a performance issue but at-least we are not giving an error.
Please excuse me, if community already had a lot of discussion and kept this behaviour intentionally knowing all these errors and serious issues.
Thanks
2011/6/7 Jeevan Chalke <jeevan.chalke@enterprisedb.com>:> since we smash the identifier to lower case usingIt's a problem, but without an efficient algorithm for Unicode case
> downcase_truncate_identifier() function, the solution is to make this
> function should be wide-char aware, like LOWER() function functionality.
>
> I see some discussion related to downcase_truncate_identifier() and
> wide-char aware function, but seems like we lost somewhere.
> (http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php)
> This invalid byte sequence issue seems like a more serious issue, because it
> might lead e.g to pg_dump failures.
folding, any fix we attempt to implement seems like it'll just be
moving the problem around.
Agree.
I read on other mail thread that str_tolower() is a wide-character-aware lower function but it is also a collation-aware and hence might change its behaviour wrt change in locale. However, Tom suggested that we need to have non-locale-dependent case folding algorithm.
But still for same locale on same machine, where we can able to create a table, insert some data, we cannot retrieve it. Don't you think it is more serious and we need a quick solution here? As said earlier it may even lead to pg_dump failures. Given that str_tolower() functionality is locale dependent but still it will resolve this particular issue. Not sure, there might be a performance issue but at-least we are not giving an error.
Please excuse me, if community already had a lot of discussion and kept this behaviour intentionally knowing all these errors and serious issues.
Thanks
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Phone: +91 20 30589500
Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb
This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Robert Haas
Date:
On Thu, Jun 9, 2011 at 12:39 AM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote: >> It's a problem, but without an efficient algorithm for Unicode case >> folding, any fix we attempt to implement seems like it'll just be >> moving the problem around. > > Agree. > > I read on other mail thread that str_tolower() is a wide-character-aware > lower function but it is also a collation-aware and hence might change its > behaviour wrt change in locale. However, Tom suggested that we need to have > non-locale-dependent case folding algorithm. > > But still for same locale on same machine, where we can able to create a > table, insert some data, we cannot retrieve it. Don't you think it is more > serious and we need a quick solution here? As said earlier it may even lead > to pg_dump failures. Given that str_tolower() functionality is locale > dependent but still it will resolve this particular issue. Not sure, there > might be a performance issue but at-least we are not giving an error. Well, as I understand it, the problem here is that if someone goes and changes the locale, then you might massively break the user's application. For example, if the user says: CREATE TABLE FOO (...); SELECT * FROM FOO; ...that'll work, of course, because whatever you get when you downcase FOO will be the same both times. But if the locale now changes, then the next... SELECT * FROM FOO; ...might fail, because the new downcasing of FOO might not match the old one. You could argue that that's better than the current situation, but it's not clear-cut. But now that I re-think about it, I guess what I'm confused about is this code here: if (ch >= 'A' && ch <= 'Z') ch += 'a' - 'A'; else if (IS_HIGHBIT_SET(ch)&& isupper(ch)) ch = tolower(ch); result[i] = (char) ch; It seems to me that we're downcasing the first byte of each wide character and ignoring the rest... which seems like it can't possibly be a good idea in a multi-byte encoding. Perhaps we could keep that approach for single-byte encodings and just pass through multi-byte characters untouched? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > But now that I re-think about it, I guess what I'm confused about is > this code here: > if (ch >= 'A' && ch <= 'Z') > ch += 'a' - 'A'; > else if (IS_HIGHBIT_SET(ch) && isupper(ch)) > ch = tolower(ch); > result[i] = (char) ch; The expected behavior there is that case-folding of non-ASCII characters will occur in single-byte encodings but nothing will happen to multi-byte characters. We are relying on isupper() to not return true when presented with a character fragment in a multibyte locale. regards, tom lane
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Robert Haas
Date:
On Thu, Jun 9, 2011 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> But now that I re-think about it, I guess what I'm confused about is >> this code here: > >> if (ch >= 'A' && ch <= 'Z') >> ch += 'a' - 'A'; >> else if (IS_HIGHBIT_SET(ch) && isupper(ch)) >> ch = tolower(ch); >> result[i] = (char) ch; > > The expected behavior there is that case-folding of non-ASCII characters > will occur in single-byte encodings but nothing will happen to > multi-byte characters. We are relying on isupper() to not return true > when presented with a character fragment in a multibyte locale. Based on Jeevan's original message, it seems like that's not always the case, at least on Windows. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 9, 2011 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> We are relying on isupper() to not return true >> when presented with a character fragment in a multibyte locale. > Based on Jeevan's original message, it seems like that's not always > the case, at least on Windows. Hmm. Now that you mention it, I think the same has been said about OSX. If we need to work around brain-dead isupper() tests, maybe the best thing is to implement two versions of the loop: if (encoding is single byte) ... loop as it stands ...else ... loop without the "else if" part regards, tom lane
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Robert Haas
Date:
On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Jun 9, 2011 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> We are relying on isupper() to not return true >>> when presented with a character fragment in a multibyte locale. > >> Based on Jeevan's original message, it seems like that's not always >> the case, at least on Windows. > > Hmm. Now that you mention it, I think the same has been said about OSX. > > If we need to work around brain-dead isupper() tests, maybe the best > thing is to implement two versions of the loop: > > if (encoding is single byte) > ... loop as it stands ... > else > ... loop without the "else if" part That seems like a clear improvement. It's a long way from perfect, but still worthwhile. Would we back-patch that? Just do it in master? Wait for 9.2? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If we need to work around brain-dead isupper() tests, maybe the best >> thing is to implement two versions of the loop: >> >> if (encoding is single byte) >> ... loop as it stands ... >> else >> ... loop without the "else if" part > That seems like a clear improvement. It's a long way from perfect, > but still worthwhile. > Would we back-patch that? Just do it in master? Wait for 9.2? It looks to me like a portability bug fix, so I'd say back-patch. I'll take care of it, unless you're hot to do so? regards, tom lane
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Robert Haas
Date:
On Thu, Jun 9, 2011 at 10:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> If we need to work around brain-dead isupper() tests, maybe the best >>> thing is to implement two versions of the loop: >>> >>> if (encoding is single byte) >>> ... loop as it stands ... >>> else >>> ... loop without the "else if" part > >> That seems like a clear improvement. It's a long way from perfect, >> but still worthwhile. > >> Would we back-patch that? Just do it in master? Wait for 9.2? > > It looks to me like a portability bug fix, so I'd say back-patch. > > I'll take care of it, unless you're hot to do so? Nope, have at it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If we need to work around brain-dead isupper() tests, maybe the best >> thing is to implement two versions of the loop: >> if (encoding is single byte) > That seems like a clear improvement. It's a long way from perfect, > but still worthwhile. Hmm ... while the above is easy enough to do in the backend, where we can look at pg_database_encoding_max_length, we have also got instances of this coding pattern in src/port/pgstrcasecmp.c. It's a lot less obvious how to make the test in frontend environments. Thoughts anyone? regards, tom lane
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Robert Haas
Date:
On Thu, Jun 9, 2011 at 11:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> If we need to work around brain-dead isupper() tests, maybe the best >>> thing is to implement two versions of the loop: >>> if (encoding is single byte) > >> That seems like a clear improvement. It's a long way from perfect, >> but still worthwhile. > > Hmm ... while the above is easy enough to do in the backend, where we > can look at pg_database_encoding_max_length, we have also got instances > of this coding pattern in src/port/pgstrcasecmp.c. It's a lot less > obvious how to make the test in frontend environments. Thoughts anyone? I'm not sure if this helps at all, but an awful lot of those tests are against hard-coded strings that are known to contain only ASCII characters. Is there some way we can optimize this for that case? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 9, 2011 at 11:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hmm ... while the above is easy enough to do in the backend, where we >> can look at pg_database_encoding_max_length, we have also got instances >> of this coding pattern in src/port/pgstrcasecmp.c. �It's a lot less >> obvious how to make the test in frontend environments. �Thoughts anyone? > I'm not sure if this helps at all, but an awful lot of those tests are > against hard-coded strings that are known to contain only ASCII > characters. Is there some way we can optimize this for that case? For the places where we're just looking for a match to a fixed all-ASCII string, an ASCII-only downcasing would be sufficient, and would eliminate the whole problem. But I doubt all the callers fall into that class. What I'm particularly worried about at the moment is whether we are assuming anywhere that the frontend side can duplicate the backend's identifier downcasing behavior. That seems like a complete morass, because (1) they might not have the same locale, (2) they might not have the same encoding, (3) even if they do, the "same" locale is known to behave differently on different platforms. regards, tom lane
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Robert Haas
Date:
On Thu, Jun 9, 2011 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Jun 9, 2011 at 11:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Hmm ... while the above is easy enough to do in the backend, where we >>> can look at pg_database_encoding_max_length, we have also got instances >>> of this coding pattern in src/port/pgstrcasecmp.c. It's a lot less >>> obvious how to make the test in frontend environments. Thoughts anyone? > >> I'm not sure if this helps at all, but an awful lot of those tests are >> against hard-coded strings that are known to contain only ASCII >> characters. Is there some way we can optimize this for that case? > > For the places where we're just looking for a match to a fixed all-ASCII > string, an ASCII-only downcasing would be sufficient, and would > eliminate the whole problem. But I doubt all the callers fall into that > class. > > What I'm particularly worried about at the moment is whether we are > assuming anywhere that the frontend side can duplicate the backend's > identifier downcasing behavior. That seems like a complete morass, > because (1) they might not have the same locale, (2) they might not > have the same encoding, (3) even if they do, the "same" locale is known > to behave differently on different platforms. Right. Understood. So let's look at the cases (from git grep pg_strcasecmp and pg_strncasecmp): contrib/dict_int: Fixed strings only, and it's all backend code anyway. contrib/dict_xsyn: Fixed strings only, and it's all backend code anyway. contrib/hstore: Fixed strings only, and it's all backend code anyway. contrib/pg_upgrade: Used to compare LC_COLLATE, LC_CTYPE, and encoding names. contrib/pgbench: Definitely front-end code, but it's all fixed strings. contrib/pgcrypto: All fixed strings except for one instance in px_find_digit. But it's all backend contrib/spi: One instance, not a fixed string, but it's backend code. contrib/unaccent: One instance, not a fixed string, but it's backend code. src/backend/*: Backend code, obviously. src/bin/initdb: Strings from a constant lookup table (tsearch_config_languages) only. src/bin/pg_basebackup: Fixed strings only. src/bin/pg_ctl: Fixed strings only. src/bin/pg_dump: Fixed strings only. src/bin/psql: Fixed strings only. In a couple of cases they are not constants - help.c uses strings from to generated file sql_help.h, and tab-complete.c uses strings from a constant array called words_after_create[]. But these are constant lookup tables. src/include: access/reloptions.h uses strncasecmp() as part of a macro. That should be OK as long as no one tries to include this in frontend code, which seems rather impractical. src/interfaces/ecpg/ecpglib: Fixed strings. src/interfaces/ecpg/pgtypeslib: Fixed strings, and strings from a constant lookup table, only. src/interfaces/ecpg/preproc: This looks a bit worrisome. It seems we might be using it on identifiers here. src/interfaces/libpq: This is attempting to match a wildcard certificate name against a hostname, in two different places. src/port/chklocale.c: Fixed strings or ones from a lookup table. src/timezone/pgtz.c: Matches input strings against filenames read from the OS. So mostly I think these are OK. The instance in src/interfaces/ecpg/preproc looks like the most likely candidate for a problem spot. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > Right. Understood. So let's look at the cases (from git grep > pg_strcasecmp and pg_strncasecmp): Also pg_toupper and pg_tolower. Right offhand, it looks like psql *does* assume it can lower-case identifiers this way :-( regards, tom lane
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Robert Haas
Date:
On Thu, Jun 9, 2011 at 2:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Right. Understood. So let's look at the cases (from git grep >> pg_strcasecmp and pg_strncasecmp): > > Also pg_toupper and pg_tolower. Right offhand, it looks like psql > *does* assume it can lower-case identifiers this way :-( Blarg. I dunno what to do about that - but surely we must have the encoding available somewhere? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 9, 2011 at 2:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Right. �Understood. �So let's look at the cases (from git grep >>> pg_strcasecmp and pg_strncasecmp): >> Also pg_toupper and pg_tolower. �Right offhand, it looks like psql >> *does* assume it can lower-case identifiers this way :-( > Blarg. I dunno what to do about that - but surely we must have the > encoding available somewhere? Well, psql does at least, but changing all the callers to pass encoding info would be a PITA. The idea I'd been mulling was to keep a static variable in pgstrcasecmp.c, which would control whether to allow the isupper/tolower path to be tried. The static initializer value would be false, meaning you get ASCII-only case folding by default. We'd also add a callable function to set the flag, which could be called during startup by whatever bit of logic is aware of the encoding we're using. psql, pg_dump, and the backend could do that; offhand I suspect it doesn't matter for anything else. However, this is all still dependent on the assumption that our idea of the encoding matches the libc locale setting. I'm prepared to believe that we have that locked down pretty well now in the backend, but I don't think I believe it for either psql or pg_dump. There's also the meta-problem that psql's locale might not match the backend's, leading to wrong case folding of identifiers compared to what the backend does, even if it's entirely correct for psql's locale. I'm not sure that's a huge deal for psql, since most likely anybody who's typing non-ASCII identifiers has taken the trouble to set the locale the way she wants, and anyway the effects would only be seen in interactive commands and so are easily recovered from. However, if pg_dump is trying to do this, the possible downsides seem a lot worse. regards, tom lane