Thread: BUG #2685: Wrong charset of server messages on client [PATCH]
The following bug has been logged online: Bug reference: 2685 Logged by: Sergiy Vyshnevetskiy Email address: serg@vostok.net PostgreSQL version: 8.1 Operating system: FreeBSD-6 stable Description: Wrong charset of server messages on client [PATCH] Details: DESCRIPTION: PostgreSQL backend uses gettext() to localize its messages. The charset of localized messages is determined by LC_CTYPE by default. Then the message is processed through sprintf-like mechanism (with database data as possible arguments) and fed to send_message_to_frontend(), that converts data from _database_charset_(!) to client charset. If LC_CTYPE is not the same as (at least binary compatible to) database charset, then client gets garbage characters in server messages. If database charset is UTF-8, then cluster may recusively generate "invalid byte sequence for encoding" errors till it fills up errordata[ERRORDATA_STACK_SIZE], then it panics. SOLUTION: Convert server messages to database charset. PATCH: --- src/backend/utils/mb/mbutils.c.o0 Tue Oct 10 11:51:13 2006 +++ src/backend/utils/mb/mbutils.c Tue Oct 10 11:49:22 2006 @@ -615,6 +615,7 @@ DatabaseEncoding = &pg_enc2name_tbl[encoding]; Assert(DatabaseEncoding->encoding == encoding); #ifdef USE_ICU + bind_textdomain_codeset("postgres",(&pg_enc2iananame_tbl[encoding])->name); ucnv_setDefaultName((&pg_enc2iananame_tbl[encoding])->name); #endif } This, however, uncovers another bug: PostgreSQL dumps the messages into stderr/syslog as-is, without converting database data from database charset to charset from LC_MESSAGES. After this patch it will do so with message text too. The fix should be trivial - set up a conversion from database charset to server charset. I will post a patch for it later. NOTE: I used pg_enc2iananame_tbl instead of pg_enc2name_tbl, because gettext doesn't accept many Possible TODO: Change PostgreSQL charset names to IANA-standard names.
"Sergiy Vyshnevetskiy" <serg@vostok.net> writes: > Convert server messages to database charset. This has been discussed before: http://archives.postgresql.org/pgsql-patches/2005-08/msg00245.php The magic pg_enc2iananame_tbl[] you reference in your patch does not exist, and if it did exist it wouldn't work on all platforms, since encoding names aren't sufficiently well standardized :-( > This, however, uncovers another bug: PostgreSQL dumps the messages into > stderr/syslog as-is, without converting database data from database charset > to charset from LC_MESSAGES. I'm quite unconvinced that that's a bug. If we tried to do a conversion here, it would be trivial to set up denials of service for logging --- just include a character in a comment in your SQL command that cannot be converted to the LC_MESSAGES character set. regards, tom lane
On Tue, 10 Oct 2006, Sergiy Vyshnevetskiy wrote: > On Tue, 10 Oct 2006, Tom Lane wrote: > >> "Sergiy Vyshnevetskiy" <serg@vostok.net> writes: >> > Convert server messages to database charset. >> >> This has been discussed before: >> http://archives.postgresql.org/pgsql-patches/2005-08/msg00245.php >> >> The magic pg_enc2iananame_tbl[] you reference in your patch does not >> exist, >> and if it did exist it wouldn't work on all platforms, since encoding >> names aren't sufficiently well standardized :-( > > It's not magic, it's from ICU patch. Want me to send you a copy? :) Sorry. I thought it was more well-known. Just looked into gentoo portage - they don't know about it eigther. The patch is here: http://people.freebsd.org/~girgen/postgresql-icu/pg-814-icu-xx-2006-09-25.diff.gz This is the current list of encodings, according to iana: http://www.iana.org/assignments/character-sets
On Tue, 10 Oct 2006, Sergiy Vyshnevetskiy wrote: > On Tue, 10 Oct 2006, Sergiy Vyshnevetskiy wrote: > >> On Tue, 10 Oct 2006, Tom Lane wrote: >> >> > "Sergiy Vyshnevetskiy" <serg@vostok.net> writes: >> > > Convert server messages to database charset. >> > >> > This has been discussed before: >> > http://archives.postgresql.org/pgsql-patches/2005-08/msg00245.php >> > >> > The magic pg_enc2iananame_tbl[] you reference in your patch does not >> > exist, >> > and if it did exist it wouldn't work on all platforms, since encoding >> > names aren't sufficiently well standardized :-( >> >> It's not magic, it's from ICU patch. Want me to send you a copy? :) > > Sorry. I thought it was more well-known. Just looked into gentoo portage - > they don't know about it eigther. > > The patch is here: > > http://people.freebsd.org/~girgen/postgresql-icu/pg-814-icu-xx-2006-09-25.diff.gz > > This is the current list of encodings, according to iana: > > http://www.iana.org/assignments/character-sets ICU homepage is http://icu.sourceforge.net/
Sergiy Vyshnevetskiy <serg@vostok.net> writes: >> It's not magic, it's from ICU patch. Want me to send you a copy? :) You're missing my point, which is that non-ICU locale support doesn't necessarily recognize the same encoding names. We would have done this years ago if we had a solution to that problem. regards, tom lane
On Tue, 10 Oct 2006, Tom Lane wrote: > "Sergiy Vyshnevetskiy" <serg@vostok.net> writes: >> Convert server messages to database charset. > > This has been discussed before: > http://archives.postgresql.org/pgsql-patches/2005-08/msg00245.php > > The magic pg_enc2iananame_tbl[] you reference in your patch does not exist, > and if it did exist it wouldn't work on all platforms, since encoding > names aren't sufficiently well standardized :-( It's not magic, it's from ICU patch. Want me to send you a copy? :) >> This, however, uncovers another bug: PostgreSQL dumps the messages into >> stderr/syslog as-is, without converting database data from database charset >> to charset from LC_MESSAGES. > > I'm quite unconvinced that that's a bug. If we tried to do a conversion > here, it would be trivial to set up denials of service for logging --- > just include a character in a comment in your SQL command that cannot be > converted to the LC_MESSAGES character set. They have to be printed as escape sequences. I think that dumping raw string data in log without converting them to printable form can be used to mess up log viewer at least. (At most this can be a security breach.) Having row multibyte characters mixed with characters in LC_CTYPE in the log makes it less useful. Syslog would mangle them further to a complete unrecognition.
On Tue, 10 Oct 2006, Tom Lane wrote: > Sergiy Vyshnevetskiy <serg@vostok.net> writes: >>> It's not magic, it's from ICU patch. Want me to send you a copy? :) > > You're missing my point, which is that non-ICU locale support doesn't > necessarily recognize the same encoding names. We would have done this > years ago if we had a solution to that problem. We should use IANA-standard names. If it fails - it does nothing. Anybody porting PostgreSQL to new platform can go over the list and make a patch for their port.
On Tue, 10 Oct 2006, Sergiy Vyshnevetskiy wrote: > On Tue, 10 Oct 2006, Tom Lane wrote: > >> Sergiy Vyshnevetskiy <serg@vostok.net> writes: >> > > It's not magic, it's from ICU patch. Want me to send you a copy? :) >> >> You're missing my point, which is that non-ICU locale support doesn't >> necessarily recognize the same encoding names. We would have done this >> years ago if we had a solution to that problem. > > We should use IANA-standard names. If it fails - it does nothing. > Anybody porting PostgreSQL to new platform can go over the list and make a > patch for their port. Here is a new and improved patch, that closes security hole as well. To prevent DOS attack we lock LC_MESSAGES as C for any database encoding that we are unable to bind to our textdomain.
Sergiy Vyshnevetskiy <serg@vostok.net> writes: > Here is a new and improved patch, that closes security hole as well. We really can't consider a patch like this, because not only does it ignore the problem of multiple spellings of encoding names, but it actually breaks existing functionality on platforms with a variant spelling of the name. I think a minimum requirement ought to be that it work with any of the spellings recognized by initdb. regards, tom lane
On Tue, 10 Oct 2006, Tom Lane wrote: > Sergiy Vyshnevetskiy <serg@vostok.net> writes: >> Here is a new and improved patch, that closes security hole as well. > > We really can't consider a patch like this, because not only does it > ignore the problem of multiple spellings of encoding names, but it > actually breaks existing functionality on platforms with a variant > spelling of the name. I think a minimum requirement ought to be that > it work with any of the spellings recognized by initdb. Alright, that was too strict. But when server uses messages in LC_CTYPE encoding with data in database encoding and pushes this mix through database-to-client charset conversion - that's a bug. PostgreSQL bug. And "UTF-8 panic" is it's direct result. As a stop-gap I included a version of patch that breaks nothing. But it will fix the "wrong encoding" bug and "UTF-8 panic" only on those OS who recognize the supplied spelling. Linux and FreeBSD are among them. Cycling through possible spellings in SetDatabaseEncoding() is suboptimal. The time and place to do it is somewhere in the configure script. There we can fill pg_enc2localname_tbl with results of testing possible charset names. We can also just leave the patch as it is, because more and more OS learn more and more different charset name spellings every new version. Why waste too mush power chasing a horce that runs _to_you_? :)