Thread: BUG #2685: Wrong charset of server messages on client [PATCH]

BUG #2685: Wrong charset of server messages on client [PATCH]

From
"Sergiy Vyshnevetskiy"
Date:
The following bug has been logged online:

Bug reference:      2685
Logged by:          Sergiy Vyshnevetskiy
Email address:      serg@vostok.net
PostgreSQL version: 8.1
Operating system:   FreeBSD-6 stable
Description:        Wrong charset of server messages on client [PATCH]
Details:

DESCRIPTION:

PostgreSQL backend uses gettext() to localize its messages. The charset of
localized messages is determined by LC_CTYPE by default.

Then the message is processed through sprintf-like mechanism (with database
data as possible arguments) and fed to send_message_to_frontend(), that
converts data from _database_charset_(!) to client charset.

If LC_CTYPE is not the same as (at least binary compatible to) database
charset, then client gets garbage characters in server messages. If database
charset is UTF-8, then cluster may recusively generate "invalid byte
sequence for encoding" errors till it fills up
errordata[ERRORDATA_STACK_SIZE], then it panics.

SOLUTION:

Convert server messages to database charset.

PATCH:

--- src/backend/utils/mb/mbutils.c.o0 Tue Oct 10 11:51:13 2006

+++ src/backend/utils/mb/mbutils.c  Tue Oct 10 11:49:22 2006

@@ -615,6 +615,7 @@

  DatabaseEncoding = &pg_enc2name_tbl[encoding];

  Assert(DatabaseEncoding->encoding == encoding);

 #ifdef USE_ICU

+
bind_textdomain_codeset("postgres",(&pg_enc2iananame_tbl[encoding])->name);

  ucnv_setDefaultName((&pg_enc2iananame_tbl[encoding])->name);

 #endif

 }




This, however, uncovers another bug: PostgreSQL dumps the messages into
stderr/syslog as-is, without converting database data from database charset
to charset from LC_MESSAGES. After this patch it will do so with message
text too. The fix should be trivial - set up a conversion from database
charset to server charset. I will post a patch for it later.

NOTE:

I used pg_enc2iananame_tbl instead of pg_enc2name_tbl, because gettext
doesn't accept many

Possible TODO:
Change PostgreSQL charset names to IANA-standard names.

Re: BUG #2685: Wrong charset of server messages on client [PATCH]

From
Tom Lane
Date:
"Sergiy Vyshnevetskiy" <serg@vostok.net> writes:
> Convert server messages to database charset.

This has been discussed before:
http://archives.postgresql.org/pgsql-patches/2005-08/msg00245.php

The magic pg_enc2iananame_tbl[] you reference in your patch does not exist,
and if it did exist it wouldn't work on all platforms, since encoding
names aren't sufficiently well standardized :-(

> This, however, uncovers another bug: PostgreSQL dumps the messages into
> stderr/syslog as-is, without converting database data from database charset
> to charset from LC_MESSAGES.

I'm quite unconvinced that that's a bug.  If we tried to do a conversion
here, it would be trivial to set up denials of service for logging ---
just include a character in a comment in your SQL command that cannot be
converted to the LC_MESSAGES character set.

            regards, tom lane

Re: BUG #2685: Wrong charset of server messages on client

From
Sergiy Vyshnevetskiy
Date:
On Tue, 10 Oct 2006, Sergiy Vyshnevetskiy wrote:

> On Tue, 10 Oct 2006, Tom Lane wrote:
>
>>  "Sergiy Vyshnevetskiy" <serg@vostok.net> writes:
>> >  Convert server messages to database charset.
>>
>>  This has been discussed before:
>>  http://archives.postgresql.org/pgsql-patches/2005-08/msg00245.php
>>
>>  The magic pg_enc2iananame_tbl[] you reference in your patch does not
>>  exist,
>>  and if it did exist it wouldn't work on all platforms, since encoding
>>  names aren't sufficiently well standardized :-(
>
> It's not magic, it's from ICU patch. Want me to send you a copy? :)

Sorry. I thought it was more well-known. Just looked into gentoo portage -
they don't know about it eigther.

The patch is here:

http://people.freebsd.org/~girgen/postgresql-icu/pg-814-icu-xx-2006-09-25.diff.gz

This is the current list of encodings, according to iana:

http://www.iana.org/assignments/character-sets

Re: BUG #2685: Wrong charset of server messages on client

From
Sergiy Vyshnevetskiy
Date:
On Tue, 10 Oct 2006, Sergiy Vyshnevetskiy wrote:

> On Tue, 10 Oct 2006, Sergiy Vyshnevetskiy wrote:
>
>>  On Tue, 10 Oct 2006, Tom Lane wrote:
>>
>> >   "Sergiy Vyshnevetskiy" <serg@vostok.net> writes:
>> > >   Convert server messages to database charset.
>> >
>> >   This has been discussed before:
>> >   http://archives.postgresql.org/pgsql-patches/2005-08/msg00245.php
>> >
>> >   The magic pg_enc2iananame_tbl[] you reference in your patch does not
>> >   exist,
>> >   and if it did exist it wouldn't work on all platforms, since encoding
>> >   names aren't sufficiently well standardized :-(
>>
>>  It's not magic, it's from ICU patch. Want me to send you a copy? :)
>
> Sorry. I thought it was more well-known. Just looked into gentoo portage -
> they don't know about it eigther.
>
> The patch is here:
>
> http://people.freebsd.org/~girgen/postgresql-icu/pg-814-icu-xx-2006-09-25.diff.gz
>
> This is the current list of encodings, according to iana:
>
> http://www.iana.org/assignments/character-sets

ICU homepage is

http://icu.sourceforge.net/

Re: BUG #2685: Wrong charset of server messages on client [PATCH]

From
Tom Lane
Date:
Sergiy Vyshnevetskiy <serg@vostok.net> writes:
>> It's not magic, it's from ICU patch. Want me to send you a copy? :)

You're missing my point, which is that non-ICU locale support doesn't
necessarily recognize the same encoding names.  We would have done this
years ago if we had a solution to that problem.

            regards, tom lane

Re: BUG #2685: Wrong charset of server messages on client

From
Sergiy Vyshnevetskiy
Date:
On Tue, 10 Oct 2006, Tom Lane wrote:

> "Sergiy Vyshnevetskiy" <serg@vostok.net> writes:
>> Convert server messages to database charset.
>
> This has been discussed before:
> http://archives.postgresql.org/pgsql-patches/2005-08/msg00245.php
>
> The magic pg_enc2iananame_tbl[] you reference in your patch does not exist,
> and if it did exist it wouldn't work on all platforms, since encoding
> names aren't sufficiently well standardized :-(

It's not magic, it's from ICU patch. Want me to send you a copy? :)

>> This, however, uncovers another bug: PostgreSQL dumps the messages into
>> stderr/syslog as-is, without converting database data from database charset
>> to charset from LC_MESSAGES.
>
> I'm quite unconvinced that that's a bug.  If we tried to do a conversion
> here, it would be trivial to set up denials of service for logging ---
> just include a character in a comment in your SQL command that cannot be
> converted to the LC_MESSAGES character set.

They have to be printed as escape sequences. I think that dumping raw
string data in log without converting them to printable form can be used
to mess up log viewer at least. (At most this can be a security breach.)

Having row multibyte characters mixed with characters in LC_CTYPE in the
log makes it less useful. Syslog would mangle them further to a complete
unrecognition.

Re: BUG #2685: Wrong charset of server messages on client

From
Sergiy Vyshnevetskiy
Date:
On Tue, 10 Oct 2006, Tom Lane wrote:

> Sergiy Vyshnevetskiy <serg@vostok.net> writes:
>>> It's not magic, it's from ICU patch. Want me to send you a copy? :)
>
> You're missing my point, which is that non-ICU locale support doesn't
> necessarily recognize the same encoding names.  We would have done this
> years ago if we had a solution to that problem.

We should use IANA-standard names. If it fails - it does nothing.
Anybody porting PostgreSQL to new platform can go over the list and make a
patch for their port.

Re: BUG #2685: Wrong charset of server messages on client

From
Sergiy Vyshnevetskiy
Date:
On Tue, 10 Oct 2006, Sergiy Vyshnevetskiy wrote:

> On Tue, 10 Oct 2006, Tom Lane wrote:
>
>>  Sergiy Vyshnevetskiy <serg@vostok.net> writes:
>> > >  It's not magic, it's from ICU patch. Want me to send you a copy? :)
>>
>>  You're missing my point, which is that non-ICU locale support doesn't
>>  necessarily recognize the same encoding names.  We would have done this
>>  years ago if we had a solution to that problem.
>
> We should use IANA-standard names. If it fails - it does nothing.
> Anybody porting PostgreSQL to new platform can go over the list and make a
> patch for their port.

Here is a new and improved patch, that closes security hole as well. To
prevent DOS attack we lock LC_MESSAGES as C for any database encoding that
we are unable to bind to our textdomain.

Re: BUG #2685: Wrong charset of server messages on client

From
Tom Lane
Date:
Sergiy Vyshnevetskiy <serg@vostok.net> writes:
> Here is a new and improved patch, that closes security hole as well.

We really can't consider a patch like this, because not only does it
ignore the problem of multiple spellings of encoding names, but it
actually breaks existing functionality on platforms with a variant
spelling of the name.  I think a minimum requirement ought to be that
it work with any of the spellings recognized by initdb.

            regards, tom lane

Re: BUG #2685: Wrong charset of server messages on client

From
Sergiy Vyshnevetskiy
Date:
On Tue, 10 Oct 2006, Tom Lane wrote:

> Sergiy Vyshnevetskiy <serg@vostok.net> writes:
>> Here is a new and improved patch, that closes security hole as well.
>
> We really can't consider a patch like this, because not only does it
> ignore the problem of multiple spellings of encoding names, but it
> actually breaks existing functionality on platforms with a variant
> spelling of the name.  I think a minimum requirement ought to be that
> it work with any of the spellings recognized by initdb.

Alright, that was too strict. But when server uses messages in
LC_CTYPE encoding with data in database encoding and pushes this mix
through database-to-client charset conversion - that's a bug. PostgreSQL
bug. And "UTF-8 panic" is it's direct result.

As a stop-gap I included a version of patch that breaks nothing. But it
will fix the "wrong encoding" bug and "UTF-8 panic" only on those OS who
recognize the supplied spelling. Linux and FreeBSD are among them.

Cycling through possible spellings in SetDatabaseEncoding() is suboptimal.
The time and place to do it is somewhere in the configure script. There we
can fill pg_enc2localname_tbl with results of testing possible charset
names.

We can also just leave the patch as it is, because more and more OS learn
more and more different charset name spellings every new version. Why
waste too mush power chasing a horce that runs _to_you_? :)