Re: [bug fix] multibyte messages are displayed incorrectly on the client - Mailing list pgsql-hackers

From Noah Misch
Subject Re: [bug fix] multibyte messages are displayed incorrectly on the client
Msg-id 20131230030207.GA1551279@tornado.leadboat.com
In response to Re: [bug fix] multibyte messages are displayed incorrectly on the client  ("MauMau" <maumau307@gmail.com>)
Responses Re: [bug fix] multibyte messages are displayed incorrectly on the client
List pgsql-hackers

On Sun, Dec 22, 2013 at 07:51:55PM +0900, MauMau wrote:
> From: "Noah Misch" <noah@leadboat.com>
> >Better to attack that directly.  Arrange to apply any client_encoding
> >named in the startup packet earlier, before authentication.  This relates
> >to the TODO item "Let the client indicate character encoding of database
> >names, user names, and passwords".  (I expect such an endeavor to be
> >tricky.)
> 
> Unfortunately, character set conversion is not possible until the
> database session is established, since it requires system catalog
> access.  Please see the comment in src/backend/utils/mb/mbutils.c:
> 
> * During backend startup we can't set client encoding because we (a)
> * can't look up the conversion functions, and (b) may not know the database
> * encoding yet either.  So SetClientEncoding() just accepts anything and
> * remembers it for InitializeClientEncoding() to apply later.

Yes, changing that is the tricky part.
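
To make the quoted comment concrete, here is a minimal sketch of the "remember now, apply later" pattern it describes.  This is not the code in mbutils.c: every name below (sketch_set_client_encoding, sketch_initialize_client_encoding, conversion_available, the integer encoding IDs) is hypothetical, and the catalog lookup is replaced by a toy range check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these do not mirror PostgreSQL's internals. */
#define ENC_DEFAULT 0                       /* assumed "no conversion" encoding */

static bool session_ready = false;          /* can we look up conversions yet? */
static int  pending_encoding = -1;          /* request remembered during startup */
static int  active_encoding = ENC_DEFAULT;  /* encoding actually in effect */

/* Toy replacement for the catalog lookup of a conversion function. */
static bool
conversion_available(int enc)
{
    return enc >= 0 && enc < 3;
}

/* During startup, accept anything and remember it; validate only later. */
static int
sketch_set_client_encoding(int enc)
{
    if (!session_ready)
    {
        pending_encoding = enc;             /* cannot validate yet */
        return 0;
    }
    if (!conversion_available(enc))
        return -1;
    active_encoding = enc;
    return 0;
}

/* Once the session is established, apply whatever was remembered. */
static int
sketch_initialize_client_encoding(void)
{
    int enc = pending_encoding;

    session_ready = true;
    pending_encoding = -1;
    if (enc < 0)
        return 0;                           /* nothing was requested */
    return sketch_set_client_encoding(enc);
}
```

Between the first call and sketch_initialize_client_encoding(), active_encoding stays at the default, so any message sent in that window goes out unconverted.  That window is what would have to shrink to cover authentication.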

> I guess that's why Tom-san suggested the same solution as my patch
> (as a compromise) in the thread below, which is also a TODO item:
> 
> Re: encoding of PostgreSQL messages
> http://www.postgresql.org/message-id/19896.1234107496@sss.pgh.pa.us

That's fair for the necessarily-earliest messages, like 'invalid value for
parameter "client_encoding"' and messages pertaining to the physical structure
of the startup packet.  The client's encoding expectation is unknowable.  An
error that mentions "client_encoding" will hopefully put users on the right
track regardless of how we translate and encode the surrounding words.  The
other affected messages are quite technical, making a casual user unlikely to
fix or even see them.  Not so for authentication messages, so I'm wary of
forcing use of ASCII that late in the handshake.

Note that choosing to use ASCII need not imply wholly declining to translate.
If the build uses GNU libiconv, gettext can emit ASCII approximations for
translations that conform to a Latin-derived alphabet, falling back to no
translation where the alphabet differs too much.  pg_perm_setlocale(LC_CTYPE,
"C") requests such behavior.  (The inferior iconv //TRANSLIT implementation of
GNU libc will convert non-ASCII characters to question marks, though.)
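
For anyone wanting to compare the two iconv implementations, a small sketch of a direct //TRANSLIT conversion follows.  It calls iconv() itself rather than going through gettext, and the helper name utf8_to_ascii_translit is hypothetical.  The output for non-ASCII input depends on the iconv implementation and on the locale in effect, so no expected output is given.

```c
#include <assert.h>
#include <iconv.h>
#include <string.h>

/*
 * Convert a NUL-terminated UTF-8 string to ASCII, asking iconv to
 * transliterate characters that have no ASCII equivalent.
 * Returns 0 on success, -1 on failure.  Hypothetical helper for
 * illustration only; not PostgreSQL code.
 */
static int
utf8_to_ascii_translit(const char *src, char *dst, size_t dstlen)
{
    iconv_t cd;
    char   *in = (char *) src;      /* iconv() takes non-const input */
    char   *out = dst;
    size_t  inleft;
    size_t  outleft;
    size_t  rc;

    assert(src != NULL && dst != NULL);
    if (dstlen == 0)
        return -1;
    inleft = strlen(src);
    outleft = dstlen - 1;           /* reserve room for the terminator */

    cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t) -1)
        return -1;                  /* conversion not supported here */

    rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (rc == (size_t) -1)
        return -1;                  /* invalid input or buffer too small */

    *out = '\0';
    return 0;
}
```

Feeding it a translated message such as "contrase\xc3\xb1" "a incorrecta" (the literal is split so the hex escape does not swallow the following "a") shows what a given platform does with the non-ASCII character: an approximation or a question mark.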

> From: "Alvaro Herrera" <alvherre@2ndquadrant.com>
> >The problem is that if there's an encoding mismatch, the message might
> >be impossible to figure out.  If the message is in English, at least it
> >can be searched for on the web, or something -- the user might even find
> >a page in which the English error string appears, with a native language
> >explanation.
> 
> I feel like this, too.  Being readable in English is better than
> being unrecognizable.

I agree that English consistently beats mojibake.  I question whether that
makes up for the loss of translation when encodings do happen to match,
particularly for non-technical errors like a mistyped password.  The
everything-UTF8 scenario appears often, perhaps explaining infrequent
complaints about the status quo.  If 90% of translated message users have
client_encoding != server_encoding, then +1 for your patch's strategy.  If the
figure is only 60%, I'd vote for holding out for a more-extensive fix that
allows us to encoding-convert localized authentication failure messages.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com


