Thread: [bug fix] multibyte messages are displayed incorrectly on the client
Hello, The attached patch fixes incorrect message output on the client side. I guess this problem can happen with any major release. Could you review this? [Problem] When the client's locale differs from the server's message locale, the messages generated on the server are converted appropriately and sent to the client. For example, if the server runs on Linux with lc_messages = 'ja_JP.UTF-8' in postgresql.conf, and you run psql on Windows where the system locale is SJIS, Japanese messages are converted from UTF-8 to SJIS on the server and sent to psql. psql can display those SJIS messages correctly. This is no problem. However, this desirable behavior holds true only after the database session is established. The error messages during session establishment are displayed incorrectly, and you cannot recognize the message contents. For example, run psql -d postgres -U non-existent-username. The displayed message is unrecognizable. [Cause] While the session is being established, the server cannot use the client encoding for message conversion yet, because it cannot access system catalogs to retrieve conversion functions. So, the server sends messages to the client without conversion. In the above example, the server sends Japanese UTF-8 messages to psql, which expects those messages in SJIS. [Fix] Disable message localization during session startup. In other words, messages are output in English until the database session is established. Regards MauMau
Attachment
On Fri, Dec 13, 2013 at 10:41:17PM +0900, MauMau wrote: > [Cause] > While the session is being established, the server cannot use the > client encoding for message conversion yet, because it cannot access > system catalogs to retrieve conversion functions. So, the server > sends messages to the client without conversion. In the above > example, the server sends Japanese UTF-8 messages to psql, which > expects those messages in SJIS. > > > [Fix] > Disable message localization during session startup. In other > words, messages are output in English until the database session is > established. I think the question is whether the server encoding or English are likely to be better for the average client. My bet is that the server encoding is more likely correct. However, you are right that English/ASCII at least will always be viewable, while there are many server/client combinations that will produce unreadable characters. I would be interested to hear other people's experience with this. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Tue, Dec 17, 2013 at 01:42:08PM -0500, Bruce Momjian wrote: > On Fri, Dec 13, 2013 at 10:41:17PM +0900, MauMau wrote: > > [Cause] > > While the session is being established, the server cannot use the > > client encoding for message conversion yet, because it cannot access > > system catalogs to retrieve conversion functions. So, the server > > sends messages to the client without conversion. In the above > > example, the server sends Japanese UTF-8 messages to psql, which > > expects those messages in SJIS. Better to attack that directly. Arrange to apply any client_encoding named in the startup packet earlier, before authentication. This relates to the TODO item "Let the client indicate character encoding of database names, user names, and passwords". (I expect such an endeavor to be tricky.) > > [Fix] > > Disable message localization during session startup. In other > > words, messages are output in English until the database session is > > established. > > I think the question is whether the server encoding or English are > likely to be better for the average client. My bet is that the server > encoding is more likely correct. > > However, you are right that English/ASCII at least will always be > viewable, while there are many server/client combinations that will > produce unreadable characters. > > I would be interested to hear other people's experience with this. I don't have a sufficient sense of multilingualism among our users to know whether English/ASCII messages would be more useful, on average, than localized messages in the server encoding. Forcing English/ASCII does worsen behavior in the frequent situation where client encoding will match server encoding. I lean toward retaining the status quo of delivering localized messages in the server encoding. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Re: Re: [bug fix] multibyte messages are displayed incorrectly on the client
From
Alvaro Herrera
Date:
Noah Misch escribió: > On Tue, Dec 17, 2013 at 01:42:08PM -0500, Bruce Momjian wrote: > > On Fri, Dec 13, 2013 at 10:41:17PM +0900, MauMau wrote: > > > [Fix] > > > Disable message localization during session startup. In other > > > words, messages are output in English until the database session is > > > established. > > > > I think the question is whether the server encoding or English are > > likely to be better for the average client. My bet is that the server > > encoding is more likely correct. > > > > However, you are right that English/ASCII at least will always be > > viewable, while there are many server/client combinations that will > > produce unreadable characters. > > > > I would be interested to hear other people's experience with this. > > I don't have a sufficient sense of multilingualism among our users to know > whether English/ASCII messages would be more useful, on average, than > localized messages in the server encoding. Forcing English/ASCII does worsen > behavior in the frequent situation where client encoding will match server > encoding. I lean toward retaining the status quo of delivering localized > messages in the server encoding. The problem is that if there's an encoding mismatch, the message might be impossible to figure out. If the message is in english, at least it can be searched for in the web, or something -- the user might even find a page in which the english error string appears, with a native language explanation. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
From: "Noah Misch" <noah@leadboat.com> > Better to attack that directly. Arrange to apply any client_encoding > named in > the startup packet earlier, before authentication. This relates to the > TODO > item "Let the client indicate character encoding of database names, user > names, and passwords". (I expect such an endeavor to be tricky.) Unfortunately, character set conversion is not possible until the database session is established, since it requires system catalog access. Please the comment in src/backend/utils/mb/mbutils.c: * During backend startup we can't set client encoding because we (a)* can't look up the conversion functions, and (b) maynot know the database* encoding yet either. So SetClientEncoding() just accepts anything and* remembers it for InitializeClientEncoding()to apply later. I guess that's why Tom-san suggested the same solution as my patch (as a compromise) in the below thread, which is also a TODO item: Re: encoding of PostgreSQL messages http://www.postgresql.org/message-id/19896.1234107496@sss.pgh.pa.us From: "Alvaro Herrera" <alvherre@2ndquadrant.com> > The problem is that if there's an encoding mismatch, the message might > be impossible to figure out. If the message is in english, at least it > can be searched for in the web, or something -- the user might even find > a page in which the english error string appears, with a native language > explanation. I feel like this, too. Being readable in English is better than being unrecognizable. Regards MauMau
On Sun, Dec 22, 2013 at 07:51:55PM +0900, MauMau wrote: > From: "Noah Misch" <noah@leadboat.com> > >Better to attack that directly. Arrange to apply any > >client_encoding named in > >the startup packet earlier, before authentication. This relates > >to the TODO > >item "Let the client indicate character encoding of database names, user > >names, and passwords". (I expect such an endeavor to be tricky.) > > Unfortunately, character set conversion is not possible until the > database session is established, since it requires system catalog > access. Please the comment in src/backend/utils/mb/mbutils.c: > > * During backend startup we can't set client encoding because we (a) > * can't look up the conversion functions, and (b) may not know the database > * encoding yet either. So SetClientEncoding() just accepts anything and > * remembers it for InitializeClientEncoding() to apply later. Yes, changing that is the tricky part. > I guess that's why Tom-san suggested the same solution as my patch > (as a compromise) in the below thread, which is also a TODO item: > > Re: encoding of PostgreSQL messages > http://www.postgresql.org/message-id/19896.1234107496@sss.pgh.pa.us That's fair for the necessarily-earliest messages, like 'invalid value for parameter "client_encoding"' and messages pertaining to the physical structure of the startup packet. The client's encoding expectation is unknowable. An error that mentions "client_encoding" will hopefully put users on the right track regardless of how we translate and encode the surrounding words. The other affected messages are quite technical, making a casual user unlikely to fix or even see them. Not so for authentication messages, so I'm wary of forcing use of ASCII that late in the handshake. Note that choosing to use ASCII need not imply wholly declining to translate. If the build uses GNU libiconv, gettext can emit ASCII approximations for translations that conform to a Latin-derived alphabet, falling back to no translation where the alphabet differs too much. pg_perm_setlocale(LC_CTYPE, "C") requests such behavior. (The inferior iconv //TRANSLIT implementation of GNU libc will convert non-ASCII characters to question marks, though.) > From: "Alvaro Herrera" <alvherre@2ndquadrant.com> > >The problem is that if there's an encoding mismatch, the message might > >be impossible to figure out. If the message is in english, at least it > >can be searched for in the web, or something -- the user might even find > >a page in which the english error string appears, with a native language > >explanation. > > I feel like this, too. Being readable in English is better than > being unrecognizable. I agree that English consistently beats mojibake. I question whether that makes up for the loss of translation when encodings do happen to match, particularly for non-technical errors like a mistyped password. The everything-UTF8 scenario appears often, perhaps explaining infrequent complaints about the status quo. If 90% of translated message users have client_encoding != server_encoding, then +1 for your patch's strategy. If the figure is only 60%, I'd vote for holding out for a more-extensive fix that allows us to encoding-convert localized authentication failure messages. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
From: "Noah Misch" <noah@leadboat.com> > I agree that English consistently beats mojibake. I question whether that > makes up for the loss of translation when encodings do happen to match, > particularly for non-technical errors like a mistyped password. The > everything-UTF8 scenario appears often, perhaps explaining infrequent > complaints about the status quo. If 90% of translated message users have > client_encoding != server_encoding, then +1 for your patch's strategy. If > the > figure is only 60%, I'd vote for holding out for a more-extensive fix that > allows us to encoding-convert localized authentication failure messages. I agree with you. It would be more friendly to users if more messages are localized. Then, as a happy medium, how about disabling message localization only if the client encoding differs from the server one? That is, compare the client_encoding value in the startup packet with the result of GetPlatformEncoding(). If they don't match, call disable_message_localization(). Regards MauMau
On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote: > From: "Noah Misch" <noah@leadboat.com> > >I agree that English consistently beats mojibake. I question whether that > >makes up for the loss of translation when encodings do happen to match, > >particularly for non-technical errors like a mistyped password. The > >everything-UTF8 scenario appears often, perhaps explaining infrequent > >complaints about the status quo. If 90% of translated message users have > >client_encoding != server_encoding, then +1 for your patch's > >strategy. If the > >figure is only 60%, I'd vote for holding out for a more-extensive fix that > >allows us to encoding-convert localized authentication failure messages. > > I agree with you. It would be more friendly to users if more > messages are localized. > > Then, as a happy medium, how about disabling message localization > only if the client encoding differs from the server one? That is, > compare the client_encoding value in the startup packet with the > result of GetPlatformEncoding(). If they don't match, call > disable_message_localization(). I think the problem is we don't know the client and server encodings at that time. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
From: "Bruce Momjian" <bruce@momjian.us> > On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote: >> Then, as a happy medium, how about disabling message localization >> only if the client encoding differs from the server one? That is, >> compare the client_encoding value in the startup packet with the >> result of GetPlatformEncoding(). If they don't match, call >> disable_message_localization(). > > I think the problem is we don't know the client and server encodings > at that time. I suppose we know (or at least believe) those encodings during backend startup: * client encoding - the client_encoding parameter passed in the startup packet, or if that's not present, client_encoding GUC value. * server encoding - the encoding of strings gettext() returns. That is what GetPlatformEncoding() returns. Am I correct? Regards MauMau
On Tue, Jan 7, 2014 at 8:56 AM, MauMau <maumau307@gmail.com> wrote: > I suppose we know (or at least believe) those encodings during backend > startup: > > * client encoding - the client_encoding parameter passed in the startup > packet, or if that's not present, client_encoding GUC value. > > * server encoding - the encoding of strings gettext() returns. That is what > GetPlatformEncoding() returns. Suppose the startup packet itself is malformed. How will you report the error? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: "Robert Haas" <robertmhaas@gmail.com> > Suppose the startup packet itself is malformed. How will you report the > error? I think we have no choice but to report the error in English, because we don't know what the client wants. Regards MauMau
From: "Robert Haas" <robertmhaas@gmail.com> > Suppose the startup packet itself is malformed. How will you report the > error? I think we have no choice but to report the error in English, because we don't know what the client wants. Regards MauMau
On Tue, Jan 07, 2014 at 10:56:28PM +0900, MauMau wrote: > From: "Bruce Momjian" <bruce@momjian.us> > >On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote: > >>Then, as a happy medium, how about disabling message localization > >>only if the client encoding differs from the server one? That is, > >>compare the client_encoding value in the startup packet with the > >>result of GetPlatformEncoding(). If they don't match, call > >>disable_message_localization(). I like this proposal. Thanks. > >I think the problem is we don't know the client and server encodings > >at that time. > > I suppose we know (or at least believe) those encodings during > backend startup: > > * client encoding - the client_encoding parameter passed in the > startup packet, or if that's not present, client_encoding GUC value. > > * server encoding - the encoding of strings gettext() returns. That > is what GetPlatformEncoding() returns. Agreed. You would need to poke into the relevant part of the startup packet much earlier than we do today, but that's tractable. Note that GetPlatformEncoding() is gone; use GetMessageEncoding(). -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Noah Misch <noah@leadboat.com> writes: >> On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote: >>> Then, as a happy medium, how about disabling message localization >>> only if the client encoding differs from the server one? That is, >>> compare the client_encoding value in the startup packet with the >>> result of GetPlatformEncoding(). If they don't match, call >>> disable_message_localization(). > I like this proposal. Thanks. > ... > Agreed. You would need to poke into the relevant part of the startup packet > much earlier than we do today, but that's tractable. There's still the problem of what to do before we have a complete startup packet, or if the packet is defective enough to not contain a recognizable client encoding. Perhaps more to the point, what it sounds like this is doing is creating a third behavioral state, in between what prevails when we're first reading the packet and what prevails after we've finally adopted the requested client encoding. I'm less than convinced that's a good thing. I'm also rather unexcited by the idea of introducing redundant and/or ad-hoc code to parse the startup packet. That sounds like a recipe for bugs, some of which might even rise to security issues, considering it would happen before client authentication. I think if we're going to do anything like this at all, it'd be best just to disable localization from postmaster fork up till we've gotten a client encoding out of the packet in the normal course of events. regards, tom lane
On Fri, Jan 10, 2014 at 08:03:00PM -0500, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > >> On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote: > >>> Then, as a happy medium, how about disabling message localization > >>> only if the client encoding differs from the server one? That is, > >>> compare the client_encoding value in the startup packet with the > >>> result of GetPlatformEncoding(). If they don't match, call > >>> disable_message_localization(). > > > I like this proposal. Thanks. > > ... > > Agreed. You would need to poke into the relevant part of the startup packet > > much earlier than we do today, but that's tractable. > > There's still the problem of what to do before we have a complete startup > packet, or if the packet is defective enough to not contain a recognizable > client encoding. MauMau proposed using untranslated messages until we're past that point. I like that answer fine, because routine mistakes from ordinary users will not elicit the errors in question. The most interesting message in that group might be 'invalid value for parameter "client_encoding"', and I think the presence of the term "client_encoding" will be a sufficient clue regardless of how we translate and encode the surrounding words. > Perhaps more to the point, what it sounds like this is doing is creating > a third behavioral state, in between what prevails when we're first > reading the packet and what prevails after we've finally adopted the > requested client encoding. I'm less than convinced that's a good thing. > > I'm also rather unexcited by the idea of introducing redundant and/or > ad-hoc code to parse the startup packet. That sounds like a recipe for > bugs, some of which might even rise to security issues, considering it > would happen before client authentication. Valid worries. > I think if we're going to do anything like this at all, it'd be best > just to disable localization from postmaster fork up till we've gotten > a client encoding out of the packet in the normal course of events. That was MauMau's original proposal. I opined upthread that it would be better to change nothing than to do that. nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
From: "MauMau" <maumau307@gmail.com> > From: "Noah Misch" <noah@leadboat.com> >> I agree that English consistently beats mojibake. I question whether >> that >> makes up for the loss of translation when encodings do happen to match, >> particularly for non-technical errors like a mistyped password. The >> everything-UTF8 scenario appears often, perhaps explaining infrequent >> complaints about the status quo. If 90% of translated message users have >> client_encoding != server_encoding, then +1 for your patch's strategy. >> If the >> figure is only 60%, I'd vote for holding out for a more-extensive fix >> that >> allows us to encoding-convert localized authentication failure messages. > > I agree with you. It would be more friendly to users if more messages are > localized. > > Then, as a happy medium, how about disabling message localization only if > the client encoding differs from the server one? That is, compare the > client_encoding value in the startup packet with the result of > GetPlatformEncoding(). If they don't match, call > disable_message_localization(). I did this with the attached patch. I added some code in BackendInitialize(). I'll update the CommitFest entry in a few days. Regards MauMau
Attachment
"MauMau" <maumau307@gmail.com> writes: >> Then, as a happy medium, how about disabling message localization only if >> the client encoding differs from the server one? That is, compare the >> client_encoding value in the startup packet with the result of >> GetPlatformEncoding(). If they don't match, call >> disable_message_localization(). AFAICT this is not what was agreed to in this thread. It puts far too much credence in the server-side default for client_encoding, which up to now has never been thought to be very interesting; indeed I doubt most people bother to set it at all. The reason that this issue is even on the table is that that default is too likely to be wrong, no? Also, whatever possessed you to use pg_get_encoding_from_locale to identify the server's encoding? That's expensive and seems fairly unlikely to yield the right answer. I don't remember offhand where we keep the postmaster's idea of what encoding messages should be in, but I'm fairly sure it's stored explicitly somewhere. Or if it isn't, we can for sure do better than recalculating it during every connection attempt. Having said all that, though, I'm unconvinced that this cure isn't worse than the disease. Somebody claimed upthread that no very interesting messages would be delocalized by a change like this, but that's complete nonsense: in particular, *every* message associated with client authentication will be sent in English if we go down this path. Given the nearly complete lack of complaints in the many years that this code has worked like this, I'm betting that most people will find a change like this to be a net reduction in friendliness. Given the changes here to extract client_encoding from the startup packet ASAP, I wonder whether the right thing isn't just to set the client encoding immediately when we do that. Most application libraries pass client encoding in the startup packet anyway (libpq certainly does). regards, tom lane
Re: [bug fix] multibyte messages are displayed incorrectly on the client
From
Heikki Linnakangas
Date:
On 04/05/2014 07:56 AM, Tom Lane wrote: > "MauMau" <maumau307@gmail.com> writes: >>> Then, as a happy medium, how about disabling message localization only if >>> the client encoding differs from the server one? That is, compare the >>> client_encoding value in the startup packet with the result of >>> GetPlatformEncoding(). If they don't match, call >>> disable_message_localization(). > > AFAICT this is not what was agreed to in this thread. It puts far too > much credence in the server-side default for client_encoding, which up to > now has never been thought to be very interesting; indeed I doubt most > people bother to set it at all. The reason that this issue is even on > the table is that that default is too likely to be wrong, no? > > Also, whatever possessed you to use pg_get_encoding_from_locale to > identify the server's encoding? That's expensive and seems fairly > unlikely to yield the right answer. I don't remember offhand where we > keep the postmaster's idea of what encoding messages should be in, but I'm > fairly sure it's stored explicitly somewhere. Or if it isn't, we can for > sure do better than recalculating it during every connection attempt. > > Having said all that, though, I'm unconvinced that this cure isn't worse > than the disease. Somebody claimed upthread that no very interesting > messages would be delocalized by a change like this, but that's complete > nonsense: in particular, *every* message associated with client > authentication will be sent in English if we go down this path. Given > the nearly complete lack of complaints in the many years that this code > has worked like this, I'm betting that most people will find a change > like this to be a net reduction in friendliness. > > Given the changes here to extract client_encoding from the startup packet > ASAP, I wonder whether the right thing isn't just to set the client > encoding immediately when we do that. Most application libraries pass > client encoding in the startup packet anyway (libpq certainly does). Based on Tom's comments above, I'm marking this as returned with feedback in the commitfest. I agree that setting client_encoding as early as possible seems like the right thing to do. Earlier in this thread, MauMau pointed out that we can't do encoding conversions until we have connected to the database because you need to read pg_conversion for that. That's because we support creating custom conversions with CREATE CONVERSION. Frankly, I don't think anyone cares about that feature. If we just dropped the CREATE/DROP CONVERSION feature altogether and hard-coded the conversions we have, there would be close to zero complaints. Even if you want to extend something around encodings and conversions, the CREATE CONVERSION interface is clunky. Firstly, conversions are per-database, and even schema-qualified, which just seems like an extra complication. You'll most likely want to modify the conversion across the whole system. Secondly, rather than define a new conversion between encodings, you'll likely want to define a whole new encoding with conversions to/from existing encodings, but you can't do that anyway without hacking the source code. - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > Earlier in this thread, MauMau pointed out that we can't do encoding > conversions until we have connected to the database because you need to > read pg_conversion for that. That's because we support creating custom > conversions with CREATE CONVERSION. Frankly, I don't think anyone cares > about that feature. If we just dropped the CREATE/DROP CONVERSION > feature altogether and hard-coded the conversions we have, there would > be close to zero complaints. Even if you want to extend something around > encodings and conversions, the CREATE CONVERSION interface is clunky. > Firstly, conversions are per-database, and even schema-qualified, which > just seems like an extra complication. You'll most likely want to modify > the conversion across the whole system. Secondly, rather than define a > new conversion between encodings, you'll likely want to define a whole > new encoding with conversions to/from existing encodings, but you can't > do that anyway without hacking the source code. There's certainly something to be said for that position. If there were any prospect of extensions defining new encodings someday, I'd argue for keeping CREATE CONVERSION. But the performance headaches would be substantial, and there aren't new encodings coming down the pike often enough to justify the work involved, so I don't see us ever doing CREATE ENCODING; and that means that CREATE CONVERSION is of little value. I'd kind of like to see this go just because having catalog accesses involved in encoding conversion setup is messy and fragile. regards, tom lane