Thread: Re: BUG #5661: The character encoding in logfile is confusing.
[moving to pgsql-hackers; this isn't the simple bug I initially suspected it might be]

On 20/09/10 03:10, Tom Lane wrote:
> Craig Ringer <craig@postnewspapers.com.au> writes:
>> One of the correctly encoded messages is "Unexpected EOF received on
>> client connection"
>
>> One of the incorrectly encoded (shift-JIS) messages is: "Fast Shutdown
>> request received". Another is "Aborting any active transactions".
>
>> ... question now is where the messages are converted from UTF-8 to shift-JIS
>> and why that conversion is being applied inconsistently.
>
> Given those three examples, I wonder whether all the mis-encoded
> messages are emitted by the postmaster, rather than backends.
> Anyway it seems that you ought to look for some pattern in which
> messages are correctly vs incorrectly encoded.

I think you're right. Looking into it more, though, I'm not even sure what the correct behaviour is. I don't think this is a simple bug where Pg fails to convert between encodings in a few places; rather, it's a design oversight where the effect of having a system encoding different from the encoding of the database(s) isn't considered.

A single log file should obviously be in a single encoding; it's the only sane way to do things. But which encoding is it in? And which *should* it be in?

- The system text encoding? This is what the postmaster will have from its environment, and is what the user will expect the logs to be in. The postmaster will emit messages in this encoding at least during startup, as it doesn't know what encoding the cluster uses yet. (In fact it seems to stick to the system encoding throughout its life.)

- The default database encoding supplied to initdb during cluster creation?

- The encoding of the database emitting a message? This makes sense when considering RAISE messages, for example. Backends will currently use this encoding when emitting log messages, whether user-supplied or translated from po files.
This confusion leads to the mixed encoding issues reported by the OP. It's not a simple bug, it's a design issue.

Unfortunately, it's not as simple as picking one of the above encodings for all logging. The system encoding isn't a good choice, because it might not be capable of representing all characters emitted by user RAISE statements in databases with a different encoding, nor all "double quoted" identifiers, parameter values, etc. For example, if the system encoding is shift-JIS, but user databases emit messages with Chinese, Cyrillic, extended Latin, or pretty much any non-Japanese characters, there's no sane way to convert messages containing any user text to shift-JIS for logging. The same applies with a latin-1 (iso-8859-1) system encoding and a utf-8 or shift-jis database emitting Japanese messages. Scratch using the system encoding for logging.

What about the encoding used by initdb to create the cluster? It seems sensible, but:

- The postmaster doesn't know what it is when it's doing its initial startup. How can the postmaster complain that it can't find / open the cluster datadir when it doesn't know what encoding to use for the complaint?

- If the cluster isn't created as utf-8, the same issue as with the system encoding applies.

Using the encoding of the emitting database will permit all messages to be represented, but will give rise to mixed encodings in the log file, and still won't help the postmaster know what to do before it's found and read the cluster.

I'm now inclined to propose that all logging be done unconditionally in utf-8, with a BOM written to the start of every log file. Backends with non-utf-8 databases should convert messages to utf-8 for logging. Because PostgreSQL supports the use of different encodings in different databases, this is the only way to ensure sane logging with consistent encoding in a single log file.
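The claim that conversion *to* utf-8 is always possible is easy to sanity-check: a message produced in any server encoding round-trips losslessly through UTF-8, because Unicode is a superset of all of them. A quick sketch in Python (the Japanese text is a hypothetical log message, not an actual PostgreSQL translation):

```python
# Sketch: converting backend messages to UTF-8 for the log is lossless,
# because Unicode is a superset of every server encoding.
# The Japanese text is a hypothetical message, not a real translation.
backend_msg = "トランザクションを中断しています"

as_sjis = backend_msg.encode("shift_jis")              # what a shift-JIS backend emits
as_utf8 = as_sjis.decode("shift_jis").encode("utf-8")  # recoded for the log file

# The round trip back to shift-JIS recovers the original bytes exactly.
assert as_utf8.decode("utf-8").encode("shift_jis") == as_sjis
```

The reverse direction (utf-8 to a narrower encoding) is where characters get lost, which is the whole problem with the other options.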
The only alternative I see is to break logging out into separate files:

- postmaster.log for the postmaster etc
- [databasename].log for each database, in that database's encoding

... but I'm not confident that'd be worth the confusion.

Neither scheme solves the question of what to do when logging to syslog, though. Syslog expects messages in the system encoding, and Pg would be wrong to log in any other encoding. Yet as databases may have characters that cannot be represented in the system encoding, the system encoding isn't good enough. Should syslog messages be converted to the system encoding with non-representable characters replaced by "?" or some other placeholder? Blech.

Ideas? Suggestions?

--
Craig Ringer

Tech-related writing: http://soapyfrogs.blogspot.com/
On ons, 2010-09-22 at 16:25 +0800, Craig Ringer wrote:
> A single log file should obviously be in a single encoding, it's the
> only sane way to do things. But which encoding is it in? And which
> *should* it be in?

We need to produce the log output in the server encoding, because that's how we need to send it to the client. If you have different databases with different server encodings, you will get inconsistently encoded output in the log file.

Conceivably, we could create a configuration option that specifies the encoding for the log file, and strings are recoded from whatever gettext() produces to the specified encoding. initdb could initialize that option suitably, so in most cases users won't have to do anything.
On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
> On ons, 2010-09-22 at 16:25 +0800, Craig Ringer wrote:
>> A single log file should obviously be in a single encoding, it's the
>> only sane way to do things. But which encoding is it in? And which
>> *should* it be in?
>
> We need to produce the log output in the server encoding, because that's
> how we need to send it to the client.

That doesn't mean it can't be recoded for writing to the log file, though. Perhaps it needs to be. It should be reasonably practical to detect when the database and log encoding are the same and avoid the transcoding performance penalty, not that it's big anyway.

> If you have different databases
> with different server encodings, you will get inconsistently encoded
> output in the log file.

I don't think that's an OK answer, myself. Mixed encodings with no delineation in one file = bug as far as I'm concerned. You can't even rely on being able to search the log anymore. You'll only get away with it when using languages that mostly stick to the 7-bit ASCII subset, so most text is still readable; with most other languages you'll get logs full of what looks to the user like garbage.

> Conceivably, we could create a configuration option that specifies the
> encoding for the log file, and strings are recoded from whatever gettext()
> produces to the specified encoding. initdb could initialize that option
> suitably, so in most cases users won't have to do anything.

Yep, I tend to think that'd be the right way to go. It'd still be a bit of a pain, though, as messages written to stdout/stderr by the postmaster should be in the system encoding, but messages written to the log files should be in the encoding specified for logs, unless logging is being done to syslog, in which case it has to be in the system encoding after all...

And, of course, the postmaster still doesn't know how to log anything it might emit before reading postgresql.conf, because it doesn't know what encoding to use.
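A minimal sketch of how Peter's configurable log encoding plus the same-encoding fast path could behave, with Python standing in for the C implementation. The function name and the idea of a `log_encoding` setting are illustrative assumptions here, not existing PostgreSQL APIs:

```python
# Hypothetical sketch of a configurable log encoding with a fast path.
# recode_for_log and its parameters are invented names for illustration;
# this is not PostgreSQL code.
def recode_for_log(message: bytes, db_encoding: str, log_encoding: str) -> bytes:
    if db_encoding.lower() == log_encoding.lower():
        return message  # fast path: same encoding, no transcoding penalty
    return message.decode(db_encoding).encode(log_encoding)

# A utf-8 backend writing to a utf-8 log pays nothing:
msg = "ログメッセージ".encode("utf-8")
assert recode_for_log(msg, "utf-8", "UTF-8") is msg
```

The strict `.encode()` here is also exactly where the unconvertible-character problem surfaces: it raises as soon as the log encoding can't represent the message, which is why a fallback behaviour has to be part of any real design.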
I still wonder if, rather than making this configurable, the right choice is to force logging to UTF-8 (with BOM) across the board, right from postmaster startup. It's consistent, all messages in all other encodings can be converted to UTF-8 for logging, it's platform independent, and text editors etc tend to understand and recognise UTF-8, especially with the BOM.

Unfortunately, because many unix utilities (grep etc) aren't encoding aware, that'll cause problems when people go to search log files. For (eg) "広告掲載", the log files will contain the utf-8 bytes:

  \xe5\xba\x83\xe5\x91\x8a\xe6\x8e\xb2\xe8\xbc\x89

but grep on a shift-jis system will be looking for:

  \x8d\x4c\x8d\x90\x8c\x66\x8d\xda

so it won't match. Ugh.

If only we could say "PostgreSQL requires a system locale with a UTF-8 encoding". Alas, I don't think that'd go down very well with packagers or installers. [Insert rant about how stupid it is that *nix systems still aren't all UTF-8 here].

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/
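The byte-level mismatch is easy to demonstrate: grep compares raw bytes, so a pattern typed on a shift-JIS terminal can never occur in a UTF-8 log. A quick Python sketch of the bytes involved:

```python
# Why byte-oriented tools like grep can't find UTF-8 log text when the
# search pattern arrives in shift-JIS from the user's terminal.
word = "広告掲載"

in_log = word.encode("utf-8")       # bytes actually written to a UTF-8 log
pattern = word.encode("shift_jis")  # bytes grep receives on a shift-JIS system

print(in_log.hex(" "))    # e5 ba 83 e5 91 8a e6 8e b2 e8 bc 89
print(pattern.hex(" "))   # 8d 4c 8d 90 8c 66 8d da

# No byte-substring match, so grep finds nothing:
assert pattern not in in_log
```

(Piping the log through iconv before grepping works around this, but only if the user knows both encodings involved.)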
On Wed, Sep 22, 2010 at 12:25 PM, Craig Ringer <craig@postnewspapers.com.au> wrote:
> I don't think that's an OK answer, myself. Mixed encodings with no
> delineation in one file = bug as far as I'm concerned. You can't even rely
> on being able to search the log anymore. You'll only get away with it when
> using languages that mostly stick to the 7-bit ASCII subset, so most text is
> still readable; with most other languages you'll get logs full of what looks
> to the user like garbage.

This issue crops up periodically in the pgAdmin lists as well, as the mixed encoding sometimes breaks the log viewer.

> I still wonder if, rather than making this configurable, the right choice is
> to force logging to UTF-8 (with BOM) across the board, right from postmaster
> startup. It's consistent, all messages in all other encodings can be
> converted to UTF-8 for logging, it's platform independent, and text editors
> etc tend to understand and recognise UTF-8 especially with the BOM.

That would be ideal for us.

> Unfortunately, because many unix utilities (grep etc) aren't encoding aware,
> that'll cause problems when people go to search log files. For (eg) "広告掲載"

But not for others!

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise Postgres Company
On ons, 2010-09-22 at 19:25 +0800, Craig Ringer wrote:
> Yep, I tend to think that'd be the right way to go. It'd still be a bit
> of a pain, though, as messages written to stdout/stderr by the
> postmaster should be in the system encoding, but messages written to the
> log files should be in the encoding specified for logs, unless logging
> is being done to syslog, in which case it has to be in the system
> encoding after all...

I think that should not be a problem to implement. Those two go through different routines anyway.

> And, of course, the postmaster still doesn't know how to log anything it
> might emit before reading postgresql.conf, because it doesn't know what
> encoding to use.

That should also not be a big issue. The postmaster needs the configuration file to know where to write the log file anyway.

> I still wonder if, rather than making this configurable, the right
> choice is to force logging to UTF-8 (with BOM) across the board, right
> from postmaster startup. It's consistent, all messages in all other
> encodings can be converted to UTF-8 for logging, it's platform
> independent, and text editors etc tend to understand and recognise UTF-8
> especially with the BOM.

I don't think this would make things better or easier. At some point you're going to have to insert a recode call, and it doesn't matter much whether the destination argument is a constant or a variable.
Craig Ringer <craig@postnewspapers.com.au> writes:
> On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
>> We need to produce the log output in the server encoding, because that's
>> how we need to send it to the client.

> That doesn't mean it can't be recoded for writing to the log file,
> though. Perhaps it needs to be. It should be reasonably practical to
> detect when the database and log encoding are the same and avoid the
> transcoding performance penalty, not that it's big anyway.

We have seen ... and rejected ... such proposals before. The problem is that "transcode to some other encoding" is not a simple and guaranteed error-free operation. As an example, if you choose to name some table using a character that doesn't exist in the log encoding, you have just ensured that no message about that table will ever get to the log. Nice way to hide your activities from the DBA ;-)

Transcoding also eats memory, which might be in exceedingly short supply while trying to report an "out of memory" error; and IIRC there are some other failure scenarios to be concerned about.

We could maybe accept a design for this that included a sufficiently well-thought-out set of fallback behaviors. But we haven't seen one yet.

			regards, tom lane
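The unrepresentable-identifier failure mode is concrete: a strict recode aborts on the first character the log encoding lacks, so the whole message is lost, not just the identifier. A Python sketch (the Korean table name is hypothetical, chosen because hangul has no shift-JIS mapping):

```python
# A table name with characters outside the log encoding makes a strict
# transcode fail outright - the entire message, not just the name,
# would never reach the log. Hypothetical message text.
message = 'relation "주문" does not exist'

try:
    message.encode("shift_jis")         # strict conversion
except UnicodeEncodeError as err:
    print("message lost:", err.reason)  # nothing at all reaches the log
```

This is exactly why any transcoding design needs a defined fallback behaviour rather than strict conversion.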
Peter Eisentraut <peter_e@gmx.net> writes:
> On ons, 2010-09-22 at 19:25 +0800, Craig Ringer wrote:
>> I still wonder if, rather than making this configurable, the right
>> choice is to force logging to UTF-8 (with BOM) across the board,

> I don't think this would make things better or easier. At some point
> you're going to have to insert a recode call, and it doesn't matter much
> whether the destination argument is a constant or a variable.

It'd avoid the problem of having possibly-unconvertible messages ... at the cost of pissing off users who have a uniform server encoding selection already and don't see why they should be forced to deal with UTF8 in the log.

It's pretty much just one step from here to deciding that the server should work exclusively in UTF8 and never mind all those other legacy encodings. We've resisted that attitude for quite some years now, and are probably not really ready to adopt it for the log either.

			regards, tom lane
Hi Craig,

Almost all Japanese software emits log files in the encoding of the server the software is running on. I'm not sure whether it is the best way or not, but Japanese users take it for granted. So I feel that Japanese users would hope that the postgres server behaves the same way as other software, because many administrators in Japan are familiar and experienced with that way.

On Unix, the user can specify the default character encoding at install time. Software can get it by referring to the environment variable $LANG, e.g.

> % echo $LANG
> ja_JP.eucJP

On Japanese Windows, the default encoding is MS-932 (or cp-932 or Windows-31J). This is fixed. MS-932 is almost the same as Shift-JIS, but a very few characters have different character codes between MS-932 and Shift-JIS. And Shift-JIS doesn't have some characters that are in MS-932. This is a very important issue. It has been the source of a lot of related bugs, e.g.:

http://bugs.mysql.com/bug.php?id=7607

And if postgres could be configured to emit a log file with raw English messages, some users would choose that if translating the messages to Japanese has some costs. Some administrators in Japan don't hate reading English messages. (Much software is not user friendly for non-English users. Many Japanese users are surprised and impressed that postgres emits Japanese messages in the log file.)

Thank you.

=Mikio

--
<tkbysh2000@yahoo.co.jp>

On Wed, 22 Sep 2010 19:25:47 +0800
Craig Ringer <craig@postnewspapers.com.au> wrote:

> On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
> > On ons, 2010-09-22 at 16:25 +0800, Craig Ringer wrote:
> >> A single log file should obviously be in a single encoding, it's the
> >> only sane way to do things. But which encoding is it in? And which
> >> *should* it be in?
> >
> > We need to produce the log output in the server encoding, because that's
> > how we need to send it to the client.
>
> That doesn't mean it can't be recoded for writing to the log file,
> though. Perhaps it needs to be.
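Mikio's MS-932 point is worth pinning down: cp932 and strict Shift-JIS disagree on a handful of characters, which is exactly the kind of mismatch behind the MySQL bug he links. A Python sketch using the best-known case, U+FF5E FULLWIDTH TILDE:

```python
# MS-932 (cp932) vs strict Shift-JIS: U+FF5E FULLWIDTH TILDE exists in
# cp932, but strict Shift-JIS assigns the same byte sequence to
# U+301C WAVE DASH instead - a classic round-trip trap.
fullwidth_tilde = "\uff5e"

encoded = fullwidth_tilde.encode("cp932")   # fine on Japanese Windows
try:
    fullwidth_tilde.encode("shift_jis")     # strict codec refuses
except UnicodeEncodeError:
    print("not representable in strict Shift-JIS")

# The same bytes decode to a *different* code point under each codec:
assert encoded.decode("shift_jis") == "\u301c"   # WAVE DASH
assert encoded.decode("cp932") == "\uff5e"       # FULLWIDTH TILDE
```

So even "the same" encoding on Windows and Unix can silently change characters, which argues for being explicit about the log encoding rather than inheriting whatever the platform default happens to be.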
On 09/22/2010 09:55 PM, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> On ons, 2010-09-22 at 19:25 +0800, Craig Ringer wrote:
>>> I still wonder if, rather than making this configurable, the right
>>> choice is to force logging to UTF-8 (with BOM) across the board,
>
>> I don't think this would make things better or easier. At some point
>> you're going to have to insert a recode call, and it doesn't matter much
>> whether the destination argument is a constant or a variable.
>
> It'd avoid the problem of having possibly-unconvertible messages ...
> at the cost of pissing off users who have a uniform server encoding
> selection already and don't see why they should be forced to deal with
> UTF8 in the log.
>
> It's pretty much just one step from here to deciding that the server
> should work exclusively in UTF8 and never mind all those other legacy
> encodings. We've resisted that attitude for quite some years now,
> and are probably not really ready to adopt it for the log either.

Fair enough. The current approach is broken, though. Mis-encoded messages the user can't read are little more use to them than messages that are never logged.

I see four options here (two of which are practical IMO):

(1) Log in UTF-8, convert everything to UTF-8. Better for admin tools & apps, sucks for OS utilities/grep/etc on non-utf-8 locales. Preserves all messages no matter what the database and system encodings are.

(2) Log in the default encoding for the locale, converting all messages to that encoding. Where characters cannot be represented in the target encoding, replace them with a placeholder ("?" or something). Better - but far from good - for OS utilities/grep/etc, sucks for admin tools and apps. Doesn't preserve all messages properly if the user has databases in encodings other than the system encoding.

(3) Have a log for the postmaster in the default locale for the system. Have a log file for each database that's in the encoding for that database.
IMO this is the worst of both worlds, but it does preserve original encodings without transcoding or forcing a particular encoding, and does preserve messages. Horribly complicated for admin tools, inconsistent and horrid for grep etc.

(4) Keep things much as they are, but log an encoding identifier prefix for each line. Lets GUI/admin tools post-process the logs into something sane, and permits automated log processing because line encodings are known. Sucks for shell tools, which can't tell which lines are which; we'd need to provide a "pggrep" and "pgless" for reliable log search! Preserves all messages, but not in a reliably searchable manner.

(0) Change nothing. Log all messages in the original encoding they were generated in. Perform no conversion. Logs contain mixed encodings. Horrible for admin/gui tools (broken text). Horrible for shell utilities/OS tools (can't trust grep results etc). Automatic log processing impossible, as the encoding for each line isn't known and can't be reliably discovered.

As far as I'm concerned, (3) is out. It's horrible. I don't think the status quo (0) is OK either; it's producing broken log files. (4) is pretty awful too, but it's the smallest change that kind-of fixes the issue, to the point where it's at least possible for PgAdmin etc to convert the logs into a consistent encoding.

IMO it's down to (1) and (2). There's no clear consensus between those two, so I'd be inclined to offer the admin the choice between them as a config option, depending on the trade-off they prefer to make. For sensible systems in a utf-8 locale, (1) and (2) are equivalent, and (2) is fine for systems where the database encoding is always the same as the system encoding. It's only for systems with a non-utf-8 locale that use databases in encodings other than the system locale's encoding that problems arise. In this case they're going to get suboptimal results one way or the other; it's just a matter of letting them pick how.

Thoughts?
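Option (2)'s placeholder behaviour can be sketched directly in Python, whose encode error handler `errors="replace"` substitutes "?" for each unrepresentable character (the message and table name below are hypothetical):

```python
# Option (2) sketched: recode each message to the log encoding and
# replace unrepresentable characters with a placeholder, keeping the
# rest of the message readable. Hypothetical message text.
msg = 'relation "주문서" does not exist'

logged = msg.encode("shift_jis", errors="replace")
print(logged.decode("shift_jis"))   # relation "???" does not exist
```

The message survives in searchable form, but the identifier itself is gone, which is the trade-off against option (1).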
I should ask on the various language-specific mailing lists and see what people there have to say about it. Maybe it doesn't affect people enough in practice for them to care.

--
Craig Ringer
On 22/09/2010 9:41 PM, Tom Lane wrote:
> Craig Ringer <craig@postnewspapers.com.au> writes:
>> On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
>>> We need to produce the log output in the server encoding, because that's
>>> how we need to send it to the client.
>
>> That doesn't mean it can't be recoded for writing to the log file,
>> though. Perhaps it needs to be. It should be reasonably practical to
>> detect when the database and log encoding are the same and avoid the
>> transcoding performance penalty, not that it's big anyway.
>
> We have seen ... and rejected ... such proposals before. The problem is
> that "transcode to some other encoding" is not a simple and guaranteed
> error-free operation. As an example, if you choose to name some table
> using a character that doesn't exist in the log encoding, you have just
> ensured that no message about that table will ever get to the log.

Well, an arguably reasonable if still suboptimal approach is to mask out characters without any representation in the target encoding, replacing them with a substitute ("?" or whatever). The rest of the log message is still emitted that way.

Currently, Pg may as well be emitting "!@#!#!#!@#$!@#$" for these log records. It's garbage unless the user's editor/log viewer/whatever happens to use the encoding of that set of messages, turning all the others into garbage instead. To interpret them, I had to keep switching encodings in my viewer. It's not a big deal with languages that mostly use the 7-bit ascii space most encodings share, but for Russian, Chinese, Japanese, Thai, the various Indian languages, etc it's pretty awful, as seen in Mikio's example log files.

> Nice way to hide your activities from the DBA ;-)

Emitting messages in the wrong encoding doesn't do the DBA any favours either. Automated log analysis and reporting will have a hard time dealing with the logs, and the DBA will have to keep on switching encodings in their editor/viewer to interpret or search the logs.
Assuming they know how, and know they need to.

> Transcoding also
> eats memory, which might be in exceedingly short supply while trying
> to report an "out of memory" error; and IIRC there are some other
> failure scenarios to be concerned about.

Yep, that's certainly a problem. Pre-transcoding messages on backend start isn't particularly desirable (wasted startup time, memory), and neither is pre-allocating extra memory for use on fatal exit paths. OTOH, don't the current message translations also cost at least some memory?

I don't have a good answer for this issue. Only rather less-than-good ideas like: mmap() a file the postmaster generates that contains various fatal messages, already in the right encodings/translations, with an offset table at the front? Icky, but effective, and it doesn't waste precious shared memory or produce new unsharable allocations in the backends that'll only ever get used when something breaks.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/