Thread: Re: BUG #5661: The character encoding in logfile is confusing.
[moving to pgsql-hackers; this isn't the simple bug I initially suspected it might be]

On 20/09/10 03:10, Tom Lane wrote:
> Craig Ringer <craig@postnewspapers.com.au> writes:
>> One of the correctly encoded messages is "Unexpected EOF received on
>> client connection"
>
>> One of the incorrectly encoded (shift-JIS) messages is: "Fast Shutdown
>> request received". Another is "Aborting any active transactions".
>
>> ... question now is where the messages are converted from UTF-8 to shift-JIS
>> and why that conversion is being applied inconsistently.
>
> Given those three examples, I wonder whether all the mis-encoded
> messages are emitted by the postmaster, rather than backends.
> Anyway it seems that you ought to look for some pattern in which
> messages are correctly vs incorrectly encoded.

I think you're right. Looking into it more, though, I'm not even sure what the correct behaviour is. I don't think this is a simple bug where Pg fails to convert between encodings in a few places; rather, it's a design oversight where the effect of having a system encoding different from the encoding of the database(s) isn't considered.

A single log file should obviously be in a single encoding; it's the only sane way to do things. But which encoding is it in? And which *should* it be in?

- The system text encoding? This is what the postmaster will have from its environment, and is what the user will expect the logs to be in. The postmaster will emit messages in this encoding at least during startup, as it doesn't know what encoding the cluster uses yet. (In fact it seems to stick to the system encoding throughout its life.)

- The default database encoding supplied to initdb during cluster creation?

- The encoding of the database emitting a message? This makes sense when considering RAISE messages, for example. Backends will currently use this encoding when emitting log messages, whether user-supplied or translated from po files.
This confusion leads to the mixed encoding issues reported by the OP. It's not a simple bug, it's a design issue.

Unfortunately, it's not as simple as picking one of the above encodings for all logging. The system encoding isn't a good choice, because it might not be capable of representing all characters emitted by user RAISE statements in databases with a different encoding, nor all "double quoted" identifiers, parameter values, etc. For example, if the system encoding is shift-JIS, but user databases emit messages with Chinese, Cyrillic, extended Latin, or pretty much any non-Japanese characters, there's no sane way to convert messages containing any user text to shift-JIS for logging. The same applies with a latin-1 (iso-8859-1) system encoding and a utf-8 or shift-jis database emitting Japanese messages. Scratch using the system encoding for logging.

What about the encoding used by initdb to create the cluster? It seems sensible, but:

- The postmaster doesn't know what it is when it's doing its initial startup. How can the postmaster complain that it can't find / open the cluster datadir when it doesn't know what encoding to use for the complaint?

- If the cluster isn't created as utf-8, the same issue as with the system encoding applies.

Using the encoding of the emitting database will permit all messages to be represented, but will give rise to mixed encodings in the log file, and still won't help the postmaster know what to do before it's found and read the cluster.

I'm now inclined to propose that all logging be done unconditionally in utf-8, with a BOM written to the start of every log file. Backends with non-utf-8 databases should convert messages to utf-8 for logging. Because PostgreSQL supports the use of different encodings in different databases, this is the only way to ensure sane logging with consistent encoding in a single log file.
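The claim that conversion *to* utf-8 is always possible is easy to sanity-check: a message produced in any server encoding round-trips losslessly through UTF-8, because Unicode is a superset of all of them. A quick sketch in Python (the Japanese text is a hypothetical log message, not an actual PostgreSQL translation):

```python
# Sketch: converting backend messages to UTF-8 for the log is lossless,
# because Unicode is a superset of every server encoding.
# The Japanese text is a hypothetical message, not a real translation.
backend_msg = "トランザクションを中断しています"

as_sjis = backend_msg.encode("shift_jis")              # what a shift-JIS backend emits
as_utf8 = as_sjis.decode("shift_jis").encode("utf-8")  # recoded for the log file

# The round trip back to shift-JIS recovers the original bytes exactly.
assert as_utf8.decode("utf-8").encode("shift_jis") == as_sjis
```

The reverse direction (utf-8 to a narrower encoding) is where characters get lost, which is the whole problem with the other options.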
The only alternative I see is to break logging out into separate files:

- postmaster.log for the postmaster etc
- [databasename].log for each database, in that database's encoding

... but I'm not confident that'd be worth the confusion.

Neither scheme solves the question of what to do when logging to syslog, though. Syslog expects messages in the system encoding, and Pg would be wrong to log in any other encoding. Yet as databases may have characters that cannot be represented in the system encoding, the system encoding isn't good enough. Should syslog messages be converted to the system encoding with non-representable characters replaced by "?" or some other placeholder? Blech.

Ideas? Suggestions?

--
Craig Ringer

Tech-related writing: http://soapyfrogs.blogspot.com/
On ons, 2010-09-22 at 16:25 +0800, Craig Ringer wrote:
> A single log file should obviously be in a single encoding, it's the
> only sane way to do things. But which encoding is it in? And which
> *should* it be in?

We need to produce the log output in the server encoding, because that's how we need to send it to the client. If you have different databases with different server encodings, you will get inconsistently encoded output in the log file.

Conceivably, we could create a configuration option that specifies the encoding for the log file, and strings are recoded from whatever gettext() produces to the specified encoding. initdb could initialize that option suitably, so in most cases users won't have to do anything.
On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
> On ons, 2010-09-22 at 16:25 +0800, Craig Ringer wrote:
>> A single log file should obviously be in a single encoding, it's the
>> only sane way to do things. But which encoding is it in? And which
>> *should* it be in?
>
> We need to produce the log output in the server encoding, because that's
> how we need to send it to the client.

That doesn't mean it can't be recoded for writing to the log file, though. Perhaps it needs to be. It should be reasonably practical to detect when the database and log encoding are the same and avoid the transcoding performance penalty, not that it's big anyway.

> If you have different databases
> with different server encodings, you will get inconsistently encoded
> output in the log file.

I don't think that's an OK answer, myself. Mixed encodings with no delineation in one file = bug as far as I'm concerned. You can't even rely on being able to search the log anymore. You'll only get away with it when using languages that mostly stick to the 7-bit ASCII subset, so most text is still readable; with most other languages you'll get logs full of what looks to the user like garbage.

> Conceivably, we could create a configuration option that specifies the
> encoding for the log file, and strings are recoded from whatever gettext()
> produces to the specified encoding. initdb could initialize that option
> suitably, so in most cases users won't have to do anything.

Yep, I tend to think that'd be the right way to go. It'd still be a bit of a pain, though, as messages written to stdout/stderr by the postmaster should be in the system encoding, but messages written to the log files should be in the encoding specified for logs, unless logging is being done to syslog, in which case it has to be in the system encoding after all...

And, of course, the postmaster still doesn't know how to log anything it might emit before reading postgresql.conf, because it doesn't know what encoding to use.
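A minimal sketch of how Peter's configurable log encoding plus the same-encoding fast path could behave, with Python standing in for the C implementation. The function name and the idea of a `log_encoding` setting are illustrative assumptions here, not existing PostgreSQL APIs:

```python
# Hypothetical sketch of a configurable log encoding with a fast path.
# recode_for_log and its parameters are invented names for illustration;
# this is not PostgreSQL code.
def recode_for_log(message: bytes, db_encoding: str, log_encoding: str) -> bytes:
    if db_encoding.lower() == log_encoding.lower():
        return message  # fast path: same encoding, no transcoding penalty
    return message.decode(db_encoding).encode(log_encoding)

# A utf-8 backend writing to a utf-8 log pays nothing:
msg = "ログメッセージ".encode("utf-8")
assert recode_for_log(msg, "utf-8", "UTF-8") is msg
```

The strict `.encode()` here is also exactly where the unconvertible-character problem surfaces: it raises as soon as the log encoding can't represent the message, which is why a fallback behaviour has to be part of any real design.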
I still wonder if, rather than making this configurable, the right choice is to force logging to UTF-8 (with BOM) across the board, right from postmaster startup. It's consistent, all messages in all other encodings can be converted to UTF-8 for logging, it's platform independent, and text editors etc tend to understand and recognise UTF-8, especially with the BOM.

Unfortunately, because many unix utilities (grep etc) aren't encoding aware, that'll cause problems when people go to search log files. For (eg) "広告掲載", the log files will contain the utf-8 bytes:

  \xe5\xba\x83\xe5\x91\x8a\xe6\x8e\xb2\xe8\xbc\x89

but grep on a shift-jis system will be looking for:

  \x8d\x4c\x8d\x90\x8c\x66\x8d\xda

so it won't match. Ugh.

If only we could say "PostgreSQL requires a system locale with a UTF-8 encoding". Alas, I don't think that'd go down very well with packagers or installers. [Insert rant about how stupid it is that *nix systems still aren't all UTF-8 here].

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/
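The byte-level mismatch is easy to demonstrate: grep compares raw bytes, so a pattern typed on a shift-JIS terminal can never occur in a UTF-8 log. A quick Python sketch of the bytes involved:

```python
# Why byte-oriented tools like grep can't find UTF-8 log text when the
# search pattern arrives in shift-JIS from the user's terminal.
word = "広告掲載"

in_log = word.encode("utf-8")       # bytes actually written to a UTF-8 log
pattern = word.encode("shift_jis")  # bytes grep receives on a shift-JIS system

print(in_log.hex(" "))    # e5 ba 83 e5 91 8a e6 8e b2 e8 bc 89
print(pattern.hex(" "))   # 8d 4c 8d 90 8c 66 8d da

# No byte-substring match, so grep finds nothing:
assert pattern not in in_log
```

(Piping the log through iconv before grepping works around this, but only if the user knows both encodings involved.)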
On Wed, Sep 22, 2010 at 12:25 PM, Craig Ringer <craig@postnewspapers.com.au> wrote:
> I don't think that's an OK answer, myself. Mixed encodings with no
> delineation in one file = bug as far as I'm concerned. You can't even rely
> on being able to search the log anymore. You'll only get away with it when
> using languages that mostly stick to the 7-bit ASCII subset, so most text is
> still readable; with most other languages you'll get logs full of what looks
> to the user like garbage.

This issue crops up periodically in the pgAdmin lists as well, as the mixed encoding sometimes breaks the log viewer.

> I still wonder if, rather than making this configurable, the right choice is
> to force logging to UTF-8 (with BOM) across the board, right from postmaster
> startup. It's consistent, all messages in all other encodings can be
> converted to UTF-8 for logging, it's platform independent, and text editors
> etc tend to understand and recognise UTF-8 especially with the BOM.

That would be ideal for us.

> Unfortunately, because many unix utilities (grep etc) aren't encoding aware,
> that'll cause problems when people go to search log files. For (eg) "広告掲載"

But not for others!

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise Postgres Company
On ons, 2010-09-22 at 19:25 +0800, Craig Ringer wrote:
> Yep, I tend to think that'd be the right way to go. It'd still be a bit
> of a pain, though, as messages written to stdout/stderr by the
> postmaster should be in the system encoding, but messages written to the
> log files should be in the encoding specified for logs, unless logging
> is being done to syslog, in which case it has to be in the system
> encoding after all...

I think that should not be a problem to implement. Those two go through different routines anyway.

> And, of course, the postmaster still doesn't know how to log anything it
> might emit before reading postgresql.conf, because it doesn't know what
> encoding to use.

That should also not be a big issue. The postmaster needs the configuration file to know where to write the log file anyway.

> I still wonder if, rather than making this configurable, the right
> choice is to force logging to UTF-8 (with BOM) across the board, right
> from postmaster startup. It's consistent, all messages in all other
> encodings can be converted to UTF-8 for logging, it's platform
> independent, and text editors etc tend to understand and recognise UTF-8
> especially with the BOM.

I don't think this would make things better or easier. At some point you're going to have to insert a recode call, and it doesn't matter much whether the destination argument is a constant or a variable.
Craig Ringer <craig@postnewspapers.com.au> writes:
> On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
>> We need to produce the log output in the server encoding, because that's
>> how we need to send it to the client.

> That doesn't mean it can't be recoded for writing to the log file,
> though. Perhaps it needs to be. It should be reasonably practical to
> detect when the database and log encoding are the same and avoid the
> transcoding performance penalty, not that it's big anyway.

We have seen ... and rejected ... such proposals before. The problem is that "transcode to some other encoding" is not a simple and guaranteed error-free operation. As an example, if you choose to name some table using a character that doesn't exist in the log encoding, you have just ensured that no message about that table will ever get to the log. Nice way to hide your activities from the DBA ;-)

Transcoding also eats memory, which might be in exceedingly short supply while trying to report an "out of memory" error; and IIRC there are some other failure scenarios to be concerned about.

We could maybe accept a design for this that included a sufficiently well-thought-out set of fallback behaviors. But we haven't seen one yet.

			regards, tom lane
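The unrepresentable-identifier failure mode is concrete: a strict recode aborts on the first character the log encoding lacks, so the whole message is lost, not just the identifier. A Python sketch (the Korean table name is hypothetical, chosen because hangul has no shift-JIS mapping):

```python
# A table name with characters outside the log encoding makes a strict
# transcode fail outright - the entire message, not just the name,
# would never reach the log. Hypothetical message text.
message = 'relation "주문" does not exist'

try:
    message.encode("shift_jis")         # strict conversion
except UnicodeEncodeError as err:
    print("message lost:", err.reason)  # nothing at all reaches the log
```

This is exactly why any transcoding design needs a defined fallback behaviour rather than strict conversion.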
Peter Eisentraut <peter_e@gmx.net> writes:
> On ons, 2010-09-22 at 19:25 +0800, Craig Ringer wrote:
>> I still wonder if, rather than making this configurable, the right
>> choice is to force logging to UTF-8 (with BOM) across the board,

> I don't think this would make things better or easier. At some point
> you're going to have to insert a recode call, and it doesn't matter much
> whether the destination argument is a constant or a variable.

It'd avoid the problem of having possibly-unconvertible messages ... at the cost of pissing off users who have a uniform server encoding selection already and don't see why they should be forced to deal with UTF8 in the log.

It's pretty much just one step from here to deciding that the server should work exclusively in UTF8 and never mind all those other legacy encodings. We've resisted that attitude for quite some years now, and are probably not really ready to adopt it for the log either.

			regards, tom lane
Hi Craig,

Almost all Japanese software emits log files in the encoding of the server the software is running on. I'm not sure whether it is the best way or not, but Japanese users take it for granted. So I feel that Japanese users would hope that the postgres server behaves the same way as other software, because many administrators in Japan are familiar and experienced with that way.

On Unix, the user can specify the default character encoding at install time. Software can get it by referring to the environment variable $LANG, e.g.

> % echo $LANG
> ja_JP.eucJP

On Japanese Windows, the default encoding is MS-932 (or cp-932 or Windows-31J). This is fixed. MS-932 is almost the same as Shift-JIS, but a very few characters have different character codes between MS-932 and Shift-JIS. And Shift-JIS doesn't have some characters that are in MS-932. This is a very important issue. It has been the source of a lot of related bugs, e.g.:

http://bugs.mysql.com/bug.php?id=7607

And if postgres could be configured to emit a log file with raw English messages, some users would choose that if translating the messages to Japanese has some costs. Some administrators in Japan don't hate reading English messages. (Much software is not user friendly for non-English users. Many Japanese users are surprised and impressed that postgres emits Japanese messages in the log file.)

Thank you.

=Mikio

--
<tkbysh2000@yahoo.co.jp>

On Wed, 22 Sep 2010 19:25:47 +0800
Craig Ringer <craig@postnewspapers.com.au> wrote:

> On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
> > On ons, 2010-09-22 at 16:25 +0800, Craig Ringer wrote:
> >> A single log file should obviously be in a single encoding, it's the
> >> only sane way to do things. But which encoding is it in? And which
> >> *should* it be in?
> >
> > We need to produce the log output in the server encoding, because that's
> > how we need to send it to the client.
>
> That doesn't mean it can't be recoded for writing to the log file,
> though. Perhaps it needs to be.
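Mikio's MS-932 point is worth pinning down: cp932 and strict Shift-JIS disagree on a handful of characters, which is exactly the kind of mismatch behind the MySQL bug he links. A Python sketch using the best-known case, U+FF5E FULLWIDTH TILDE:

```python
# MS-932 (cp932) vs strict Shift-JIS: U+FF5E FULLWIDTH TILDE exists in
# cp932, but strict Shift-JIS assigns the same byte sequence to
# U+301C WAVE DASH instead - a classic round-trip trap.
fullwidth_tilde = "\uff5e"

encoded = fullwidth_tilde.encode("cp932")   # fine on Japanese Windows
try:
    fullwidth_tilde.encode("shift_jis")     # strict codec refuses
except UnicodeEncodeError:
    print("not representable in strict Shift-JIS")

# The same bytes decode to a *different* code point under each codec:
assert encoded.decode("shift_jis") == "\u301c"   # WAVE DASH
assert encoded.decode("cp932") == "\uff5e"       # FULLWIDTH TILDE
```

So even "the same" encoding on Windows and Unix can silently change characters, which argues for being explicit about the log encoding rather than inheriting whatever the platform default happens to be.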
On 09/22/2010 09:55 PM, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> On ons, 2010-09-22 at 19:25 +0800, Craig Ringer wrote:
>>> I still wonder if, rather than making this configurable, the right
>>> choice is to force logging to UTF-8 (with BOM) across the board,
>
>> I don't think this would make things better or easier. At some point
>> you're going to have to insert a recode call, and it doesn't matter much
>> whether the destination argument is a constant or a variable.
>
> It'd avoid the problem of having possibly-unconvertible messages ...
> at the cost of pissing off users who have a uniform server encoding
> selection already and don't see why they should be forced to deal with
> UTF8 in the log.
>
> It's pretty much just one step from here to deciding that the server
> should work exclusively in UTF8 and never mind all those other legacy
> encodings. We've resisted that attitude for quite some years now,
> and are probably not really ready to adopt it for the log either.

Fair enough. The current approach is broken, though. Mis-encoded messages the user can't read are little more use to them than messages that are never logged.

I see four options here (two of which are practical IMO):

(1) Log in UTF-8, convert everything to UTF-8. Better for admin tools & apps, sucks for OS utilities/grep/etc on non-utf-8 locales. Preserves all messages no matter what the database and system encodings are.

(2) Log in the default encoding for the locale, converting all messages to that encoding. Where characters cannot be represented in the target encoding, replace them with a placeholder ("?" or something). Better - but far from good - for OS utilities/grep/etc, sucks for admin tools and apps. Doesn't preserve all messages properly if the user has databases in encodings other than the system encoding.

(3) Have a log for the postmaster in the default locale for the system. Have a log file for each database that's in the encoding for that database.
IMO this is the worst of both worlds, but it does preserve original encodings without transcoding or forcing a particular encoding, and does preserve messages. Horribly complicated for admin tools, inconsistent and horrid for grep etc.

(4) Keep things much as they are, but log an encoding identifier prefix for each line. Lets GUI/admin tools post-process the logs into something sane, and permits automated log processing because line encodings are known. Sucks for shell tools, which can't tell which lines are which; we'd need to provide a "pggrep" and "pgless" for reliable log search! Preserves all messages, but not in a reliably searchable manner.

(0) Change nothing. Log all messages in the original encoding they were generated in. Perform no conversion. Logs contain mixed encodings. Horrible for admin/gui tools (broken text). Horrible for shell utilities/OS tools (can't trust grep results etc). Automatic log processing impossible, as the encoding for each line isn't known and can't be reliably discovered.

As far as I'm concerned, (3) is out. It's horrible. I don't think the status quo (0) is OK either; it's producing broken log files. (4) is pretty awful too, but it's the smallest change that kind-of fixes the issue, to the point where it's at least possible for PgAdmin etc to convert the logs into a consistent encoding.

IMO it's down to (1) and (2). There's no clear consensus between those two, so I'd be inclined to offer the admin the choice between them as a config option, depending on the trade-off they prefer to make. For sensible systems in a utf-8 locale, (1) and (2) are equivalent, and (2) is fine for systems where the database encoding is always the same as the system encoding. It's only for systems with a non-utf-8 locale that use databases in encodings other than the system locale's encoding that problems arise. In this case they're going to get suboptimal results one way or the other; it's just a matter of letting them pick how.

Thoughts?
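Option (2)'s placeholder behaviour can be sketched directly in Python, whose encode error handler `errors="replace"` substitutes "?" for each unrepresentable character (the message and table name below are hypothetical):

```python
# Option (2) sketched: recode each message to the log encoding and
# replace unrepresentable characters with a placeholder, keeping the
# rest of the message readable. Hypothetical message text.
msg = 'relation "주문서" does not exist'

logged = msg.encode("shift_jis", errors="replace")
print(logged.decode("shift_jis"))   # relation "???" does not exist
```

The message survives in searchable form, but the identifier itself is gone, which is the trade-off against option (1).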
I should ask on the various language-specific mailing lists and see what people there have to say about it. Maybe it doesn't affect people enough in practice for them to care.

--
Craig Ringer
On 22/09/2010 9:41 PM, Tom Lane wrote:
> Craig Ringer <craig@postnewspapers.com.au> writes:
>> On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
>>> We need to produce the log output in the server encoding, because that's
>>> how we need to send it to the client.
>
>> That doesn't mean it can't be recoded for writing to the log file,
>> though. Perhaps it needs to be. It should be reasonably practical to
>> detect when the database and log encoding are the same and avoid the
>> transcoding performance penalty, not that it's big anyway.
>
> We have seen ... and rejected ... such proposals before. The problem is
> that "transcode to some other encoding" is not a simple and guaranteed
> error-free operation. As an example, if you choose to name some table
> using a character that doesn't exist in the log encoding, you have just
> ensured that no message about that table will ever get to the log.

Well, an arguably reasonable if still suboptimal approach is to mask out characters without any representation in the target encoding, replacing them with a substitute ("?" or whatever). The rest of the log message is still emitted that way.

Currently, Pg may as well be emitting "!@#!#!#!@#$!@#$" for these log records. It's garbage unless the user's editor/log viewer/whatever happens to use the encoding of that set of messages, turning all the others into garbage instead. To interpret them, I had to keep switching encodings in my viewer. It's not a big deal with languages that mostly use the 7-bit ascii space most encodings share, but for Russian, Chinese, Japanese, Thai, the various Indian languages, etc it's pretty awful, as seen in Mikio's example log files.

> Nice way to hide your activities from the DBA ;-)

Emitting messages in the wrong encoding doesn't do the DBA any favours either. Automated log analysis and reporting will have a hard time dealing with the logs, and the DBA will have to keep on switching encodings in their editor/viewer to interpret or search the logs.
Assuming they know how, and know they need to.

> Transcoding also
> eats memory, which might be in exceedingly short supply while trying
> to report an "out of memory" error; and IIRC there are some other
> failure scenarios to be concerned about.

Yep, that's certainly a problem. Pre-transcoding messages on backend start isn't particularly desirable (wasted startup time, memory), and neither is pre-allocating extra memory for use on fatal exit paths. OTOH, don't the current message translations also cost at least some memory?

I don't have a good answer for this issue. Only rather less-than-good ideas like: mmap() a file the postmaster generates that contains various fatal messages, already in the right encodings/translations, with an offset table at the front? Icky, but effective, and it doesn't waste precious shared memory or produce new unsharable allocations in the backends that'll only ever get used when something breaks.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/