Re: BUG #5661: The character encoding in logfile is confusing. - Mailing list pgsql-hackers

On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
> On ons, 2010-09-22 at 16:25 +0800, Craig Ringer wrote:
>> A single log file should obviously be in a single encoding, it's the
>> only sane way to do things. But which encoding is it in? And which
>> *should* it be in?
>
> We need to produce the log output in the server encoding, because that's
> how we need to send it to the client.

That doesn't mean it can't be recoded for writing to the log file, 
though. Perhaps it needs to be. It should be reasonably practical to 
detect when the database and log encoding are the same and avoid the 
transcoding performance penalty, not that it's big anyway.
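
To make the "skip transcoding when the encodings match" idea concrete, here is a minimal sketch in Python rather than the backend's C. The function name recode_for_log and its parameters are made up for illustration and are not part of PostgreSQL; the only assumption is that both encodings are known by name at the point of writing.

```python
import codecs


def recode_for_log(raw: bytes, server_encoding: str, log_encoding: str) -> bytes:
    """Hypothetical helper: return `raw` recoded from the server encoding
    to the log encoding, skipping the work when both are the same."""
    # codecs.lookup() normalizes aliases, so "UTF8" and "utf-8" compare equal.
    if codecs.lookup(server_encoding).name == codecs.lookup(log_encoding).name:
        return raw  # fast path: nothing to transcode
    # Replace unconvertible characters instead of failing; a logger
    # should not raise while trying to report something else.
    text = raw.decode(server_encoding, errors="replace")
    return text.encode(log_encoding, errors="replace")
```

The fast path is just a name comparison, so the common single-encoding installation would pay essentially nothing.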

> If you have different databases
> with different server encodings, you will get inconsistently encoded
> output in the log file.

I don't think that's an OK answer, myself. Mixed encodings with no 
delineation in one file = bug as far as I'm concerned. You can't even 
rely on being able to search the log anymore. You'll only get away with 
it when using languages that mostly stick to the 7-bit ASCII subset, so 
most text is still readable; with most other languages you'll get logs 
full of what looks to the user like garbage.

> Conceivably, we could create a configuration option that specifies the
> encoding for the log file, and strings are recoded from whatever gettext()
> produces to the specified encoding.  initdb could initialize that option
> suitably, so in most cases users won't have to do anything.

Yep, I tend to think that'd be the right way to go. It'd still be a bit 
of a pain, though, as messages written to stdout/stderr by the 
postmaster should be in the system encoding, but messages written to the 
log files should be in the encoding specified for logs, unless logging 
is being done to syslog, in which case it has to be in the system 
encoding after all...
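
The routing I have in mind could be sketched roughly like this (Python for brevity). The destination labels and the function name are hypothetical, chosen only to mirror the three cases above; they are not actual PostgreSQL settings or code.

```python
def encoding_for_destination(destination: str,
                             log_encoding: str,
                             system_encoding: str) -> str:
    """Hypothetical helper: pick the output encoding for one log destination."""
    if destination == "logfile":
        # Log files use the encoding configured for logs.
        return log_encoding
    if destination in ("stderr", "syslog"):
        # Terminal output and syslog both follow the system encoding.
        return system_encoding
    raise ValueError(f"unknown log destination: {destination!r}")
```

The awkward part is that the same message may need to be encoded twice when it goes to more than one destination.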

And, of course, the postmaster still doesn't know how to log anything it 
might emit before reading postgresql.conf, because it doesn't know what 
encoding to use.

I still wonder if, rather than making this configurable, the right 
choice is to force logging to UTF-8 (with BOM) across the board, right 
from postmaster startup. It's consistent, all messages in all other 
encodings can be converted to UTF-8 for logging, it's platform 
independent, and text editors etc tend to understand and recognise UTF-8 
especially with the BOM.
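
A rough sketch of what "always UTF-8, with BOM" would mean for the writer, again in Python and under assumptions: append_log_record is a made-up name, the log is modelled as an in-memory bytearray rather than a real file, and each message is assumed to arrive tagged with the encoding it was produced in.

```python
import codecs


def append_log_record(log: bytearray, message: bytes, source_encoding: str) -> None:
    """Hypothetical helper: append one message to a UTF-8 log buffer."""
    if not log:
        # Write the BOM exactly once, at the very start of the log.
        log += codecs.BOM_UTF8
    # Convert from whatever encoding the message was produced in.
    text = message.decode(source_encoding, errors="replace")
    log += text.encode("utf-8") + b"\n"


log = bytearray()
append_log_record(log, "café".encode("latin-1"), "latin-1")
append_log_record(log, "広告掲載".encode("shift_jis"), "shift_jis")
```

Messages from databases with different server encodings end up in one consistently encoded file, which is the whole point.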

Unfortunately, because many Unix utilities (grep etc) aren't encoding 
aware, that'll cause problems when people go to search log files. For 
example, for "広告掲載" the log files will contain the UTF-8 bytes:
  \xe5\xba\x83\xe5\x91\x8a\xe6\x8e\xb2\xe8\xbc\x89

but grep on a Shift-JIS system will be looking for:
  \x8d\x4c\x8d\x90\x8c\x66\x8d\xda

so it won't match.
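
The mismatch is easy to reproduce. This small Python sketch stands in for a byte-oriented grep; the log line prefix is invented for the example, and bytewise_match is a made-up helper, not a real tool.

```python
text = "広告掲載"

utf8_bytes = text.encode("utf-8")      # what a UTF-8 log file would contain
sjis_bytes = text.encode("shift_jis")  # what a Shift-JIS terminal would send

# A made-up log line, stored as UTF-8.
log_line = b"LOG:  statement: " + utf8_bytes + b"\n"


def bytewise_match(pattern: bytes, line: bytes) -> bool:
    """Compare raw bytes only, as an encoding-unaware tool would."""
    return pattern in line


# Searching with the Shift-JIS bytes finds nothing.
found_with_sjis = bytewise_match(sjis_bytes, log_line)

# Recoding the pattern to the log's encoding first makes the search work.
recoded_pattern = sjis_bytes.decode("shift_jis").encode("utf-8")
found_after_recoding = bytewise_match(recoded_pattern, log_line)
```

So the user would have to convert either the search pattern or the log before searching, which is exactly the sort of thing people won't know to do.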


Ugh. If only we could say "PostgreSQL requires a system locale with a 
UTF-8 encoding". Alas, I don't think that'd go down very well with 
packagers or installers. [Insert rant about how stupid it is that *nix 
systems still aren't all UTF-8 here].

-- 
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

