Re: BUG #5661: The character encoding in logfile is confusing. - Mailing list pgsql-hackers

From tkbysh2000@yahoo.co.jp
Subject Re: BUG #5661: The character encoding in logfile is confusing.
Date
Msg-id 20100922212552.93B2.A495B709@yahoo.co.jp
In response to Re: BUG #5661: The character encoding in logfile is confusing.  (Craig Ringer <craig@postnewspapers.com.au>)
List pgsql-hackers
Hi Craig,

Almost all Japanese software writes its log files in the encoding of the
server the software is running on. I'm not sure whether that is the best
way, but Japanese users take it for granted.
So I think Japanese users would hope that the PostgreSQL server behaves
the same way as other software, because many administrators in Japan are
familiar with and experienced in that approach.

On Unix, the user can specify the default character encoding at
installation time. Software can obtain it by reading the environment
variable $LANG, e.g.:
> % echo $LANG
> ja_JP.eucJP
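
As a rough illustration of that lookup, here is a minimal Python sketch
that pulls the codeset out of a POSIX-style locale name. The helper name
codeset_from_lang is made up for this example; it is not part of
PostgreSQL or of any library.

```python
import os


def codeset_from_lang(lang):
    """Return the codeset part of a POSIX locale name, or None if absent.

    Locale names look like language[_territory][.codeset][@modifier].
    """
    base = lang.split("@", 1)[0]      # drop an optional @modifier
    if "." not in base:
        return None                   # e.g. "C" or "ja_JP" carry no codeset
    return base.split(".", 1)[1]


if __name__ == "__main__":
    # Falls back to "C" when $LANG is unset.
    print(codeset_from_lang(os.environ.get("LANG", "C")))
```

Note that LC_ALL and LC_CTYPE, when set, override $LANG, so a real
program would more likely ask the C library for the effective encoding
(in Python, locale.getpreferredencoding()) than parse $LANG by hand.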

On Japanese Windows, the default encoding is MS-932 (also called cp-932
or Windows-31J). This default is fixed.
MS-932 is almost the same as Shift-JIS, but a very small number of
characters have different character codes in MS-932 and Shift-JIS, and
Shift-JIS lacks some characters that MS-932 has.
This is a very important issue.
It has caused a lot of related bugs, e.g.:
http://bugs.mysql.com/bug.php?id=7607
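
To make the MS-932 / Shift-JIS difference concrete, here is a small
Python sketch. It assumes Python's built-in "cp932" and "shift_jis"
codecs are a fair stand-in for the two encodings; the two characters
below are examples I picked, not ones taken from the MySQL report.

```python
# Python's "shift_jis" codec follows JIS X 0208; "cp932" is Microsoft's variant.

# 1. Same bytes, different character: 0x81 0x60.
raw = b"\x81\x60"
as_sjis = raw.decode("shift_jis")    # WAVE DASH, U+301C
as_cp932 = raw.decode("cp932")       # FULLWIDTH TILDE, U+FF5E

# 2. A character that MS-932 has and Shift-JIS lacks:
#    CIRCLED DIGIT ONE, U+2460.
circled_one = "\u2460"
in_cp932 = circled_one.encode("cp932")
try:
    circled_one.encode("shift_jis")
    in_sjis = True
except UnicodeEncodeError:
    in_sjis = False

print(hex(ord(as_sjis)), hex(ord(as_cp932)), in_cp932, in_sjis)
```

So text that round-trips cleanly through one of the two encodings can
change or fail when software treats it as the other one.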

And if PostgreSQL could be configured to write the log file with raw
(untranslated) English messages, some users would choose that if
translating the messages into Japanese has some cost. Some
administrators in Japan don't mind reading English messages. (Much
software is not friendly to non-English users. Many Japanese users are
surprised and impressed that PostgreSQL emits Japanese messages in its
log file.)

Thank you.

=Mikio

-- <tkbysh2000@yahoo.co.jp>


On Wed, 22 Sep 2010 19:25:47 +0800
Craig Ringer <craig@postnewspapers.com.au> wrote:

> On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
> > On ons, 2010-09-22 at 16:25 +0800, Craig Ringer wrote:
> >> A single log file should obviously be in a single encoding, it's the
> >> only sane way to do things. But which encoding is it in? And which
> >> *should* it be in?
> >
> > We need to produce the log output in the server encoding, because that's
> > how we need to send it to the client.
> 
> That doesn't mean it can't be recoded for writing to the log file, 
> though. Perhaps it needs to be. It should be reasonably practical to 
> detect when the database and log encoding are the same and avoid the 
> transcoding performance penalty, not that it's big anyway.
> 
> > If you have different databases
> > with different server encodings, you will get inconsistently encoded
> > output in the log file.
> 
> I don't think that's an OK answer, myself. Mixed encodings with no 
> delineation in one file = bug as far as I'm concerned. You can't even 
> rely on being able to search the log anymore. You'll only get away with 
> it when using languages that mostly stick to the 7-bit ASCII subset, so 
> most text is still readable; with most other languages you'll get logs 
> full of what looks to the user like garbage.
> 
> > Conceivably, we could create a configuration option that specifies the
> > encoding for the log file, and strings are recoded from whatever gettext()
> > produces to the specified encoding.  initdb could initialize that option
> > suitably, so in most cases users won't have to do anything.
> 
> Yep, I tend to think that'd be the right way to go. It'd still be a bit 
> of a pain, though, as messages written to stdout/stderr by the 
> postmaster should be in the system encoding, but messages written to the 
> log files should be in the encoding specified for logs, unless logging 
> is being done to syslog, in which case it has to be in the system 
> encoding after all...
> 
> And, of course, the postmaster still doesn't know how to log anything it 
> might emit before reading postgresql.conf, because it doesn't know what 
> encoding to use.
> 
> I still wonder if, rather than making this configurable, the right 
> choice is to force logging to UTF-8 (with BOM) across the board, right 
> from postmaster startup. It's consistent, all messages in all other 
> encodings can be converted to UTF-8 for logging, it's platform 
> independent, and text editors etc tend to understand and recognise UTF-8 
> especially with the BOM.
> 
> Unfortunately, because many unix utilities (grep etc) aren't encoding 
> aware, that'll cause problems when people go to search log files. For 
> (eg) "広告掲載" The log files will contain the utf-8 bytes:
> 
>    \xe5\xba\x83\xe5\x91\x8a\xe6\x8e\xb2\xe8\xbc\x89
> 
> but grep on a shift-jis system will be looking for:
> 
>    \x8d\x4c\x8d\x90\x8cf\x8d\xda
> 
> so it won't match.
> 
> 
> Ugh. If only we could say "PostgreSQL requires a system locale with a 
> UTF-8 encoding". Alas, I don't think that'd go down very well with 
> packagers or installers. [Insert rant about how stupid it is that *nix 
> systems still aren't all UTF-8 here].
> 
> -- 
> Craig Ringer
> 
> Tech-related writing at http://soapyfrogs.blogspot.com/
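
The grep mismatch described in the quoted message can be reproduced with
a short Python sketch. The log line below is invented for the example,
and Python's "shift_jis" codec is assumed to match what a Shift-JIS
terminal would send to grep.

```python
text = "広告掲載"

utf8_bytes = text.encode("utf-8")        # what a UTF-8 log file would contain
sjis_bytes = text.encode("shift_jis")    # what a Shift-JIS terminal would pass to grep

# A made-up log line, stored as UTF-8 bytes.
log_line = "LOG:  ".encode("ascii") + utf8_bytes + b"\n"

# Byte-wise search, which is what an encoding-unaware grep does.
naive_hit = sjis_bytes in log_line

# Recoding the pattern into the log's encoding first makes the search work.
recoded_hit = sjis_bytes.decode("shift_jis").encode("utf-8") in log_line

print(utf8_bytes, sjis_bytes, naive_hit, recoded_hit)
```

The second search only succeeds because the pattern was converted to the
log file's encoding first, which is the step a plain grep does not do.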



