Thread: Encoding issues in console and eventlog on win32

Encoding issues in console and eventlog on win32

From

Itagaki Takahiro

Date:

16 August 2009, 22:53:52

We can choose different encodings from platform-dependent one
for database, but postgres writes serverlogs in the database encoding.
As the result, serverlogs are filled with broken characters.

The problem could occur on all platforms, however, there is a solution
for win32. Since Windows supports wide characters to write logs, we can
convert log texts => UTF-8 => UTF-16 and pass them to WriteConsoleW()
and ReportEventW().

Especially in Japan, encoding troubles on Windows are unavoidable
because postgres doesn't support Shift-JIS for database encoding,
that is the native encoding for Windows Japanese edition.

If we also want to support the same functionality on non-win32 platform,
we might need non-throwable version of pg_do_encoding_conversion():

    log_message_to_write = pg_do_encoding_conversion_nothrow(
        log_message_in_database_encoding,
        GetDatabaseEncoding() /* as src_encoding */,
        GetPlatformEncoding() /* as dst_encoding */)

and pass the result to stderr and syslog. But it requires major rewrites
of conversion functions, so I'd like to submit a solution only for win32
for now. Also, the issue is not so serious on non-win32 platforms because
we can choose UTF-8 or EUC_* on those platforms.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment

eventlog-20090817.patch

Re: Encoding issues in console and eventlog on win32

From

Heikki Linnakangas

Date:

14 September 2009, 08:08:32

Itagaki Takahiro wrote:
> We can choose different encodings from platform-dependent one
> for database, but postgres writes serverlogs in the database encoding.
> As the result, serverlogs are filled with broken characters.
> 
> The problem could occur on all platforms, however, there is a solution
> for win32. Since Windows supports wide characters to write logs, we can
> convert log texts => UTF-8 => UTF-16 and pass them to WriteConsoleW()
> and ReportEventW().
> 
> Especially in Japan, encoding troubles on Windows are unavoidable
> because postgres doesn't support Shift-JIS for database encoding,
> that is the native encoding for Windows Japanese edition.
> 
> If we also want to support the same functionality on non-win32 platform,
> we might need non-throwable version of pg_do_encoding_conversion():
> 
>     log_message_to_write = pg_do_encoding_conversion_nothrow(
>         log_message_in_database_encoding,
>         GetDatabaseEncoding() /* as src_encoding */,
>         GetPlatformEncoding() /* as dst_encoding */)
> 
> and pass the result to stderr and syslog. But it requires major rewrites
> of conversion functions, so I'd like to submit a solution only for win32
> for now. Also, the issue is not so serious on non-win32 platforms because
> we can choose UTF-8 or EUC_* on those platforms.

Something like that seems reasonable for the Windows event log; that is
clearly supposed to be written using a specific encoding. With the log
files, we're more free to do what we want, and IMHO we shouldn't put a
Windows-specific hack there because as you say we have the same problem
on all platforms.

There's no guarantee that conversion to UTF-8 won't fail, so this isn't
totally risk-free on Windows either. Theoretically, MultiByteToWideChar
could fail too (the patch neglects to check for that), although I
suppose it can't really happen for UTF-8 -> UTF-16 conversion.

Can't we use MultiByteToWideChar() to convert directly to the required
encoding, avoiding the double conversion?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com

Re: Encoding issues in console and eventlog on win32

From

Itagaki Takahiro

Date:

15 September 2009, 00:49:43

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> Can't we use MultiByteToWideChar() to convert directly to the required
> encoding, avoiding the double conversion?

Here is an updated version of the patch.
I use direct conversion in pgwin32_toUTF16() if a corresponding codepage
is available. If not available, I still use double conversion.

Now pgwin32_toUTF16() is exported from mbutil.c. I used the function
in following parts, although the main target of the patch is eventlog.

  * WriteConsoleW() - write unredirected stderr log.
  * ReportEventW()  - write evenlog.
  * CreateFileW()   - open non-ascii filename (ex. COPY TO/FROM 'mb-path').

This approach is only available for Windows because any other platform
don't support locale-independent and wide-character-based system calls.
Other platforms require a different approach, but even then we'd still
better have win32-specific routines because UTF16 is the native encoding
in Windows.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment

eventlog-20090915.patch

Re: Encoding issues in console and eventlog on win32

From

Josh Williams

Date:

21 September 2009, 01:00:45

On Tue, 2009-09-15 at 12:49 +0900, Itagaki Takahiro wrote:
> Here is an updated version of the patch.

This is a review of the Eventlog encoding on Windows patch:
http://archives.postgresql.org/message-id/20090915123243.9C59.52131E4D@oss.ntt.co.jp

Purpose & Format
================
This patch is designed to coerce log messages to a specific encoding.
It's currently only targeted at the win32 port, where the logs are
written in UTF-16.

The patch applies cleanly.  It doesn't include any documentation updates
or additional regression tests.  A comment in the documentation that
logs on Windows will go through an encoding conversion if appropriate
might be nice, though.

Initial Run
===========
To (hopefully) properly test I initdb'd a couple directories under
different locales.  I then ran a few statements designed to generate
event log messages showing characters in a different encoding:
SELECT E'\xF0'::int;

The unpatched backend generated event log message showing only the byte
value interpreted as the same character each time in the system default
encoding.

With the patch in place the event log message showed the character
correctly for each of the different encodings.

I haven't tried any performance testing against it.

Concurrent Development Issues
=============================
On a hunch, tried applying the "syslogger infrastructure changes" at the
same time.  They conflict on elog.c.  Not sure if we're supposed to
check for that, but thought I'd point it out. :)

Editorial
=========
The problem seems to stem from PG and Windows each having a few
encodings the other won't understand, or at least don't immediately
support.  So log messages back to the system from its perspective
contain incorrect or broken characters.  I'm not sure this is as much of
a problem on other platforms, though, where the database encoding
typically doesn't have any trouble matching the system's; would it be
worth pursuing beyond the win32 port?

I'm not too familiar with alternate character sets...  I would assume if
there's a code page supported on win32 it'll naturally support
conversion to UTF-16 on the platform, but is there any time this could
fail?  What about the few encodings that it doesn't directly support,
which need a conversion to UTF-8 first?

Maybe someone with more familiarity with encoding conversion issues
could comment on that?  Otherwise I think this is ready to be bumped up
for committer review.

- Josh Williams

Re: Encoding issues in console and eventlog on win32

From

Magnus Hagander

Date:

06 October 2009, 09:09:22

2009/9/15 Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp>:
>
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>
>> Can't we use MultiByteToWideChar() to convert directly to the required
>> encoding, avoiding the double conversion?
>
> Here is an updated version of the patch.
> I use direct conversion in pgwin32_toUTF16() if a corresponding codepage
> is available. If not available, I still use double conversion.
>
> Now pgwin32_toUTF16() is exported from mbutil.c. I used the function
> in following parts, although the main target of the patch is eventlog.
>
>  * WriteConsoleW() - write unredirected stderr log.
>  * ReportEventW()  - write evenlog.
>  * CreateFileW()   - open non-ascii filename (ex. COPY TO/FROM 'mb-path').
>
> This approach is only available for Windows because any other platform
> don't support locale-independent and wide-character-based system calls.
> Other platforms require a different approach, but even then we'd still
> better have win32-specific routines because UTF16 is the native encoding
> in Windows.

I did a quick check of this, and here are the things I would like to
have changed:

First of all, the change to port/open.c seems to be unrelated to the
rest, and should be a separate patch, correct? I'm sure there's a
usecase for it, but it's not actually  included in the patches
description, so I assume this was a mistake?

Per your own comments earlier, and in the code, what will happen if
pg_do_encoding_conversion() calls ereport()? Didn't you say we need a
non-throwing version of it?

pgwin32_toUTF16() needs error checking on the API calls, and needs to
do something reasonable if it fails. For example, it can fail because
of out of memory error. I suggest just returning the error code in
some way in that case, and have the callers fall back to logging in
the incorrect encoding - in a lot of cases that will produce an at
least partially readable message. A second message should also be
logged saying that the conversion failed - this needs to be done
directly with the eventlog API functions and not ereport, so we don't
end up in infinite recursion.

The encoding_to_codepage array needs to go in encnames.c, where other
such tables are. Perhaps it can even be integrated in pg_enc2name_tbl
as a separate field?

I don't have the time to clean this up right now, so if you have,
please do so and resubmit. If not, I can clean it up later and apply.

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/

Re: Encoding issues in console and eventlog on win32

From

Itagaki Takahiro

Date:

06 October 2009, 23:28:00

Magnus Hagander <magnus@hagander.net> wrote:

> First of all, the change to port/open.c seems to be unrelated to the
> rest, and should be a separate patch, correct? I'm sure there's a
> usecase for it, but it's not actually  included in the patches
> description, so I assume this was a mistake?

It was just a demo for pgwin32_toUTF16(). I'll remove this part from
the patch, but I think we also need to fix the encoding mismatch issue
in path strings. I'll re-submit for the next commitfest.

> Per your own comments earlier, and in the code, what will happen if
> pg_do_encoding_conversion() calls ereport()? Didn't you say we need a
> non-throwing version of it?

We are hard to use encoding conversion functions in logging routines
because they could throw errors if there are some unconvertable characters.
Non-throwing version will convert such characters into '?' or escaped form
(something like \888 or \xFF). If there where such infrastructure, we can
support "log_encoding" settings and convert messages in platform-dependent
encoding before writing to syslog or console.

> pgwin32_toUTF16() needs error checking on the API calls, and needs to
> do something reasonable if it fails.

Now it returns NULL and caller writes messages in the original encoding.
Also I added the following error checks before calling pgwin32_toUTF16()
    (errordata_stack_depth < ERRORDATA_STACK_SIZE - 1)
to avoid recursive errors, but I'm not sure it is really meaningful.
Please remove or rewrite this part if it is not a right way.

> The encoding_to_codepage array needs to go in encnames.c, where other
> such tables are. Perhaps it can even be integrated in pg_enc2name_tbl
> as a separate field?

I added pg_enc2name.codepage. Note that this field is needed only
on Windows, but now exported for all platforms. If you don't like
the useless field, the following macro could be a help.

#ifdef WIN32
#define def_enc2name(name, codepage)    { #name, PG_##name, codepage }
#else
#define def_enc2name(name, codepage)    { #name, PG_##name }
#endif
pg_enc2name pg_enc2name_tbl[] =
{
    def_enc2name(SQL_ASCII),
    def_enc2name(EUC_JP),
    ...

> I don't have the time to clean this up right now, so if you have,
> please do so and resubmit. If not, I can clean it up later and apply.

Patch attached.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment

eventlog_20091007.patch

Re: Encoding issues in console and eventlog on win32

From

Magnus Hagander

Date:

10 October 2009, 02:54:40

2009/10/7 Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp>:
>
> Magnus Hagander <magnus@hagander.net> wrote:
>> Per your own comments earlier, and in the code, what will happen if
>> pg_do_encoding_conversion() calls ereport()? Didn't you say we need a
>> non-throwing version of it?
>
> We are hard to use encoding conversion functions in logging routines
> because they could throw errors if there are some unconvertable characters.
> Non-throwing version will convert such characters into '?' or escaped form
> (something like \888 or \xFF). If there where such infrastructure, we can
> support "log_encoding" settings and convert messages in platform-dependent
> encoding before writing to syslog or console.

Right, which we don't have at this point. That would be very useful on
unix, i believe.

>> pgwin32_toUTF16() needs error checking on the API calls, and needs to
>> do something reasonable if it fails.
>
> Now it returns NULL and caller writes messages in the original encoding.

Seems reasonable. If encoding fails, I think that's the best we can do.

> Also I added the following error checks before calling pgwin32_toUTF16()
>    (errordata_stack_depth < ERRORDATA_STACK_SIZE - 1)
> to avoid recursive errors, but I'm not sure it is really meaningful.
> Please remove or rewrite this part if it is not a right way.

I'm not entirely sure either, but it looks like it could protect us
from getting into a tight loop on an error here.. Tom (or someone else
who knows that for sure :P),comments?

>> The encoding_to_codepage array needs to go in encnames.c, where other
>> such tables are. Perhaps it can even be integrated in pg_enc2name_tbl
>> as a separate field?
>
> I added pg_enc2name.codepage. Note that this field is needed only
> on Windows, but now exported for all platforms. If you don't like
> the useless field, the following macro could be a help.
> #ifdef WIN32
> #define def_enc2name(name, codepage)    { #name, PG_##name, codepage }
> #else
> #define def_enc2name(name, codepage)    { #name, PG_##name }
> #endif
> pg_enc2name pg_enc2name_tbl[] =
> {
>    def_enc2name(SQL_ASCII),
>    def_enc2name(EUC_JP),
>    ...

Yeah, I think that makes sense. It's not much data, but it's
completely unnecessary :-) I can make that change at commit.

One other question - you note that WriteConsoleW() "could fail if
stderr is redirected". Are you saying that it will always fail when
stderr is redirected, or only sometimes? If ony sometimes, do you know
under which conditions it happens?

If it's always, I assume this just means that the logfile will be in
the database encoding and not in UTF16? Is this what we want, or would
we like the logfile to also be in UTF16? If we can convert it to
UTF16, that would fix the case when you have different databases in
different encodings, wouldn't it? (Even if your editor, unlike the
console subsystem, can view the individual encoding you need, I bet it
can't deal with multiple encodings in the same file)

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/

Re: Encoding issues in console and eventlog on win32

From

Tom Lane

Date:

10 October 2009, 11:01:54

Magnus Hagander <magnus@hagander.net> writes:
> 2009/10/7 Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp>:
>> Also I added the following error checks before calling pgwin32_toUTF16()
>> � �(errordata_stack_depth < ERRORDATA_STACK_SIZE - 1)
>> to avoid recursive errors, but I'm not sure it is really meaningful.
>> Please remove or rewrite this part if it is not a right way.

> I'm not entirely sure either, but it looks like it could protect us
> from getting into a tight loop on an error here.. Tom (or someone else
> who knows that for sure :P),comments?

I haven't read the patch, but I'd suggest making any behavior changes
dependent on in_error_recursion_trouble(), rather than getting in bed
with internal implementation variables.
        regards, tom lane

Re: Encoding issues in console and eventlog on win32

From

Itagaki Takahiro

Date:

12 October 2009, 22:13:29

Magnus Hagander <magnus@hagander.net> wrote:

> One other question - you note that WriteConsoleW() "could fail if
> stderr is redirected". Are you saying that it will always fail when
> stderr is redirected, or only sometimes? If ony sometimes, do you know
> under which conditions it happens?

It will always fail if redirected. We can test the conditions using:
    pg_ctl start > result.log
So, the comment should be:
    /* WriteConsoleW always fails if stderr is redirected. */

I cleaned up the patch per comments. I hope this will be the final one ;-).

  * Use in_error_recursion_trouble() instead of own implementation.
  * Use def_enc2name() macro to avoid adding the codepage field
    on non-win32 platforms.
  * Fix a bug of calculation of result length.
  * Fix a memory leak on error handling path in pgwin32_toUTF16().


> If it's always, I assume this just means that the logfile will be in
> the database encoding and not in UTF16? Is this what we want, or would
> we like the logfile to also be in UTF16? If we can convert it to
> UTF16, that would fix the case when you have different databases in
> different encodings, wouldn't it? (Even if your editor, unlike the
> console subsystem, can view the individual encoding you need, I bet it
> can't deal with multiple encodings in the same file)

Sure, the logfile will be filled with mixed encoding strings,
that could happen in logfile and syslog on non-win32 platforms.
I think UTF8 is better than UTF16 for logfile encoding because
there are some text editors that do not support wide characters.
At any rate, the logfile encoding feature will come from another patch,
that might add "log_encoding" variable and work on any platforms.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment

eventlog_20091013.patch

Re: Encoding issues in console and eventlog on win32

From

Robert Haas

Date:

15 October 2009, 22:24:04

On Mon, Oct 12, 2009 at 9:13 PM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:
>
> Magnus Hagander <magnus@hagander.net> wrote:
>
>> One other question - you note that WriteConsoleW() "could fail if
>> stderr is redirected". Are you saying that it will always fail when
>> stderr is redirected, or only sometimes? If ony sometimes, do you know
>> under which conditions it happens?
>
> It will always fail if redirected. We can test the conditions using:
>    pg_ctl start > result.log
> So, the comment should be:
>    /* WriteConsoleW always fails if stderr is redirected. */
>
> I cleaned up the patch per comments. I hope this will be the final one ;-).
>
>  * Use in_error_recursion_trouble() instead of own implementation.
>  * Use def_enc2name() macro to avoid adding the codepage field
>    on non-win32 platforms.
>  * Fix a bug of calculation of result length.
>  * Fix a memory leak on error handling path in pgwin32_toUTF16().
>
>
>> If it's always, I assume this just means that the logfile will be in
>> the database encoding and not in UTF16? Is this what we want, or would
>> we like the logfile to also be in UTF16? If we can convert it to
>> UTF16, that would fix the case when you have different databases in
>> different encodings, wouldn't it? (Even if your editor, unlike the
>> console subsystem, can view the individual encoding you need, I bet it
>> can't deal with multiple encodings in the same file)
>
> Sure, the logfile will be filled with mixed encoding strings,
> that could happen in logfile and syslog on non-win32 platforms.
> I think UTF8 is better than UTF16 for logfile encoding because
> there are some text editors that do not support wide characters.
> At any rate, the logfile encoding feature will come from another patch,
> that might add "log_encoding" variable and work on any platforms.

Magnus has promised me on a stack of instant messages that he will
review this soon, but as he hasn't gotten to it yet, I am moving it to
the next CommitFest.

...Robert

Re: Encoding issues in console and eventlog on win32

From

Bruce Momjian

Date:

16 October 2009, 12:22:19

Robert Haas wrote:
> > Sure, the logfile will be filled with mixed encoding strings,
> > that could happen in logfile and syslog on non-win32 platforms.
> > I think UTF8 is better than UTF16 for logfile encoding because
> > there are some text editors that do not support wide characters.
> > At any rate, the logfile encoding feature will come from another patch,
> > that might add "log_encoding" variable and work on any platforms.
> 
> Magnus has promised me on a stack of instant messages that he will
> review this soon, but as he hasn't gotten to it yet, I am moving it to
> the next CommitFest.

I am with Magnus today and will make sure it gets done.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: Encoding issues in console and eventlog on win32

From

Magnus Hagander

Date:

16 October 2009, 21:25:07

2009/10/13 Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp>:
>
> Magnus Hagander <magnus@hagander.net> wrote:
>
>> One other question - you note that WriteConsoleW() "could fail if
>> stderr is redirected". Are you saying that it will always fail when
>> stderr is redirected, or only sometimes? If ony sometimes, do you know
>> under which conditions it happens?
>
> It will always fail if redirected. We can test the conditions using:
>    pg_ctl start > result.log
> So, the comment should be:
>    /* WriteConsoleW always fails if stderr is redirected. */

Ok, fair enough. We already have a variable for that though - it's
called redirection_done. I think it does what's necessary - I have
used that one in my version of the patch. Please verify that it works
in your environment.


> I cleaned up the patch per comments. I hope this will be the final one ;-).
>
>  * Use in_error_recursion_trouble() instead of own implementation.
>  * Use def_enc2name() macro to avoid adding the codepage field
>    on non-win32 platforms.

Per previous email, I had done this in my version of the patch, so it
looks slightly different than yours, but it has the same
functionality.


>  * Fix a bug of calculation of result length.

Where exactly is this one? I can't find it compared to my code, but
that could just be out-of-timezone-brain speaking :-)

>  * Fix a memory leak on error handling path in pgwin32_toUTF16().

Missed that one, thanks!


>> If it's always, I assume this just means that the logfile will be in
>> the database encoding and not in UTF16? Is this what we want, or would
>> we like the logfile to also be in UTF16? If we can convert it to
>> UTF16, that would fix the case when you have different databases in
>> different encodings, wouldn't it? (Even if your editor, unlike the
>> console subsystem, can view the individual encoding you need, I bet it
>> can't deal with multiple encodings in the same file)
>
> Sure, the logfile will be filled with mixed encoding strings,
> that could happen in logfile and syslog on non-win32 platforms.
> I think UTF8 is better than UTF16 for logfile encoding because
> there are some text editors that do not support wide characters.

Don't most text editors on Windows do UTF16? In particular, I'd expect
more of them to do UTF16 than UTF8, but I could be wrong?

> At any rate, the logfile encoding feature will come from another patch,
> that might add "log_encoding" variable and work on any platforms.

Ok, good. Particularly the "other platform" is the winning argument.

So, what I believe is the latest version of the patch applied. Please
point out if I made a mistake in my changes against yours.

Sorry about the delay :(


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/