Thread: Encoding issues in console and eventlog on win32
We can choose different encodings from platform-dependent one for database, but postgres writes serverlogs in the database encoding. As the result, serverlogs are filled with broken characters. The problem could occur on all platforms, however, there is a solution for win32. Since Windows supports wide characters to write logs, we can convert log texts => UTF-8 => UTF-16 and pass them to WriteConsoleW() and ReportEventW(). Especially in Japan, encoding troubles on Windows are unavoidable because postgres doesn't support Shift-JIS for database encoding, that is the native encoding for Windows Japanese edition. If we also want to support the same functionality on non-win32 platform, we might need non-throwable version of pg_do_encoding_conversion(): log_message_to_write = pg_do_encoding_conversion_nothrow( log_message_in_database_encoding, GetDatabaseEncoding() /* as src_encoding */, GetPlatformEncoding() /* as dst_encoding */) and pass the result to stderr and syslog. But it requires major rewrites of conversion functions, so I'd like to submit a solution only for win32 for now. Also, the issue is not so serious on non-win32 platforms because we can choose UTF-8 or EUC_* on those platforms. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Attachment
Itagaki Takahiro wrote: > We can choose different encodings from platform-dependent one > for database, but postgres writes serverlogs in the database encoding. > As the result, serverlogs are filled with broken characters. > > The problem could occur on all platforms, however, there is a solution > for win32. Since Windows supports wide characters to write logs, we can > convert log texts => UTF-8 => UTF-16 and pass them to WriteConsoleW() > and ReportEventW(). > > Especially in Japan, encoding troubles on Windows are unavoidable > because postgres doesn't support Shift-JIS for database encoding, > that is the native encoding for Windows Japanese edition. > > If we also want to support the same functionality on non-win32 platform, > we might need non-throwable version of pg_do_encoding_conversion(): > > log_message_to_write = pg_do_encoding_conversion_nothrow( > log_message_in_database_encoding, > GetDatabaseEncoding() /* as src_encoding */, > GetPlatformEncoding() /* as dst_encoding */) > > and pass the result to stderr and syslog. But it requires major rewrites > of conversion functions, so I'd like to submit a solution only for win32 > for now. Also, the issue is not so serious on non-win32 platforms because > we can choose UTF-8 or EUC_* on those platforms. Something like that seems reasonable for the Windows event log; that is clearly supposed to be written using a specific encoding. With the log files, we're more free to do what we want, and IMHO we shouldn't put a Windows-specific hack there because as you say we have the same problem on all platforms. There's no guarantee that conversion to UTF-8 won't fail, so this isn't totally risk-free on Windows either. Theoretically, MultiByteToWideChar could fail too (the patch neglects to check for that), although I suppose it can't really happen for UTF-8 -> UTF-16 conversion. Can't we use MultiByteToWideChar() to convert directly to the required encoding, avoiding the double conversion? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Can't we use MultiByteToWideChar() to convert directly to the required > encoding, avoiding the double conversion? Here is an updated version of the patch. I use direct conversion in pgwin32_toUTF16() if a corresponding codepage is available. If not available, I still use double conversion. Now pgwin32_toUTF16() is exported from mbutil.c. I used the function in following parts, although the main target of the patch is eventlog. * WriteConsoleW() - write unredirected stderr log. * ReportEventW() - write evenlog. * CreateFileW() - open non-ascii filename (ex. COPY TO/FROM 'mb-path'). This approach is only available for Windows because any other platform don't support locale-independent and wide-character-based system calls. Other platforms require a different approach, but even then we'd still better have win32-specific routines because UTF16 is the native encoding in Windows. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Attachment
On Tue, 2009-09-15 at 12:49 +0900, Itagaki Takahiro wrote: > Here is an updated version of the patch. This is a review of the Eventlog encoding on Windows patch: http://archives.postgresql.org/message-id/20090915123243.9C59.52131E4D@oss.ntt.co.jp Purpose & Format ================ This patch is designed to coerce log messages to a specific encoding. It's currently only targeted at the win32 port, where the logs are written in UTF-16. The patch applies cleanly. It doesn't include any documentation updates or additional regression tests. A comment in the documentation that logs on Windows will go through an encoding conversion if appropriate might be nice, though. Initial Run =========== To (hopefully) properly test I initdb'd a couple directories under different locales. I then ran a few statements designed to generate event log messages showing characters in a different encoding: SELECT E'\xF0'::int; The unpatched backend generated event log message showing only the byte value interpreted as the same character each time in the system default encoding. With the patch in place the event log message showed the character correctly for each of the different encodings. I haven't tried any performance testing against it. Concurrent Development Issues ============================= On a hunch, tried applying the "syslogger infrastructure changes" at the same time. They conflict on elog.c. Not sure if we're supposed to check for that, but thought I'd point it out. :) Editorial ========= The problem seems to stem from PG and Windows each having a few encodings the other won't understand, or at least don't immediately support. So log messages back to the system from its perspective contain incorrect or broken characters. I'm not sure this is as much of a problem on other platforms, though, where the database encoding typically doesn't have any trouble matching the system's; would it be worth pursuing beyond the win32 port? I'm not too familiar with alternate character sets... I would assume if there's a code page supported on win32 it'll naturally support conversion to UTF-16 on the platform, but is there any time this could fail? What about the few encodings that it doesn't directly support, which need a conversion to UTF-8 first? Maybe someone with more familiarity with encoding conversion issues could comment on that? Otherwise I think this is ready to be bumped up for committer review. - Josh Williams
2009/9/15 Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp>: > > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > >> Can't we use MultiByteToWideChar() to convert directly to the required >> encoding, avoiding the double conversion? > > Here is an updated version of the patch. > I use direct conversion in pgwin32_toUTF16() if a corresponding codepage > is available. If not available, I still use double conversion. > > Now pgwin32_toUTF16() is exported from mbutil.c. I used the function > in following parts, although the main target of the patch is eventlog. > > * WriteConsoleW() - write unredirected stderr log. > * ReportEventW() - write evenlog. > * CreateFileW() - open non-ascii filename (ex. COPY TO/FROM 'mb-path'). > > This approach is only available for Windows because any other platform > don't support locale-independent and wide-character-based system calls. > Other platforms require a different approach, but even then we'd still > better have win32-specific routines because UTF16 is the native encoding > in Windows. I did a quick check of this, and here are the things I would like to have changed: First of all, the change to port/open.c seems to be unrelated to the rest, and should be a separate patch, correct? I'm sure there's a usecase for it, but it's not actually included in the patches description, so I assume this was a mistake? Per your own comments earlier, and in the code, what will happen if pg_do_encoding_conversion() calls ereport()? Didn't you say we need a non-throwing version of it? pgwin32_toUTF16() needs error checking on the API calls, and needs to do something reasonable if it fails. For example, it can fail because of out of memory error. I suggest just returning the error code in some way in that case, and have the callers fall back to logging in the incorrect encoding - in a lot of cases that will produce an at least partially readable message. A second message should also be logged saying that the conversion failed - this needs to be done directly with the eventlog API functions and not ereport, so we don't end up in infinite recursion. The encoding_to_codepage array needs to go in encnames.c, where other such tables are. Perhaps it can even be integrated in pg_enc2name_tbl as a separate field? I don't have the time to clean this up right now, so if you have, please do so and resubmit. If not, I can clean it up later and apply. -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> wrote: > First of all, the change to port/open.c seems to be unrelated to the > rest, and should be a separate patch, correct? I'm sure there's a > usecase for it, but it's not actually included in the patches > description, so I assume this was a mistake? It was just a demo for pgwin32_toUTF16(). I'll remove this part from the patch, but I think we also need to fix the encoding mismatch issue in path strings. I'll re-submit for the next commitfest. > Per your own comments earlier, and in the code, what will happen if > pg_do_encoding_conversion() calls ereport()? Didn't you say we need a > non-throwing version of it? We are hard to use encoding conversion functions in logging routines because they could throw errors if there are some unconvertable characters. Non-throwing version will convert such characters into '?' or escaped form (something like \888 or \xFF). If there where such infrastructure, we can support "log_encoding" settings and convert messages in platform-dependent encoding before writing to syslog or console. > pgwin32_toUTF16() needs error checking on the API calls, and needs to > do something reasonable if it fails. Now it returns NULL and caller writes messages in the original encoding. Also I added the following error checks before calling pgwin32_toUTF16() (errordata_stack_depth < ERRORDATA_STACK_SIZE - 1) to avoid recursive errors, but I'm not sure it is really meaningful. Please remove or rewrite this part if it is not a right way. > The encoding_to_codepage array needs to go in encnames.c, where other > such tables are. Perhaps it can even be integrated in pg_enc2name_tbl > as a separate field? I added pg_enc2name.codepage. Note that this field is needed only on Windows, but now exported for all platforms. If you don't like the useless field, the following macro could be a help. #ifdef WIN32 #define def_enc2name(name, codepage) { #name, PG_##name, codepage } #else #define def_enc2name(name, codepage) { #name, PG_##name } #endif pg_enc2name pg_enc2name_tbl[] = { def_enc2name(SQL_ASCII), def_enc2name(EUC_JP), ... > I don't have the time to clean this up right now, so if you have, > please do so and resubmit. If not, I can clean it up later and apply. Patch attached. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Attachment
2009/10/7 Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp>: > > Magnus Hagander <magnus@hagander.net> wrote: >> Per your own comments earlier, and in the code, what will happen if >> pg_do_encoding_conversion() calls ereport()? Didn't you say we need a >> non-throwing version of it? > > We are hard to use encoding conversion functions in logging routines > because they could throw errors if there are some unconvertable characters. > Non-throwing version will convert such characters into '?' or escaped form > (something like \888 or \xFF). If there where such infrastructure, we can > support "log_encoding" settings and convert messages in platform-dependent > encoding before writing to syslog or console. Right, which we don't have at this point. That would be very useful on unix, i believe. >> pgwin32_toUTF16() needs error checking on the API calls, and needs to >> do something reasonable if it fails. > > Now it returns NULL and caller writes messages in the original encoding. Seems reasonable. If encoding fails, I think that's the best we can do. > Also I added the following error checks before calling pgwin32_toUTF16() > (errordata_stack_depth < ERRORDATA_STACK_SIZE - 1) > to avoid recursive errors, but I'm not sure it is really meaningful. > Please remove or rewrite this part if it is not a right way. I'm not entirely sure either, but it looks like it could protect us from getting into a tight loop on an error here.. Tom (or someone else who knows that for sure :P),comments? >> The encoding_to_codepage array needs to go in encnames.c, where other >> such tables are. Perhaps it can even be integrated in pg_enc2name_tbl >> as a separate field? > > I added pg_enc2name.codepage. Note that this field is needed only > on Windows, but now exported for all platforms. If you don't like > the useless field, the following macro could be a help. > #ifdef WIN32 > #define def_enc2name(name, codepage) { #name, PG_##name, codepage } > #else > #define def_enc2name(name, codepage) { #name, PG_##name } > #endif > pg_enc2name pg_enc2name_tbl[] = > { > def_enc2name(SQL_ASCII), > def_enc2name(EUC_JP), > ... Yeah, I think that makes sense. It's not much data, but it's completely unnecessary :-) I can make that change at commit. One other question - you note that WriteConsoleW() "could fail if stderr is redirected". Are you saying that it will always fail when stderr is redirected, or only sometimes? If ony sometimes, do you know under which conditions it happens? If it's always, I assume this just means that the logfile will be in the database encoding and not in UTF16? Is this what we want, or would we like the logfile to also be in UTF16? If we can convert it to UTF16, that would fix the case when you have different databases in different encodings, wouldn't it? (Even if your editor, unlike the console subsystem, can view the individual encoding you need, I bet it can't deal with multiple encodings in the same file) -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > 2009/10/7 Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp>: >> Also I added the following error checks before calling pgwin32_toUTF16() >> � �(errordata_stack_depth < ERRORDATA_STACK_SIZE - 1) >> to avoid recursive errors, but I'm not sure it is really meaningful. >> Please remove or rewrite this part if it is not a right way. > I'm not entirely sure either, but it looks like it could protect us > from getting into a tight loop on an error here.. Tom (or someone else > who knows that for sure :P),comments? I haven't read the patch, but I'd suggest making any behavior changes dependent on in_error_recursion_trouble(), rather than getting in bed with internal implementation variables. regards, tom lane
Magnus Hagander <magnus@hagander.net> wrote: > One other question - you note that WriteConsoleW() "could fail if > stderr is redirected". Are you saying that it will always fail when > stderr is redirected, or only sometimes? If ony sometimes, do you know > under which conditions it happens? It will always fail if redirected. We can test the conditions using: pg_ctl start > result.log So, the comment should be: /* WriteConsoleW always fails if stderr is redirected. */ I cleaned up the patch per comments. I hope this will be the final one ;-). * Use in_error_recursion_trouble() instead of own implementation. * Use def_enc2name() macro to avoid adding the codepage field on non-win32 platforms. * Fix a bug of calculation of result length. * Fix a memory leak on error handling path in pgwin32_toUTF16(). > If it's always, I assume this just means that the logfile will be in > the database encoding and not in UTF16? Is this what we want, or would > we like the logfile to also be in UTF16? If we can convert it to > UTF16, that would fix the case when you have different databases in > different encodings, wouldn't it? (Even if your editor, unlike the > console subsystem, can view the individual encoding you need, I bet it > can't deal with multiple encodings in the same file) Sure, the logfile will be filled with mixed encoding strings, that could happen in logfile and syslog on non-win32 platforms. I think UTF8 is better than UTF16 for logfile encoding because there are some text editors that do not support wide characters. At any rate, the logfile encoding feature will come from another patch, that might add "log_encoding" variable and work on any platforms. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Attachment
On Mon, Oct 12, 2009 at 9:13 PM, Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote: > > Magnus Hagander <magnus@hagander.net> wrote: > >> One other question - you note that WriteConsoleW() "could fail if >> stderr is redirected". Are you saying that it will always fail when >> stderr is redirected, or only sometimes? If ony sometimes, do you know >> under which conditions it happens? > > It will always fail if redirected. We can test the conditions using: > pg_ctl start > result.log > So, the comment should be: > /* WriteConsoleW always fails if stderr is redirected. */ > > I cleaned up the patch per comments. I hope this will be the final one ;-). > > * Use in_error_recursion_trouble() instead of own implementation. > * Use def_enc2name() macro to avoid adding the codepage field > on non-win32 platforms. > * Fix a bug of calculation of result length. > * Fix a memory leak on error handling path in pgwin32_toUTF16(). > > >> If it's always, I assume this just means that the logfile will be in >> the database encoding and not in UTF16? Is this what we want, or would >> we like the logfile to also be in UTF16? If we can convert it to >> UTF16, that would fix the case when you have different databases in >> different encodings, wouldn't it? (Even if your editor, unlike the >> console subsystem, can view the individual encoding you need, I bet it >> can't deal with multiple encodings in the same file) > > Sure, the logfile will be filled with mixed encoding strings, > that could happen in logfile and syslog on non-win32 platforms. > I think UTF8 is better than UTF16 for logfile encoding because > there are some text editors that do not support wide characters. > At any rate, the logfile encoding feature will come from another patch, > that might add "log_encoding" variable and work on any platforms. Magnus has promised me on a stack of instant messages that he will review this soon, but as he hasn't gotten to it yet, I am moving it to the next CommitFest. ...Robert
Robert Haas wrote: > > Sure, the logfile will be filled with mixed encoding strings, > > that could happen in logfile and syslog on non-win32 platforms. > > I think UTF8 is better than UTF16 for logfile encoding because > > there are some text editors that do not support wide characters. > > At any rate, the logfile encoding feature will come from another patch, > > that might add "log_encoding" variable and work on any platforms. > > Magnus has promised me on a stack of instant messages that he will > review this soon, but as he hasn't gotten to it yet, I am moving it to > the next CommitFest. I am with Magnus today and will make sure it gets done. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
2009/10/13 Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp>: > > Magnus Hagander <magnus@hagander.net> wrote: > >> One other question - you note that WriteConsoleW() "could fail if >> stderr is redirected". Are you saying that it will always fail when >> stderr is redirected, or only sometimes? If ony sometimes, do you know >> under which conditions it happens? > > It will always fail if redirected. We can test the conditions using: > pg_ctl start > result.log > So, the comment should be: > /* WriteConsoleW always fails if stderr is redirected. */ Ok, fair enough. We already have a variable for that though - it's called redirection_done. I think it does what's necessary - I have used that one in my version of the patch. Please verify that it works in your environment. > I cleaned up the patch per comments. I hope this will be the final one ;-). > > * Use in_error_recursion_trouble() instead of own implementation. > * Use def_enc2name() macro to avoid adding the codepage field > on non-win32 platforms. Per previous email, I had done this in my version of the patch, so it looks slightly different than yours, but it has the same functionality. > * Fix a bug of calculation of result length. Where exactly is this one? I can't find it compared to my code, but that could just be out-of-timezone-brain speaking :-) > * Fix a memory leak on error handling path in pgwin32_toUTF16(). Missed that one, thanks! >> If it's always, I assume this just means that the logfile will be in >> the database encoding and not in UTF16? Is this what we want, or would >> we like the logfile to also be in UTF16? If we can convert it to >> UTF16, that would fix the case when you have different databases in >> different encodings, wouldn't it? (Even if your editor, unlike the >> console subsystem, can view the individual encoding you need, I bet it >> can't deal with multiple encodings in the same file) > > Sure, the logfile will be filled with mixed encoding strings, > that could happen in logfile and syslog on non-win32 platforms. > I think UTF8 is better than UTF16 for logfile encoding because > there are some text editors that do not support wide characters. Don't most text editors on Windows do UTF16? In particular, I'd expect more of them to do UTF16 than UTF8, but I could be wrong? > At any rate, the logfile encoding feature will come from another patch, > that might add "log_encoding" variable and work on any platforms. Ok, good. Particularly the "other platform" is the winning argument. So, what I believe is the latest version of the patch applied. Please point out if I made a mistake in my changes against yours. Sorry about the delay :( -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/