Thread: BUG #6510: A simple prompt is displayed using wrong charset
The following bug has been logged on the website:

Bug reference: 6510
Logged by: Alexander LAW
Email address: exclusion@gmail.com
PostgreSQL version: 9.1.3
Operating system: Windows
Description:

I'm using PostgreSQL on Windows with the Russian locale and get unreadable messages when the postgres utilities prompt me for input. Please look at the screenshot: http://oi44.tinypic.com/aotje8.jpg (psql writes an unreadable message prompting for the password.) But at the same time the following message (WARNING) is displayed right.

I believe it's related to setlocale and the difference between the OEM and ANSI encodings, which we have in Windows with the Russian locale. The startup code of psql sets the locale with the call setlocale(LC_ALL, "") and the MSDN documentation says that the call: Sets the locale to the default, which is the user-default ANSI code page obtained from the operating system. After the call, all the strings printed with printf(stdout) go through the ANSI->OEM conversion. But in the simple_prompt function strings are written to con, and such writes go without conversion.

I've made a little test to illustrate this:

#include "stdafx.h"
#include <locale.h>

int _tmain(int argc, _TCHAR* argv[])
{
	printf("ОК\n");
	setlocale(0, "");
	fprintf(stdout, "ОК\n");
	FILE * termin = fopen("con", "w");
	fprintf(termin, "ОК\n");
	fflush(termin);
	return 0;
}

where "ОК" is "OK" with Russian letters. This test gives the following result: http://oi39.tinypic.com/35jgljs.jpg The second line is readable, while the others are not.

If it can be helpful to understand the issue, I can perform other tests.

Thanks in advance,
Alexander
Excerpts from exclusion's message of sáb mar 03 15:44:37 -0300 2012: > I'm using PostgreSQL on Windows with the Russian locale and get unreadable > messages when the postgres utilities prompt me for input. > Please look at the screenshot: > http://oi44.tinypic.com/aotje8.jpg > (psql writes an unreadable message prompting for the password.) > But at the same time the following message (WARNING) is displayed right. > > I believe it's related to setlocale and the difference between the OEM and ANSI > encodings, which we have in Windows with the Russian locale. > The startup code of psql sets the locale with the call setlocale(LC_ALL, "") and > the MSDN documentation says that the call: > Sets the locale to the default, which is the user-default ANSI code page > obtained from the operating system. > > After the call all the strings printed with printf(stdout) will go > through the ANSI->OEM conversion. > > But in the simple_prompt function strings are written to con, and such writes go > without conversion. Were you able to come up with some way to make this work? -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
I see two ways to resolve the issue.
First is to use CharToOemBuff when writing a string to the "con" and OemToCharBuff when reading an input from it.
The other is to always use stderr/stdin for Win32 as it was done for msys before. I think it's more straightforward.
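For illustration only, here is a rough sketch of what the first option could look like. The helper name write_console_oem is hypothetical and not part of the attached patch; it assumes the output stream was opened with fopen("con", "w") as simple_prompt() does today, and uses the Win32 CharToOemBuffA()/OemToCharBuffA() pair for the ANSI<->OEM conversion that such writes currently skip.

#ifdef WIN32
#include <windows.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not from the patch: convert an ANSI string to the
 * OEM code page before writing it to the console stream, because writes
 * to "con" bypass the CRT's ANSI->OEM conversion.  Reading input back
 * would use OemToCharBuffA() in the opposite direction.
 */
static void
write_console_oem(FILE *termout, const char *s)
{
	char	buf[1024];
	size_t	len = strlen(s);

	if (len >= sizeof(buf))
		len = sizeof(buf) - 1;
	CharToOemBuffA(s, buf, (DWORD) len);
	buf[len] = '\0';
	fputs(buf, termout);
	fflush(termout);
}
#endif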
I tested the attached patch (built the source with MSVC) and it fixes the issue. If it looks acceptable, then probably DEVTTY should not be used on Windows at all.
I found two other references of DEVTTY at
psql/command.c
success = saveHistory(fname ? fname : DEVTTY, -1, false, false);
and
contrib/pg_upgrade/option.c
log_opts.debug_fd = fopen(DEVTTY, "w");
By the way, is there any reason to use stderr for the prompt output, not stdout?
Regards,
Alexander
16.03.2012 23:13, Alvaro Herrera writes:
Excerpts from exclusion's message of sáb mar 03 15:44:37 -0300 2012: I'm using PostgreSQL on Windows with the Russian locale and get unreadable messages when the postgres utilities prompt me for input. Please look at the screenshot: http://oi44.tinypic.com/aotje8.jpg (psql writes an unreadable message prompting for the password.) But at the same time the following message (WARNING) is displayed right. I believe it's related to setlocale and the difference between the OEM and ANSI encodings, which we have in Windows with the Russian locale. The startup code of psql sets the locale with the call setlocale(LC_ALL, "") and the MSDN documentation says that the call: Sets the locale to the default, which is the user-default ANSI code page obtained from the operating system. After the call all the strings printed with printf(stdout) will go through the ANSI->OEM conversion. But in the simple_prompt function strings are written to con, and such writes go without conversion. Were you able to come up with some way to make this work?
Attachment
Excerpts from Alexander LAW's message of dom mar 18 06:04:51 -0300 2012: > I see two ways to resolve the issue. > First is to use CharToOemBuff when writing a string to the "con" and > OemToCharBuff when reading an input from it. > The other is to always use stderr/stdin for Win32 as it was done for > msys before. I think it's more straightforward. Using the console directly instead of stdin/out/err is more appropriate when asking for passwords and reading them back, because you can redirect the rest of the output to/from files or pipes, without the prompt interfering with that. This also explains why stderr is used instead of stdout. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Thanks, I've understood your point. Please look at the patch. It implements the first way and it makes psql work too. Regards, Alexander 20.03.2012 00:05, Alvaro Herrera writes: > Excerpts from Alexander LAW's message of dom mar 18 06:04:51 -0300 2012: >> I see two ways to resolve the issue. >> First is to use CharToOemBuff when writing a string to the "con" and >> OemToCharBuff when reading an input from it. >> The other is to always use stderr/stdin for Win32 as it was done for >> msys before. I think it's more straightforward. > Using the console directly instead of stdin/out/err is more appropriate when > asking for passwords and reading them back, because you can redirect the > rest of the output to/from files or pipes, without the prompt > interfering with that. This also explains why stderr is used instead of > stdout. >
Attachment
Excerpts from Alexander LAW's message of mar mar 20 16:50:14 -0300 2012: > Thanks, I've understood your point. > Please look at the patch. It implements the first way and it makes psql > work too. Great, thanks. Hopefully somebody with Windows-compile abilities will have a look at this. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: BUG #6742: pg_dump doesn't convert encoding of DB object names to OS encoding
From: Alexander Law
Hello, The dump file itself is correct. The issue is only with the non-ASCII object names in pg_dump messages. The message text (which is non-ASCII too) is displayed consistently in the right encoding (i.e. in the OS encoding, thanks to libintl/gettext), but the encoding of DB object names depends on the dump encoding, so they become unreadable when a different encoding is used. The same can be reproduced on Linux (where the console encoding is UTF-8) when dumping with Windows-1251 or Latin1 (for Western European languages). Thanks, Alexander

The following bug has been logged on the website: Bug reference: 6742 Logged by: Alexander LAW Email address: exclusion(at)gmail(dot)com PostgreSQL version: 9.1.4 Operating system: Windows Description: When I try to dump a database with UTF-8 encoding on Windows, I get unreadable object names. Please look at the screenshot (http://oi50.tinypic.com/2lw6ipf.jpg). In the left window all the pg_dump messages are displayed correctly (except for the password prompt (bug #6510)), but the non-ASCII object name is gibberish. In the right window (where the dump is done with the Windows-1251 encoding (the OS encoding for the Russian locale)) everything is right.

Did you check the dump file using an editor that can handle UTF-8? The Windows console is not known for properly handling that encoding. Thomas
Hello! May I propose a solution and step up? I've read the discussion of bug #5800 and here are my 2 cents. To make things clear, let me give an example. I am a PostgreSQL hosting provider and I let my customers create any databases they wish. I have clients all over the world (so they can create databases with different encodings). The question is - what do I (as admin) want to see in my postgresql log, which contains errors from all the databases? IMHO we should consider two requirements for the log. First, the file should be readable with a generic text viewer. Second, it should be as useful and complete as possible.

Now I see the following solutions.
A. We have different logfiles for each database with different encodings. Then all our logs will be readable, but we have to look at them one by one and it's inconvenient at least. Moreover, our log reader should understand what encoding to use for each file.
B. We have one logfile with the operating system encoding. The first downside is that the logs can be different for different OSes. The second is that Windows has a non-Unicode system encoding, and such an encoding can't represent all the national characters. So at best I will get ??? in the log.
C. We have one logfile with UTF-8. Pros: Log messages of all our clients can fit in it. We can use any generic editor/viewer to open it. Nothing changes for Linux (and other OSes with UTF-8 encoding). Cons: All the strings written to the log file have to go through some conversion function.

I think that the last solution is the solution. What is your opinion? In fact the problem exists even with a simple installation on Windows when you use a non-English locale. So the solution would be useful for many of us. Best regards, Alexander P.S. sorry for the wrong subject in my previous message sent to pgsql-general

On 05/23/2012 09:15 AM, yi huang wrote: > I'm using postgresql 9.1.3 from debian squeeze-backports with > zh_CN.UTF-8 locale, i find my main log (which is > "/var/log/postgresql/postgresql-9.1-main.log") contains "???" which > indicate some sort of charset encoding problem. It's a known issue, I'm afraid. The PostgreSQL postmaster logs in the system locale, and the PostgreSQL backends log in whatever encoding their database is in. They all write to the same log file, producing a log file full of mixed encoding data that'll choke many text editors. If you force your editor to re-interpret the file according to the encoding your database(s) are in, this may help. In the future it's possible that this may be fixed by logging output to different files on a per-database basis or by converting the text encoding of log messages, but no agreement has been reached on the correct approach and nobody has stepped up to implement it. -- Craig Ringer
> C. We have one logfile with UTF-8. > Pros: Log messages of all our clients can fit in it. We can use any > generic editor/viewer to open it. > Nothing changes for Linux (and other OSes with UTF-8 encoding). > Cons: All the strings written to the log file have to go through some > conversion function. > > I think that the last solution is the solution. What is your opinion? I am thinking about a variant of C. The problem with C is, converting from other encodings to UTF-8 is not cheap because it requires huge conversion tables. This may be a serious problem with a busy server. Also it is possible some information is lost in this conversion. This is because there's no guarantee that there is a one-to-one mapping between UTF-8 and other encodings. Another problem with UTF-8 is, you have to choose *one* locale when using your editor. This may or may not affect handling of strings in your editor. My idea is using the mule-internal encoding for the log file instead of UTF-8. There are several advantages: 1) Conversion to the mule-internal encoding is cheap because no conversion table is required. Also no information loss happens in this conversion. 2) The mule-internal encoding can be handled by emacs, one of the most popular editors in the world. 3) No need to worry about locale. The mule-internal encoding has enough information about language. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Tatsuo Ishii <ishii@postgresql.org> writes: > My idea is using mule-internal encoding for the log file instead of > UTF-8. There are several advantages: > 1) Converion to mule-internal encoding is cheap because no conversion > table is required. Also no information loss happens in this > conversion. > 2) Mule-internal encoding can be handled by emacs, one of the most > popular editors in the world. > 3) No need to worry about locale. Mule-internal encoding has enough > information about language. Um ... but ... (1) nothing whatsoever can read MULE, except emacs and xemacs. (2) there is more than one version of MULE (emacs versus xemacs, not to mention any possible cross-version discrepancies). (3) from a log volume standpoint, this could be pretty disastrous. I'm not for a write-only solution, which is pretty much what this would be. regards, tom lane
> Tatsuo Ishii <ishii@postgresql.org> writes: >> My idea is using the mule-internal encoding for the log file instead of >> UTF-8. There are several advantages: > >> 1) Conversion to the mule-internal encoding is cheap because no conversion >> table is required. Also no information loss happens in this >> conversion. > >> 2) The mule-internal encoding can be handled by emacs, one of the most >> popular editors in the world. > >> 3) No need to worry about locale. The mule-internal encoding has enough >> information about language. > > Um ... but ... > > (1) nothing whatsoever can read MULE, except emacs and xemacs. > > (2) there is more than one version of MULE (emacs versus xemacs, > not to mention any possible cross-version discrepancies). > > (3) from a log volume standpoint, this could be pretty disastrous. > > I'm not for a write-only solution, which is pretty much what this > would be. I'm not sure how long xemacs will survive (the last stable release of xemacs was in 2009). Anyway, I'm not too worried about your points, since it's easy to convert back from mule-internal-encoded log files to the original encoding-mixed log file. No information will be lost. Even converting to UTF-8 should be possible. My point is, once the log file is converted to UTF-8, there's no way to convert back to the original-encoding log file. Probably we would treat mule-internal-encoded log files as an internal format, and have a utility which does the conversion from mule-internal to UTF-8. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On 07/18/2012 11:16 PM, Alexander Law wrote: > Hello! > > May I to propose a solution and to step up? > > I've read a discussion of the bug #5800 and here is my 2 cents. > To make things clear let me give an example. > I am a PostgreSQL hosting provider and I let my customers to create > any databases they wish. > I have clients all over the world (so they can create databases with > different encoding). > > The question is - what I (as admin) want to see in my postgresql log, > containing errors from all the databases? > IMHO we should consider two requirements for the log. > First, The file should be readable with a generic text viewer. Second, > It should be useful and complete as possible. > > Now I see following solutions. > A. We have different logfiles for each database with different encodings. > Then all our logs will be readable, but we have to look at them one by > onе and it's inconvenient at least. > Moreover, our log reader should understand what encoding to use for > each file. > > B. We have one logfile with the operating system encoding. > First downside is that the logs can be different for different OSes. > The second is that Windows has non-Unicode system encoding. > And such an encoding can't represent all the national characters. So > at best I will get ??? in the log. > > C. We have one logfile with UTF-8. > Pros: Log messages of all our clients can fit in it. We can use any > generic editor/viewer to open it. > Nothing changes for Linux (and other OSes with UTF-8 encoding). > Cons: All the strings written to log file should go through some > conversation function. > > I think that the last solution is the solution. What is your opinion? Implementing any of these isn't trivial - especially making sure messages emitted to stderr from things like segfaults and dynamic linker messages are always correct. Ensuring that the logging collector knows when setlocale() has been called to change the encoding and translation of system messages, handling the different logging output methods, etc - it's going to be fiddly. I have some performance concerns about the transcoding required for (b) or (c), but realistically it's already the norm to convert all the data sent to and from clients. Conversion for logging should not be a significant additional burden. Conversion can be short-circuited out when source and destination encodings are the same for the common case of logging in utf-8 or to a dedicated file. I suspect the eventual choice will be "all of the above": - Default to (b) or (c), both have pros and cons. I favour (c) with a UTF-8 BOM to warn editors, but (b) is nice for people whose DBs are all in the system locale. - Allow (a) for people who have many different DBs in many different encodings, do high volume logging, and want to avoid conversion overhead. Let them deal with the mess, just provide an additional % code for the encoding so they can name their per-DB log files to indicate the encoding. The main issue is just that code needs to be prototyped, cleaned up, and submitted. So far nobody's cared enough to design it, build it, and get it through patch review. I've just foolishly volunteered myself to work on an automated crash-test system for virtual plug-pull testing, so I'm not stepping up. -- Craig Ringer
Hello,
I believe that postgres has such conversion functions anyway. And they are used for data conversion when we have clients (and databases) with different encodings. So if they can be used for data, why not use them for the relatively small amount of log messages?

>> C. We have one logfile with UTF-8. Pros: Log messages of all our clients can fit in it. We can use any generic editor/viewer to open it. Nothing changes for Linux (and other OSes with UTF-8 encoding). Cons: All the strings written to the log file have to go through some conversion function. I think that the last solution is the solution. What is your opinion?
> I am thinking about a variant of C. The problem with C is, converting from other encodings to UTF-8 is not cheap because it requires huge conversion tables. This may be a serious problem with a busy server. Also it is possible some information is lost in this conversion. This is because there's no guarantee that there is a one-to-one mapping between UTF-8 and other encodings. Another problem with UTF-8 is, you have to choose *one* locale when using your editor. This may or may not affect handling of strings in your editor. My idea is using the mule-internal encoding for the log file instead of UTF-8. There are several advantages: 1) Conversion to the mule-internal encoding is cheap because no conversion table is required. Also no information loss happens in this conversion. 2) The mule-internal encoding can be handled by emacs, one of the most popular editors in the world. 3) No need to worry about locale. The mule-internal encoding has enough information about language. --
And regarding mule internal encoding - reading about Mule http://www.emacswiki.org/emacs/UnicodeEncoding I found:
In future (probably Emacs 22), Mule will use an internal encoding which is a UTF-8 encoding of a superset of Unicode.
So I still see UTF-8 as a common denominator for all the encodings.
I am not aware of any characters absent in Unicode. Can you please provide some examples of those that can result in lossy conversion?
Choosing UTF-8 in a viewer/editor is no big deal either. Most of them detect UTF-8 automagically, and for the others a BOM can be added.
Best regards,
Alexander
Hello, > > Implementing any of these isn't trivial - especially making sure > messages emitted to stderr from things like segfaults and dynamic > linker messages are always correct. Ensuring that the logging > collector knows when setlocale() has been called to change the > encoding and translation of system messages, handling the different > logging output methods, etc - it's going to be fiddly. > > I have some performance concerns about the transcoding required for > (b) or (c), but realistically it's already the norm to convert all the > data sent to and from clients. Conversion for logging should not be a > significant additional burden. Conversion can be short-circuited out > when source and destination encodings are the same for the common case > of logging in utf-8 or to a dedicated file. > The initial issue was that the log file contains messages in different encodings. So transcoding is performed already, but it's not consistent, and in my opinion this is the main problem. > I suspect the eventual choice will be "all of the above": > > - Default to (b) or (c), both have pros and cons. I favour (c) with a > UTF-8 BOM to warn editors, but (b) is nice for people whose DBs are > all in the system locale. As I understand it, UTF-8 is the default encoding for databases. And even when a database is in the system encoding, translated postgres messages still come in UTF-8 and will go through a UTF-8 -> system locale conversion within gettext. > > - Allow (a) for people who have many different DBs in many different > encodings, do high volume logging, and want to avoid conversion > overhead. Let them deal with the mess, just provide an additional % > code for the encoding so they can name their per-DB log files to > indicate the encoding. > I think that solution (a) can be an evolution of the logging mechanism if there is a need for it. > The main issue is just that code needs to be prototyped, cleaned up, > and submitted. So far nobody's cared enough to design it, build it, > and get it through patch review. I've just foolishly volunteered > myself to work on an automated crash-test system for virtual plug-pull > testing, so I'm not stepping up. > I see your point and I can prepare a prototype if the proposed (c) solution seems reasonable enough and can be accepted. Best regards, Alexander
>> I am thinking about a variant of C. >> >> The problem with C is, converting from other encodings to UTF-8 is not >> cheap because it requires huge conversion tables. This may be a >> serious problem with a busy server. Also it is possible some information >> is lost in this conversion. This is because there's no >> guarantee that there is a one-to-one mapping between UTF-8 and other >> encodings. Another problem with UTF-8 is, you have to choose *one* >> locale when using your editor. This may or may not affect handling of >> strings in your editor. >> >> My idea is using the mule-internal encoding for the log file instead of >> UTF-8. There are several advantages: >> >> 1) Conversion to the mule-internal encoding is cheap because no conversion >> table is required. Also no information loss happens in this >> conversion. >> >> 2) The mule-internal encoding can be handled by emacs, one of the most >> popular editors in the world. >> >> 3) No need to worry about locale. The mule-internal encoding has enough >> information about language. >> -- >> > I believe that postgres has such conversion functions anyway. And they > are used for data conversion when we have clients (and databases) with > different encodings. So if they can be used for data, why not use > them for the relatively small amount of log messages? Frontend/backend encoding conversion only happens when they are different, while conversion for logs *always* happens. A busy database could produce tons of logs (it is not unusual to log all SQL statements for auditing purposes). > And regarding mule internal encoding - reading about Mule > http://www.emacswiki.org/emacs/UnicodeEncoding I found: > /In future (probably Emacs 22), Mule will use an internal encoding > which is a UTF-8 encoding of a superset of Unicode. / > So I still see UTF-8 as a common denominator for all the encodings. > I am not aware of any characters absent in Unicode. Can you please > provide some examples of those that can result in lossy conversion? You can google for "encoding "EUC_JP" has no equivalent in "UTF8"" or some such to find such an example. In this case PostgreSQL just throws an error. For frontend/backend encoding conversion this is fine. But what should we do for logs? Apparently we cannot throw an error here. "Unification" is another problem. Some kanji characters of CJK are "unified" in Unicode. The idea of unification is, if kanji A in China, B in Japan and C in Korea look "similar", unify A, B and C to D. This is a great space saving :-) The price of this is the inability to do a round-trip conversion. You can convert A, B or C to D, but you cannot convert D back to A/B/C. BTW, I'm not wedded to the mule-internal encoding. What we need here is a "super" encoding which could include any existing encoding without information loss. For this purpose, I think we could even invent a new encoding (maybe something like the very first proposal of ISO/IEC 10646?). However, using UTF-8 for this purpose seems to be just a disaster to me. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
> Hello, >> >> Implementing any of these isn't trivial - especially making sure >> messages emitted to stderr from things like segfaults and dynamic >> linker messages are always correct. Ensuring that the logging >> collector knows when setlocale() has been called to change the >> encoding and translation of system messages, handling the different >> logging output methods, etc - it's going to be fiddly. >> >> I have some performance concerns about the transcoding required for >> (b) or (c), but realistically it's already the norm to convert all the >> data sent to and from clients. Conversion for logging should not be a >> significant additional burden. Conversion can be short-circuited out >> when source and destination encodings are the same for the common case >> of logging in utf-8 or to a dedicated file. >> > The initial issue was that log file contains messages in different > encodings. So transcoding is performed already, but it's not This is not true. Transcoding happens only when PostgreSQL is built with --enable-nls option (default is no nls). > consistent and in my opinion this is the main problem. > >> I suspect the eventual choice will be "all of the above": >> >> - Default to (b) or (c), both have pros and cons. I favour (c) with a >> - UTF-8 BOM to warn editors, but (b) is nice for people whose DBs are >> - all in the system locale. > As I understand UTF-8 is the default encoding for databases. And even > when a database is in the system encoding, translated postgres > messages still come in UTF-8 and will go through UTF-8 -> System > locale conversion within gettext. Again, this is not always true. >> - Allow (a) for people who have many different DBs in many different >> - encodings, do high volume logging, and want to avoid conversion >> - overhead. Let them deal with the mess, just provide an additional % >> - code for the encoding so they can name their per-DB log files to >> - indicate the encoding. >> > I think that (a) solution can be an evolvement of the logging > mechanism if there will be a need for it. >> The main issue is just that code needs to be prototyped, cleaned up, >> and submitted. So far nobody's cared enough to design it, build it, >> and get it through patch review. I've just foolishly volunteered >> myself to work on an automated crash-test system for virtual plug-pull >> testing, so I'm not stepping up. >> > I see you point and I can prepare a prototype if the proposed (c) > solution seems reasonable enough and can be accepted. > > Best regards, > Alexander > > > -- > Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-bugs
>> The initial issue was that the log file contains messages in different >> encodings. So transcoding is performed already, but it's not > This is not true. Transcoding happens only when PostgreSQL is built > with --enable-nls option (default is no nls). I'll restate the initial issue as I see it. I have Windows and I'm installing PostgreSQL for Windows (the latest version, downloaded from EnterpriseDB). Then I create a database with default settings (with UTF-8 encoding), do something wrong in my DB and get such a log file with the two different encodings (UTF-8 and Windows-1251 (ANSI)) and with localized postgres messages.
>> And regarding mule internal encoding - reading about Mule >> http://www.emacswiki.org/emacs/UnicodeEncoding I found: >> /In future (probably Emacs 22), Mule will use an internal encoding >> which is a UTF-8 encoding of a superset of Unicode. / >> So I still see UTF-8 as a common denominator for all the encodings. >> I am not aware of any characters absent in Unicode. Can you please >> provide some examples of those that can result in lossy conversion? > You can google for "encoding "EUC_JP" has no equivalent in "UTF8"" or > some such to find such an example. In this case PostgreSQL just throws > an error. For frontend/backend encoding conversion this is fine. But > what should we do for logs? Apparently we cannot throw an error here. > > "Unification" is another problem. Some kanji characters of CJK are > "unified" in Unicode. The idea of unification is, if kanji A in China, > B in Japan and C in Korea look "similar", unify A, B and C to D. This is a great > space saving :-) The price of this is the inability to do a > round-trip conversion. You can convert A, B or C to D, but you cannot > convert D back to A/B/C. > > BTW, I'm not wedded to the mule-internal encoding. What we need here is a > "super" encoding which could include any existing encoding without > information loss. For this purpose, I think we could even invent a new > encoding (maybe something like the very first proposal of ISO/IEC > 10646?). However, using UTF-8 for this purpose seems to be just a > disaster to me. > Ok, maybe the time of a real universal encoding has not yet come. Then maybe we should just add a new parameter "log_encoding" (UTF-8 by default) to postgresql.conf, and use this encoding consistently within the logging collector. If this encoding is not available then fall back to 7-bit ASCII.
>> You can google by "encoding "EUC_JP" has no equivalent in "UTF8"" or >> some such to find such an example. In this case PostgreSQL just throw >> an error. For frontend/backend encoding conversion this is fine. But >> what should we do for logs? Apparently we cannot throw an error here. >> >> "Unification" is another problem. Some kanji characters of CJK are >> "unified" in Unicode. The idea of unification is, if kanji A in China, >> B in Japan, C in Korea looks "similar" unify ABC to D. This is a great >> space saving:-) The price of this is inablity of >> round-trip-conversion. You can convert A, B or C to D, but you cannot >> convert D to A/B/C. >> >> BTW, I'm not stick with mule-internal encoding. What we need here is a >> "super" encoding which could include any existing encodings without >> information loss. For this purpose, I think we can even invent a new >> encoding(maybe something like very first prposal of ISO/IEC >> 10646?). However, using UTF-8 for this purpose seems to be just a >> disaster to me. >> > Ok, maybe the time of real universal encoding has not yet come. Then > we maybe just should add a new parameter "log_encoding" (UTF-8 by > default) to postgresql.conf. And to use this encoding consistently > within logging_collector. > If this encoding is not available then fall back to 7-bit ASCII. What do you mean by "not available"? -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
>> Ok, maybe the time of a real universal encoding has not yet come. Then >> maybe we should just add a new parameter "log_encoding" (UTF-8 by >> default) to postgresql.conf, and use this encoding consistently >> within the logging collector. >> If this encoding is not available then fall back to 7-bit ASCII. > What do you mean by "not available"? Sorry, it was an inaccurate phrase. I mean "if the conversion to this encoding is not available". For example, when we have a database in EUC_JP and log_encoding set to Latin1. I think that we can even fall back to UTF-8, as we can convert all encodings to it (with some exceptions that you noticed).
> Sorry, it was an inaccurate phrase. I mean "if the conversion to this > encoding is not available". For example, when we have a database in > EUC_JP and log_encoding set to Latin1. I think that we can even fall > back to UTF-8, as we can convert all encodings to it (with some > exceptions that you noticed). So, what you wanted to say here is: "If the conversion to this encoding is not available then fall back to UTF-8" Am I correct? Also, is it possible to completely disable the feature? -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On 19 July 2012 10:40, Alexander Law <exclusion@gmail.com> wrote: >>> Ok, maybe the time of real universal encoding has not yet come. Then >>> we maybe just should add a new parameter "log_encoding" (UTF-8 by >>> default) to postgresql.conf. And to use this encoding consistently >>> within logging_collector. >>> If this encoding is not available then fall back to 7-bit ASCII. >> >> What do you mean by "not available"? > > Sorry, it was inaccurate phrase. I mean "if the conversion to this encoding > is not avaliable". For example, when we have database in EUC_JP and > log_encoding set to Latin1. I think that we can even fall back to UTF-8 as > we can convert all encodings to it (with some exceptions that you noticed). I like Craig's idea of adding the client encoding to the log lines. A possible problem with that (I'm not an encoding expert) is that a log line like that will contain data about the database server meta-data (log time, client encoding, etc) in the database default encoding and database data (the logged query and user-supplied values) in the client encoding. One option would be to use the client encoding for the entire log line, but would that result in legible meta-data in every encoding? It appears that the primarly here is that SQL statements and user-supplied data are being logged, while the log-file is a text file in a fixed encoding. Perhaps another solution would be to add the ability to log certain types of information (not the core database server log info, of course!) to a database/table so that each record can be stored in its own encoding? That way the transcoding doesn't have to take place until someone is reading the log, you'd know what to transcode the data to (namely the client_encoding of the reading session) and there isn't any issue of transcoding errors while logging statements. -- If you can't see the forest for the trees, Cut the trees and you'll see there is no forest.
Yikes, messed up my grammar a bit I see! On 19 July 2012 10:58, Alban Hertroys <haramrae@gmail.com> wrote: > I like Craig's idea of adding the client encoding to the log lines. A > possible problem with that (I'm not an encoding expert) is that a log > line like that will contain data about the database server meta-data > (log time, client encoding, etc) in the database default encoding and ...will contain meta-data about the database server (log time... > It appears that the primarly here is that SQL statements and It appears the primary issue here... -- If you can't see the forest for the trees, Cut the trees and you'll see there is no forest.
>> Sorry, it was an inaccurate phrase. I mean "if the conversion to this >> encoding is not available". For example, when we have a database in >> EUC_JP and log_encoding set to Latin1. I think that we can even fall >> back to UTF-8, as we can convert all encodings to it (with some >> exceptions that you noticed). > So, what you wanted to say here is: > > "If the conversion to this encoding is not available then fall back to > UTF-8" > > Am I correct? > > Also, is it possible to completely disable the feature? > Yes, you are. I think it could be disabled by setting log_encoding='', but if the parameter is missing then the feature should be enabled (with UTF-8).
> I like Craig's idea of adding the client encoding to the log lines. A > possible problem with that (I'm not an encoding expert) is that a log > line like that will contain data about the database server meta-data > (log time, client encoding, etc) in the database default encoding and > database data (the logged query and user-supplied values) in the > client encoding. One option would be to use the client encoding for > the entire log line, but would that result in legible meta-data in > every encoding? I think then we get non-human-readable logs. We will need one more tool to open and convert the log (and to omit the excessive encoding specification on each line). > It appears that the primarly here is that SQL statements and > user-supplied data are being logged, while the log-file is a text file > in a fixed encoding. Yes, and in my opinion there is nothing unusual about it. XML/HTML are examples of text files with a fixed encoding that can contain multi-language strings. UTF-8 is the default encoding for XML. And when it's not good enough (as Tatsuo noticed), you can still switch to another. > Perhaps another solution would be to add the ability to log certain > types of information (not the core database server log info, of > course!) to a database/table so that each record can be stored in its > own encoding? > That way the transcoding doesn't have to take place until someone is > reading the log, you'd know what to transcode the data to (namely the > client_encoding of the reading session) and there isn't any issue of > transcoding errors while logging statements. I don't think it would be the simplest solution to the existing problem. It can be another branch of evolution, but it doesn't answer the question - what encoding to use for the core database server log?
On 19 July 2012 13:50, Alexander Law <exclusion@gmail.com> wrote: >> I like Craig's idea of adding the client encoding to the log lines. A >> possible problem with that (I'm not an encoding expert) is that a log >> line like that will contain data about the database server meta-data >> (log time, client encoding, etc) in the database default encoding and >> database data (the logged query and user-supplied values) in the >> client encoding. One option would be to use the client encoding for >> the entire log line, but would that result in legible meta-data in >> every encoding? > > I think then we get non-human readable logs. We will need one more tool to > open and convert the log (and omit excessive encoding specification in each > line). Only the parts that contain user-supplied data in very different encodings would not be "human readable", similar to what we already have. >> It appears that the primarly here is that SQL statements and >> user-supplied data are being logged, while the log-file is a text file >> in a fixed encoding. > > Yes, and in in my opinion there is nothing unusual about it. XML/HTML are > examples of a text files with fixed encoding that can contain multi-language > strings. UTF-8 is the default encoding for XML. And when it's not good > enough (as Tatsou noticed), you still can switch to another. Yes, but in those examples it is acceptable that the application fails to write the output. That, and the output needs to be converted to various different client encodings (namely that of the visitor's browser) anyway, so it does not really add any additional overhead. This doesn't hold true for database server log files. Ideally, writing those has to be reliable (how are you going to catch errors otherwise?) and should not impact the performance of the database server in a significant way (the less the better). The end result will probably be somewhere in the middle. >> Perhaps another solution would be to add the ability to log certain >> types of information (not the core database server log info, of >> course!) to a database/table so that each record can be stored in its >> own encoding? >> That way the transcoding doesn't have to take place until someone is >> reading the log, you'd know what to transcode the data to (namely the >> client_encoding of the reading session) and there isn't any issue of >> transcoding errors while logging statements. > > I don't think it would be the simplest solution of the existing problem. It > can be another branch of evolution, but it doesn't answer the question - > what encoding to use for the core database server log? It makes that problem much easier. If you need the "human-readable" logs, you can write those to a different log (namely one in the database). The result is that the server can use pretty much any encoding (or a mix of multiple!) to write its log files. You'll need a query to read the human-readable logs of course, but since they're in the database, all the tools you need are already available to you. -- If you can't see the forest for the trees, Cut the trees and you'll see there is no forest.
On 07/19/2012 03:24 PM, Tatsuo Ishii wrote: > BTW, I'm not stick with mule-internal encoding. What we need here is a > "super" encoding which could include any existing encodings without > information loss. For this purpose, I think we can even invent a new > encoding(maybe something like very first prposal of ISO/IEC > 10646?). However, using UTF-8 for this purpose seems to be just a > disaster to me. Good point re unified chars. That was always a bad idea, and that's just one of the issues it causes. I think these difficult encodings are where logging to dedicated file per-database is useful. I'm not convinced that a weird and uncommon encoding is the answer. I guess as an alternative for people for whom it's useful if it's low cost in terms of complexity/maintenance/etc... -- Craig Ringer
On 07/19/2012 04:58 PM, Alban Hertroys wrote: > On 19 July 2012 10:40, Alexander Law <exclusion@gmail.com> wrote: >>>> Ok, maybe the time of real universal encoding has not yet come. Then >>>> we maybe just should add a new parameter "log_encoding" (UTF-8 by >>>> default) to postgresql.conf. And to use this encoding consistently >>>> within logging_collector. >>>> If this encoding is not available then fall back to 7-bit ASCII. >>> What do you mean by "not available"? >> Sorry, it was inaccurate phrase. I mean "if the conversion to this encoding >> is not avaliable". For example, when we have database in EUC_JP and >> log_encoding set to Latin1. I think that we can even fall back to UTF-8 as >> we can convert all encodings to it (with some exceptions that you noticed). > I like Craig's idea of adding the client encoding to the log lines. Nonono! Log *file* *names* when one-file-per-database is in use. Encoding as a log line prefix is a terrible idea for all sorts of reasons. -- Craig Ringer
Hi Alexander, I was able to reproduce the problem based on your description and test case, and your change does resolve it for me. On Tue, Mar 20, 2012 at 11:50:14PM +0400, Alexander LAW wrote: > Thanks, I've understood your point. > Please look at the patch. It implements the first way and it makes psql > work too. > 20.03.2012 00:05, Alvaro Herrera writes: >> Excerpts from Alexander LAW's message of dom mar 18 06:04:51 -0300 2012: >>> I see two ways to resolve the issue. >>> First is to use CharToOemBuff when writing a string to the "con" and >>> OemToCharBuff when reading an input from it. >>> The other is to always use stderr/stdin for Win32 as it was done for >>> msys before. I think it's more straightforward. >> Using the console directly instead of stdin/out/err is more appropriate when >> asking for passwords and reading them back, because you can redirect the >> rest of the output to/from files or pipes, without the prompt >> interfering with that. This also explains why stderr is used instead of >> stdout. The console output code page will usually match the OEM code page, but this is not guaranteed. For example, one can change it with chcp.exe before starting psql. The conversion should be to the actual console output code page. After "chcp 869", notice how printing to stdout yields question marks while your code yields unrelated characters. It would be nicer still to find a way to make the output routines treat this explicitly-opened console like stdout to a console. I could not find any documentation around this. Digging into the CRT source code, I see that the automatic code page conversion happens in write(). One of the tests write() uses to determine whether the destination is a console is to call GetConsoleMode() on the HANDLE underlying the CRT file descriptor. If that call fails, write() assumes the target is not a console. GetConsoleMode() requires GENERIC_READ access on its subject HANDLE, but the HANDLE resulting from our fopen("con", "w") has only GENERIC_WRITE. Therefore, write() wrongly concludes that it's writing to a non-console. fopen("con", "w+") fails, but fopen("CONOUT$", "w+") seems to give the outcome we need. write() recognizes that it's writing to a console and applies the code page conversion. Let's use that. This gave me occasion to look at the special case for MSYS that you mentioned. I observe the same behavior when running a native psql in a Cygwin xterm; writes to the console succeed but do not appear anywhere. Instead of guessing at console visibility based on an environment variable witnessing a particular platform, let's check IsWindowVisible(GetConsoleWindow()). What do you think of taking that approach? Thanks, nm
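To see the behavior Noah describes, here is a small standalone test (my own sketch, not from the patch; it assumes MSVC and a visible console) that asks the CRT-level question directly: GetConsoleMode() succeeds only on the handle opened with read access, which is why fopen("CONOUT$", "w+") gets the code-page conversion while fopen("con", "w") does not.

#include <windows.h>
#include <stdio.h>
#include <io.h>

static void
report(const char *name, FILE *f)
{
	HANDLE		h;
	DWORD		mode;

	if (f == NULL)
	{
		printf("%s: fopen failed\n", name);
		return;
	}
	/* the same test the CRT's write() applies to detect a console */
	h = (HANDLE) _get_osfhandle(_fileno(f));
	printf("%s: GetConsoleMode %s\n", name,
		   GetConsoleMode(h, &mode) ? "succeeds -> treated as a console"
									 : "fails -> treated as a plain file");
}

int
main(void)
{
	report("con (w)", fopen("con", "w"));			/* GENERIC_WRITE only */
	report("CONOUT$ (w+)", fopen("CONOUT$", "w+"));	/* read + write access */
	return 0;
}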
Hi Noah, Thank you for your review. I agree with you, the CONOUT$ way is much simpler. Please look at the patch. Regarding msys - yes, that check was not correct. In fact you can use "con" with msys, if you run sh.exe, not a graphical terminal. So the issue with con is not related to msys, but to some terminal implementations. Namely, I see that con is not supported by rxvt, mintty and xterm (from the x.cygwin project). (rxvt was the default terminal for msys 1.0.10, so I think such behavior was considered an msys feature because of this.) Your solution to use IsWindowVisible(GetConsoleWindow()) works for these terminals (I've made a simple test and it returns false for all of them), but this check will not work for telnet (a console app running through telnet can use con/conout). Maybe this should be considered a distinct bug with another patch required? (I see no ideal solution for it yet. It's probably possible to detect not "ostype" but these terminals, though that would not be generic either.) And there is another issue with the console charset. When writing a string to the console, the CRT converts it to the console encoding, but when reading input back it doesn't. So it seems there should be a conversion from ConsoleCP() to ACP() and then probably to UTF-8 to make the postgres utilities support national chars in passwords or usernames (with createuser --interactive). I think it can be fixed as another bug too. Best regards, Alexander 10.10.2012 15:05, Noah Misch wrote: > Hi Alexander, > > > The console output code page will usually match the OEM code page, but this is > not guaranteed. For example, one can change it with chcp.exe before starting > psql. The conversion should be to the actual console output code page. After > "chcp 869", notice how printing to stdout yields question marks while your > code yields unrelated characters. > > It would be nicer still to find a way to make the output routines treat this > explicitly-opened console like stdout to a console. I could not find any > documentation around this. Digging into the CRT source code, I see that the > automatic code page conversion happens in write(). One of the tests write() uses > to determine whether the destination is a console is to call GetConsoleMode() > on the HANDLE underlying the CRT file descriptor. If that call fails, write() > assumes the target is not a console. GetConsoleMode() requires GENERIC_READ > access on its subject HANDLE, but the HANDLE resulting from our fopen("con", > "w") has only GENERIC_WRITE. Therefore, write() wrongly concludes that it's > writing to a non-console. fopen("con", "w+") fails, but fopen("CONOUT$", > "w+") seems to give the outcome we need. write() recognizes that it's writing > to a console and applies the code page conversion. Let's use that. > > This gave me occasion to look at the special case for MSYS that you mentioned. > I observe the same behavior when running a native psql in a Cygwin xterm; > writes to the console succeed but do not appear anywhere. Instead of guessing > at console visibility based on an environment variable witnessing a particular > platform, let's check IsWindowVisible(GetConsoleWindow()). > > What do you think of taking that approach? > > Thanks, > nm
Attachment
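A sketch of the visibility test discussed above, with a hypothetical wrapper name (not from the patch); as noted, it correctly reports false under mintty/rxvt/xterm, but it would also report false for a telnet session, so it is not a complete answer on its own.

#ifdef WIN32
#include <windows.h>

/*
 * Hypothetical helper: true only when the process has a console window
 * that the user can actually see.  Hidden consoles (mintty, rxvt, the
 * x.cygwin xterm) make GetConsoleWindow() return a window that is not
 * visible, so the caller would fall back to stderr/stdin instead.
 */
static int
have_visible_console(void)
{
	HWND		console = GetConsoleWindow();

	return console != NULL && IsWindowVisible(console);
}
#endif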
Alexander Law <exclusion@gmail.com> writes: > +#ifdef WIN32 > + termin = fopen("CONIN$", "r"); > + termout = fopen("CONOUT$", "w+"); > +#else > termin = fopen(DEVTTY, "r"); > termout = fopen(DEVTTY, "w"); > +#endif > if (!termin || !termout My immediate reaction to this patch is "that's a horrible kluge, why shouldn't we change the definition of DEVTTY instead?" Is there a similar issue in other places where we use DEVTTY? Also, why did you change the termout output mode, is that important or just randomness? regards, tom lane
On Sun, Oct 14, 2012 at 10:35:04AM +0400, Alexander Law wrote: > I agree with you, CONOUT$ way is much simpler. Please look at the patch. See comments below. > Regarding msys - yes, that check was not correct. > In fact you can use "con" with msys, if you run sh.exe, not a graphical > terminal. > So the issue with con not related to msys, but to some terminal > implementations. > Namely, I see that con is not supported by rxvt, mintty and xterm (from > x.cygwin project). > (rxvt was default terminal for msys 1.0.10, so I think such behavior was > considered as msys feature because of this) > Your solution to use IsWindowVisible(GetConsoleWindow()) works for these > terminals (I've made simple test and it returns false for all of them), > but this check will not work for telnet (console app running through > telnet can use con/conout). Thanks for testing those environments. I can reproduce the distinctive behavior when a Windows telnet client connects to a Windows telnet server. When I connect to a Windows telnet server from a GNU/Linux system, I get the normal invisible-console behavior. I also get the invisible-console behavior in PowerShell ISE. > Maybe this should be considered as a distinct bug with another patch > required? (I see no ideal solution for it yet. Probably it's possible to > detect not "ostype", but these terminals, though it would not be generic > too.) Using stdin/stderr when we could have used the console is a mild loss; use cases involving redirected output will need to account for the abnormality. Interacting with a user-invisible console is a large loss; prompts will hang indefinitely. Therefore, the test should err on the side of stdin/stderr. Since any change here seems to have its own trade-offs, yes, let's leave it for a separate patch. > And there is another issue with a console charset. When writing string > to a console CRT converts it to console encoding, but when reading input > back it doesn't. So it seems, there should be conversion from > ConsoleCP() to ACP() and then probably to UTF-8 to make postgres > utilities support national chars in passwords or usernames (with > createuser --interactive). Yes, that also deserves attention. I do not know whether converting to UTF-8 is correct. Given a username <foo> containing non-ASCII characters, you should be able to input <foo> the same way for both "psql -U <foo>" and the createuser prompt. We should also be thoughtful about backward compatibility. > I think it can be fixed as another bug too. Agreed. > --- a/src/port/sprompt.c > +++ b/src/port/sprompt.c > @@ -60,8 +60,13 @@ simple_prompt(const char *prompt, int maxlen, bool echo) > * Do not try to collapse these into one "w+" mode file. Doesn't work on > * some platforms (eg, HPUX 10.20). > */ > +#ifdef WIN32 > + termin = fopen("CONIN$", "r"); > + termout = fopen("CONOUT$", "w+"); This definitely needs a block comment explaining the behaviors that led us to select this particular implementation. > +#else > termin = fopen(DEVTTY, "r"); > termout = fopen(DEVTTY, "w"); This thread has illustrated that the DEVTTY abstraction does not suffice. I think we should remove it entirely. Remove it from port.h; use literal "/dev/tty" here; re-add it as a local #define near the one remaining use, with an XXX comment indicating that the usage is broken. If it would help, I can prepare a version with the comment changes and refactoring I have in mind. > +#endif > if (!termin || !termout > #ifdef WIN32 > /* See DEVTTY comment for msys */ Thanks, nm
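Roughly what the suggested refactoring might look like at the remaining DEVTTY call site in psql/command.c; the exact comment wording here is my own, not Noah's.

/*
 * XXX: this usage is known to be broken; /dev/tty is not a sensible
 * history target on native Windows.  DEVTTY is now local to this file.
 */
#define DEVTTY	"/dev/tty"
...
success = saveHistory(fname ? fname : DEVTTY, -1, false, false);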
On Sun, Oct 14, 2012 at 12:10:42PM -0400, Tom Lane wrote: > Alexander Law <exclusion@gmail.com> writes: > > > +#ifdef WIN32 > > + termin = fopen("CONIN$", "r"); > > + termout = fopen("CONOUT$", "w+"); > > +#else > > termin = fopen(DEVTTY, "r"); > > termout = fopen(DEVTTY, "w"); > > +#endif > > if (!termin || !termout > > My immediate reaction to this patch is "that's a horrible kluge, why > shouldn't we change the definition of DEVTTY instead?" You could make DEVTTY_IN, DEVTTY_IN_MODE, DEVTTY_OUT and DEVTTY_OUT_MODE to capture all the differences. That doesn't strike me as an improvement, and no other function would use them at present. As I explained in my reply to Alexander, we should instead remove DEVTTY. > Is there a > similar issue in other places where we use DEVTTY? Yes. However, the other use of DEVTTY arises only with readline support, not typical of native Windows builds. > Also, why did you change the termout output mode, is that important > or just randomness? It's essential: http://archives.postgresql.org/message-id/20121010110555.GA21405@tornado.leadboat.com nm
On Mon, Oct 15, 2012 at 05:41:36AM -0400, Noah Misch wrote: > > --- a/src/port/sprompt.c > > +++ b/src/port/sprompt.c > > @@ -60,8 +60,13 @@ simple_prompt(const char *prompt, int maxlen, bool echo) > > * Do not try to collapse these into one "w+" mode file. Doesn't work on > > * some platforms (eg, HPUX 10.20). > > */ > > +#ifdef WIN32 > > + termin = fopen("CONIN$", "r"); > > + termout = fopen("CONOUT$", "w+"); > > This definitely needs a block comment explaining the behaviors that led us to > select this particular implementation. > > > +#else > > termin = fopen(DEVTTY, "r"); > > termout = fopen(DEVTTY, "w"); > > This thread has illustrated that the DEVTTY abstraction does not suffice. I > think we should remove it entirely. Remove it from port.h; use literal > "/dev/tty" here; re-add it as a local #define near the one remaining use, with > an XXX comment indicating that the usage is broken. > > If it would help, I can prepare a version with the comment changes and > refactoring I have in mind. Following an off-list ack from Alexander, here is that version. No functional differences from Alexander's latest version, and I have verified that it still fixes the original test case. I'm marking this Ready for Committer. To test this on an English (United States) copy of Windows 7, I made two configuration changes in the "Region and Language" control panel. On the "Administrative" tab, choose "Change system locale..." and select Russian (Russia). After the reboot, choose "Russian (Russia)" on the "Format" tab. (Neither of these changes will affect the display language of most Windows UI components.) Finally, run "initdb -W testdatadir". Before the patch, the password prompt contained some line-drawing characters and other garbage. Afterward, it matches the string in src/bin/initdb/po/ru.po. Thanks, nm
Attachment
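For readers without the attachment, this is roughly the shape of the change under review, based on the hunks quoted earlier in the thread; the real patch carries a fuller explanatory comment.

	/*
	 * On Windows, open CONIN$/CONOUT$ rather than "con": a CONOUT$ handle
	 * opened with "w+" has read access, so the CRT's write() recognizes it
	 * as a console and converts output to the console code page.  A "con"
	 * handle opened write-only fails that check and the prompt comes out
	 * in the wrong charset.
	 */
#ifdef WIN32
	termin = fopen("CONIN$", "r");
	termout = fopen("CONOUT$", "w+");
#else
	termin = fopen("/dev/tty", "r");
	termout = fopen("/dev/tty", "w");
#endif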
Noah Misch writes: > Following an off-list ack from Alexander, here is that version. No functional > differences from Alexander's latest version, and I have verified that it still > fixes the original test case. I'm marking this Ready for Committer. This seems good to me, but I'm not comfortable committing Windows stuff. Andrew, Magnus, are you able to handle this? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services