Thread: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
[bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
"MauMau"
Date:
Hello,
I've been suffering from PostgreSQL's problems related to character encoding for some time. I really wish to solve those problems, because they make troubleshooting difficult. I'm going to propose fixes for them, and I would appreciate if you could help release the official patches as soon as possible.
The first issue is that the messages from strerror() become "???" in a typical locale/encoding combination. I found this was reported in 2010, but it was not solved.
problem with glibc strerror messages translation (was: Could not open file pg_xlog/000000010....)
http://www.postgresql.org/message-id/87pqvezp3w.fsf@home.progtech.ru
The steps to reproduce the problem are:
$ export LANG=ja_JP.UTF-8
$ initdb -E UTF8 --no-locale --lc-messages=ja_JP
$ pg_ctl start
$ psql -d postgres -c "CREATE TABLE a (col int)"
$ psql -d postgres -c "SELECT pg_relation_filepath('a')"
... This outputs something like base/xxx/yyy
$ mv $PGDATA/base/xxx/yyy a
$ psql -d postgres -c "SELECT * FROM a"
... This outputs, in Japanese, a message meaning "could not open file "base/xxx/yyy": ???".
The problem is that strerror() returns "???", which hides the cause of the trouble.
The cause is that gettext() called by strerror() tries to convert UTF-8 messages obtained from libc.mo to ASCII. This is because postgres calls setlocale(LC_CTYPE, "C") when it connects to the database.
Thus, I attached a patch (strerror_codeset.patch). This simple patch just sets the codeset for libc catalog the same as postgres catalog. As noted in the comment, I understand this is a kludge based on an undocumented fact (the catalog for strerror() is libc.mo), and may not work on all environments. However, this will help many people who work in non-English regions. Please just don't reject this because of implementation cleanness. If there is a better idea which can be implemented easily, I'd be happy to hear that.
I'm also attaching another patch, errno_str.patch, which adds the numeric value of errno to %m in ereport() like:
could not open file "base/xxx/yyy": errno=2: No such file or directory
When talking with operating system experts, numeric errno values are sometimes more useful and easy to communicate than their corresponding strings. This is a closely related but a separate proposal.
I want the first patch to be backported at least to 9.2.
Regards
MauMau
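For readers following along, the idea behind strerror_codeset.patch can be sketched roughly as below. This is not the attached patch: the helper name force_libc_codeset is hypothetical, and the sketch assumes a GNU gettext or glibc environment in which strerror() messages come from the "libc" message domain, which the message above itself describes as undocumented behavior. On systems where gettext is a separate library, linking with -lintl would be needed.

```c
#include <libintl.h>
#include <string.h>

/*
 * Ask gettext to deliver messages from the "libc" domain in the given
 * codeset, instead of the codeset implied by LC_CTYPE.
 *
 * Hypothetical helper: returns 0 if the binding was recorded, -1 otherwise.
 * Assumption: strerror() looks its translations up in the "libc" domain.
 */
int
force_libc_codeset(const char *codeset)
{
    /* On success, returns the codeset now selected for the domain. */
    const char *selected = bind_textdomain_codeset("libc", codeset);

    if (selected == NULL || strcmp(selected, codeset) != 0)
        return -1;
    return 0;
}
```

A server using this approach would call force_libc_codeset() with the name of the database encoding at session start, next to the equivalent call it already makes for its own message catalog.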
Attachment
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Tom Lane
Date:
"MauMau" <maumau307@gmail.com> writes: > I've been suffering from PostgreSQL's problems related to character encoding > for some time. I really wish to solve those problems, because they make > troubleshooting difficult. I'm going to propose fixes for them, and I would > appreciate if you could help release the official patches as soon as > possible. I don't find either of these patches to be a particularly good idea. There is certainly no way we'd risk back-patching something with as many potential side-effects as fooling with libc's textdomain. I wonder though if we could attack the specific behavior you're complaining of by testing to see if strerror() returned "???", and substituting the numeric value for that, ie * Some strerror()s return an empty string for out-of-range errno. This is * ANSI C spec compliant, but not exactly useful.*/ - if (str == NULL || *str == '\0') + if (str == NULL || *str == '\0' || strcmp(str, "???") == 0){ snprintf(errorstr_buf, sizeof(errorstr_buf), /*------ This would only work if glibc always returns that exact string for a codeset translation failure, but a look into the glibc sources should quickly confirm that. BTW: personally, I would say that what you're looking at is a glibc bug. I always thought the contract of gettext was to return the ASCII version if it fails to produce a translated version. That might not be what the end user really wants to see, but surely returning something like "???" is completely useless to anybody. regards, tom lane
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Andres Freund
Date:
On 2013-09-06 10:37:16 -0400, Tom Lane wrote:
> "MauMau" <maumau307@gmail.com> writes:
> > I've been suffering from PostgreSQL's problems related to character encoding
> > for some time. I really wish to solve those problems, because they make
> > troubleshooting difficult. I'm going to propose fixes for them, and I would
> > appreciate if you could help release the official patches as soon as
> > possible.
>
> I don't find either of these patches to be a particularly good idea.
> There is certainly no way we'd risk back-patching something with as
> many potential side-effects as fooling with libc's textdomain.
I have no clue about the gettext stuff but I am in favor of including the raw errno in strerror() messages (no backpatching tho). When doing support it's a PITA to get translated strings for those. I can lookup postgres' own translated messages in the source easy enough, but that doesn't work all that well for OS supplied messages.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes: > I have no clue about the gettext stuff but I am in favor of including > the raw errno in strerror() messages (no backpatching tho). I dislike that on grounds of readability and translatability; and I'm also of the opinion that errno codes aren't really consistent enough across platforms to be all that trustworthy for remote diagnostic purposes. I'm fine with printing the code if strerror fails to produce anything useful --- but not if it succeeds. regards, tom lane
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Andres Freund
Date:
On 2013-09-06 10:52:03 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I have no clue about the gettext stuff but I am in favor of including
> > the raw errno in strerror() messages (no backpatching tho).
>
> I dislike that on grounds of readability and translatability; and
> I'm also of the opinion that errno codes aren't really consistent
> enough across platforms to be all that trustworthy for remote diagnostic
> purposes.
Well, it's easier to get access to mappings between errno and meaning of foreign systems than to get access to their translations in my experience.
If we'd add the errno inside %m processing, I don't see how it's a problem for translation?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Greg Stark
Date:
On Fri, Sep 6, 2013 at 3:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-09-06 10:52:03 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I have no clue about the gettext stuff but I am in favor of including
> > the raw errno in strerror() messages (no backpatching tho).
>
> I dislike that on grounds of readability and translatability; and
> I'm also of the opinion that errno codes aren't really consistent
> enough across platforms to be all that trustworthy for remote diagnostic
> purposes.
Historically they weren't even the same on Linux across architectures. This was to support running native binaries from the incumbent platform (SunOS, OSF, BSD) under emulation on each architecture. I don't see any evidence of that any more but I'm not sure I'm looking in the right place.
> Well, it's easier to get access to mappings between errno and meaning of
> foreign systems than to get access to their translations in my
> experience.
That's definitely true. There are only a few possible platforms and it's not hard to convert an errno to an error string on a given platform. Converting a translated string in some language you can't read to an untranslated string is another matter.
What would be nicer would be to display the C define, EINVAL, EPERM, etc. Afaik there's no portable way to do that though. I suppose we could just have a small array or hash table of all the errors we know about and look it up.
--
greg
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> What would be nicer would be to display the C define, EINVAL, EPERM, etc.
> Afaik there's no portable way to do that though. I suppose we could just
> have a small array or hash table of all the errors we know about and look
> it up.
Yeah, I was just thinking the same thing. We could do
    switch (errno)
    {
        case EINVAL: str = "EINVAL"; break;
        case ENOENT: str = "ENOENT"; break;
        ...
#ifdef EFOOBAR
        case EFOOBAR: str = "EFOOBAR"; break;
#endif
        ...
for all the common or even less-common names, and only fall back on printing a numeric value if it's something really unusual.
But I still maintain that we should only do this if we can't get a useful string out of strerror(). There isn't any way to cram this information into the current usage of %m without doing damage to the readability and translatability of the string. Our style & translatability guidelines specifically recommend against assembling messages out of fragments, and also against sticking in parenthetical additions.
I suppose we could think about inventing another error field rather than damaging the readability of the primary message string, ie teach elog that if %m is used it should emit an additional line along the lines of
    ERRNO: EINVAL
However the cost of adding a new column to CSV log format might exceed its value.
regards, tom lane
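A compilable version of the lookup sketched above might look like the following. It is only an illustration: the function name errno_symbol and the particular set of codes are assumptions, not what was eventually committed. Codes that may be missing on some platform are wrapped in #ifdef, and aliases such as EWOULDBLOCK are left out because a switch cannot contain two case labels with the same value.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Map an errno value to its symbolic name.
 * Returns NULL for codes this table does not know about, so the
 * caller can fall back to printing the numeric value.
 */
const char *
errno_symbol(int errnum)
{
    switch (errnum)
    {
        case EACCES: return "EACCES";
        case EEXIST: return "EEXIST";
        case EINVAL: return "EINVAL";
        case ENOENT: return "ENOENT";
        case ENOMEM: return "ENOMEM";
        case EPERM:  return "EPERM";
#ifdef ELOOP
        case ELOOP:  return "ELOOP";
#endif
        default:     return NULL;
    }
}
```

Returning NULL rather than a formatted number keeps the table free of buffers; the numeric fallback stays with the caller.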
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
"MauMau"
Date:
Thank you for your opinions and ideas.
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> Greg Stark <stark@mit.edu> writes:
>> What would be nicer would be to display the C define, EINVAL, EPERM, etc.
>> Afaik there's no portable way to do that though. I suppose we could just
>> have a small array or hash table of all the errors we know about and look
>> it up.
>
> Yeah, I was just thinking the same thing. We could do
>
> switch (errno)
> {
> case EINVAL: str = "EINVAL"; break;
> case ENOENT: str = "ENOENT"; break;
> ...
> #ifdef EFOOBAR
> case EFOOBAR: str = "EFOOBAR"; break;
> #endif
> ...
>
> for all the common or even less-common names, and only fall back on
> printing a numeric value if it's something really unusual.
>
> But I still maintain that we should only do this if we can't get a useful
> string out of strerror().
OK, I'll take this approach. That is:
    str = strerror(errnum);
    if (str == NULL || *str == '\0' || *str == '?')
    {
        switch (errnum)
        {
            case EINVAL: str = "errno=EINVAL"; break;
            case ENOENT: str = "errno=ENOENT"; break;
            ...
#ifdef EFOOBAR
            case EFOOBAR: str = "EFOOBAR"; break;
#endif
            default:
                snprintf(errorstr_buf, sizeof(errorstr_buf),
                         _("operating system error %d"), errnum);
                str = errorstr_buf;
        }
    }
The number of question marks probably depends on the original message, so I won't strcmp() against "???".
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> There is certainly no way we'd risk back-patching something with as
> many potential side-effects as fooling with libc's textdomain.
Agreed. It should be better to avoid making use of undocumented behavior (i.e. strerror() uses libc.mo), if we can take another approach.
> BTW: personally, I would say that what you're looking at is a glibc bug.
> I always thought the contract of gettext was to return the ASCII version
> if it fails to produce a translated version. That might not be what the
> end user really wants to see, but surely returning something like "???"
> is completely useless to anybody.
I think so, too.
Under the same condition, PostgreSQL built with Oracle Studio on Solaris outputs correct Japanese for strerror(), and English is output on Windows. I'll contact the glibc team to ask for improvement.
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> I dislike that on grounds of readability and translatability; and
> I'm also of the opinion that errno codes aren't really consistent
> enough across platforms to be all that trustworthy for remote diagnostic
> purposes. I'm fine with printing the code if strerror fails to
> produce anything useful --- but not if it succeeds.
I don't think this is a concern, because we should ask trouble reporters about the operating system where they are running the database server.
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> There isn't any way to cram this information
> into the current usage of %m without doing damage to the readability and
> translatability of the string. Our style & translatability guidelines
> specifically recommend against assembling messages out of fragments,
> and also against sticking in parenthetical additions.
From: "Andres Freund" <andres@2ndquadrant.com>
> If we'd add the errno inside %m processing, I don't see how it's
> a problem for translation?
I'm for Andres. I don't see any problem if we don't translate "errno=%d".
I'll submit a revised patch again next week. However, I believe my original approach is better, because it outputs a user-friendly Japanese message instead of "errno=ENOENT". Plus, outputting both the errno value and its descriptive text is more useful, because the former is convenient for OS/library experts and the latter is convenient for PostgreSQL users.
Any better idea would be much appreciated.
Regards
MauMau
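The "only when strerror() is useless" rule discussed in this message can be written as a small standalone check. The sketch below is an illustration under assumptions, not the submitted patch: the names strerror_result_is_useless and describe_errno are hypothetical, the symbolic-name switch is left out for brevity, and the fallback text is not passed through a translation macro.

```c
#include <stdio.h>
#include <string.h>

/*
 * Treat a strerror() result as useless if it is missing, empty, or
 * starts with '?', the sign of a failed codeset conversion.
 */
int
strerror_result_is_useless(const char *str)
{
    return str == NULL || *str == '\0' || *str == '?';
}

/*
 * Return a printable description of errnum. The caller's buffer is
 * used only when strerror() gives nothing useful.
 */
const char *
describe_errno(int errnum, char *buf, size_t buflen)
{
    const char *str = strerror(errnum);

    if (strerror_result_is_useless(str))
    {
        snprintf(buf, buflen, "operating system error %d", errnum);
        str = buf;
    }
    return str;
}
```

Testing only the first character, rather than comparing against "???", follows the reasoning above that the number of question marks depends on the length of the original message.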
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
"MauMau"
Date:
From: "MauMau" <maumau307@gmail.com>
> OK, I'll take this approach. That is:
I did as Tom san suggested. Please review the attached patch. I chose the common errnos by selecting those that are used in the PostgreSQL source code from the error numbers defined in POSIX 2013.
As I said, the lack of a %m string has been making troubleshooting difficult, so I wish this to be backported to at least 9.2.
Regards
MauMau
Attachment
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Peter Eisentraut
Date:
On 9/6/13 10:37 AM, Tom Lane wrote: > BTW: personally, I would say that what you're looking at is a glibc bug. > I always thought the contract of gettext was to return the ASCII version > if it fails to produce a translated version. That might not be what the > end user really wants to see, but surely returning something like "???" > is completely useless to anybody. The question marks come from iconv. Take a look at what this prints: iconv po/ja.po -f utf-8 -t us-ascii//translit If you use GNU libiconv, this will print a bunch of question marks. Other implementations will probably not understand //translit and just fail the conversion. I think the use of //translit by gettext is poor judgement, because my experiments show that the quality of the results is poor and not useful for a user interface. My suggestion in this matter is to disable gettext processing when LC_CTYPE is set to C. We could log a warning when LC_MESSAGES is set to something and LC_CTYPE is set to C. Or just do the warning and keep logging. Something like that.
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes: > My suggestion in this matter is to disable gettext processing when > LC_CTYPE is set to C. We could log a warning when LC_MESSAGES is set to > something and LC_CTYPE is set to C. Or just do the warning and keep > logging. Something like that. Meh. Seems that would only prevent one specific instance of the general problem that strerror can fail to translate its result. Other locale combinations might create the same kind of failure. More generally, though, is strerror actually using gettext at all, or some homegrown implementation? As I said upthread, I would expect that gettext("foo") returns the given ASCII string "foo" if it fails to create a translated version. This is evidently not what's happening in strerror. It's way past time to look into the glibc sources and see what it's actually doing... regards, tom lane
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Peter Eisentraut
Date:
On 9/9/13 10:25 AM, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> My suggestion in this matter is to disable gettext processing when
>> LC_CTYPE is set to C. We could log a warning when LC_MESSAGES is set to
>> something and LC_CTYPE is set to C. Or just do the warning and keep
>> logging. Something like that.
>
> Meh. Seems that would only prevent one specific instance of the general
> problem that strerror can fail to translate its result. Other locale
> combinations might create the same kind of failure.
True. There isn't much we can do, really. If your LC_MESSAGES and LC_CTYPE don't get along, you get what you asked for. This isn't specific to PostgreSQL:
$ LC_CTYPE=C LC_MESSAGES=ja_JP.utf8 ls --foo
ls: ???????????`--foo'??
???? `ls --help' ????????.
> More generally, though, is strerror actually using gettext at all, or
> some homegrown implementation? As I said upthread, I would expect that
> gettext("foo") returns the given ASCII string "foo" if it fails to create
> a translated version. This is evidently not what's happening in strerror.
That is correct. It returns the original string if it cannot find a translation or the character conversion of the translation fails. But the character conversion to "US-ASCII//TRANSLIT" does not fail. It just produces an undesirable result. If you patch the gettext source to remove the //TRANSLIT, you will get the result you want.
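The conversion described above can be tried directly against the system's iconv(). The sketch below is an assumption-laden illustration: the helper name utf8_to_ascii_translit is hypothetical, and it assumes the local iconv accepts the "US-ASCII//TRANSLIT" target name. What non-ASCII input turns into (question marks, an approximation, or a failed conversion) depends on the iconv implementation, so no particular output is claimed here.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert a UTF-8 string to US-ASCII with transliteration requested.
 * Returns 0 on success, -1 if the converter is unavailable, the output
 * buffer is too small, or the implementation rejects the input.
 */
int
utf8_to_ascii_translit(const char *in, char *out, size_t outlen)
{
    iconv_t cd;
    char   *inptr = (char *) in;    /* iconv() takes a non-const pointer */
    size_t  inleft = strlen(in);
    char   *outptr = out;
    size_t  outleft;
    size_t  rc;

    if (outlen == 0)
        return -1;
    outleft = outlen - 1;           /* keep room for the terminating NUL */

    cd = iconv_open("US-ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t) -1)
        return -1;

    rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    if (rc != (size_t) -1)
        rc = iconv(cd, NULL, NULL, &outptr, &outleft);  /* flush state */
    iconv_close(cd);

    if (rc == (size_t) -1)
        return -1;

    *outptr = '\0';
    return 0;
}
```

Feeding this a translated message and comparing the result across platforms is one way to see whether a given system produces the question marks or declines the conversion.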
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes: > On 9/9/13 10:25 AM, Tom Lane wrote: >> Meh. Seems that would only prevent one specific instance of the general >> problem that strerror can fail to translate its result. Other locale >> combinations might create the same kind of failure. > True. There isn't much we can do, really. If your LC_MESSAGES and > LC_CTYPE don't get along, you get what you asked for. This isn't > specific to PostgreSQL: So should we just say this is pilot error? It may be, but if we can work around it with a reasonably small amount of effort/risk, I think it's appropriate to do that. The proposal to reject a strerror result that starts with '?' sounds plausible to me. regards, tom lane
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Noah Misch
Date:
On Mon, Sep 09, 2013 at 08:29:58AM -0400, Peter Eisentraut wrote:
> On 9/6/13 10:37 AM, Tom Lane wrote:
> > BTW: personally, I would say that what you're looking at is a glibc bug.
> > I always thought the contract of gettext was to return the ASCII version
> > if it fails to produce a translated version. That might not be what the
> > end user really wants to see, but surely returning something like "???"
> > is completely useless to anybody.
>
> The question marks come from iconv. Take a look at what this prints:
>
> iconv po/ja.po -f utf-8 -t us-ascii//translit
>
> If you use GNU libiconv, this will print a bunch of question marks.
Actually, GNU libiconv's iconv() decides that //translit is unimplementable for some of the characters in that file, and it fails the conversion. GNU libc's iconv(), on the other hand, emits the question marks.
> I think the use of //translit by gettext is poor judgement, because my
> experiments show that the quality of the results is poor and not useful
> for a user interface.
It depends on the quality of the //translit implementation. GNU libiconv's seems pretty good. It gives up for Japanese or Russian characters, so you get the English messages. For Polish, GNU libiconv transliterates like this:
msgstr "nie można usunąć pliku lub katalogu \"%s\": %s\n"
msgstr "nie mozna usuna'c pliku lub katalogu \"%s\": %s\n"
That's fair, considering what it has to work with. Ideally, (a) GNU libc should import the smarter transliteration code from GNU libiconv, and (b) GNU gettext should check for weak //translit implementations and not use //translit under such circumstances.
> My suggestion in this matter is to disable gettext processing when
> LC_CTYPE is set to C. We could log a warning when LC_MESSAGES is set to
> something and LC_CTYPE is set to C. Or just do the warning and keep
> logging. Something like that.
In an ENCODING=UTF8, LC_CTYPE=C database, no transliteration should need to happen, and no transliteration does happen for the PG messages. I think MauMau's original bind_textdomain_codeset() proposal was on the right track. We would need to do that for every relevant 3rd-party message domain, though. Ick. This suggests to me that gettext really needs an API for overriding the default codeset pertaining to message domains not subjected to bind_textdomain_codeset(). In the meantime, adding bind_textdomain_codeset() calls for known localized dependencies seems like a fine coping mechanism. If we can reasonably detect when gettext is supplying useless ????? messages, that's good, too. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes: > ... I think > MauMau's original bind_textdomain_codeset() proposal was on the right track. It might well be. My objection was to the proposal for back-patching it when we have little idea of the possible side-effects. I would be fine with handling that as a 9.4-only patch (preferably with the usual review process). > We would need to do that for every relevant 3rd-party message domain, though. > Ick. Yeah, and another question is whether 3rd-party code might not do its own bind_textdomain_codeset() call with what it thinks is the right setting, thereby overriding our attempted fix. Still, libc is certainly the source of the vast majority of potentially-translated messages that we might be passing through to users, so fixing it would be a step forward. regards, tom lane
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
"MauMau"
Date:
From: "Tom Lane" <tgl@sss.pgh.pa.us> > Noah Misch <noah@leadboat.com> writes: >> ... I think >> MauMau's original bind_textdomain_codeset() proposal was on the right >> track. > > It might well be. My objection was to the proposal for back-patching it > when we have little idea of the possible side-effects. I would be fine > with handling that as a 9.4-only patch (preferably with the usual review > process). > Still, libc is certainly the source of the vast majority of > potentially-translated messages that we might be passing through to users, > so fixing it would be a step forward. We are using 9.1/9.2 and 9.2 is probably dominant, so I would be relieved with either of the following choices: 1. Take the approach that doesn't use bind_textdomain_codeset("libc") (i.e. the second version of errno_str.patch) for 9.4 and older releases. 2. Use bind_textdomain_codeset("libc") (i.e. take strerror_codeset.patch) for 9.4, and take the non-bind_textdomain_codeset approach for older releases. Regards MauMau
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Peter Eisentraut
Date:
On 9/6/13 9:40 AM, MauMau wrote: > $ psql -d postgres -c "SELECT * FROM a" > ... This outputs, in Japanese, a message meaning "could not open file > "base/xxx/yyy": ???". > > The problem is that strerror() returns "???", which hides the cause of > the trouble. > > The cause is that gettext() called by strerror() tries to convert UTF-8 > messages obtained from libc.mo to ASCII. This is because postgres calls > setlocale(LC_CTYPE, "C") when it connects to the database. Does anyone know why the PostgreSQL-supplied part of the error message does not get messed up?
Re: Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Peter Eisentraut
Date:
On 9/9/13 2:57 PM, Noah Misch wrote: > Actually, GNU libiconv's iconv() decides that //translit is unimplementable > for some of the characters in that file, and it fails the conversion. GNU > libc's iconv(), on the other hand, emits the question marks. That can't be right, because the examples I produced earlier (which produced question marks) were produced on OS X with GNU libiconv.
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Peter Eisentraut
Date:
On 9/9/13 4:42 PM, MauMau wrote: > We are using 9.1/9.2 and 9.2 is probably dominant, so I would be > relieved with either of the following choices: > > 1. Take the approach that doesn't use bind_textdomain_codeset("libc") > (i.e. the second version of errno_str.patch) for 9.4 and older releases. > > 2. Use bind_textdomain_codeset("libc") (i.e. take > strerror_codeset.patch) for 9.4, and take the > non-bind_textdomain_codeset approach for older releases. I think we are not going to backpatch any of this. There is a clear workaround: fix your locale settings.
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
"MauMau"
Date:
From: "Peter Eisentraut" <peter_e@gmx.net> > Does anyone know why the PostgreSQL-supplied part of the error message > does not get messed up? That is because bind_textdomain_codeset() is called for postgres.mo in src/backend/utils/mb/mbutils.c, specifying the database encoding as the second argument. This is done at session start. Regards MauMau
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
"MauMau"
Date:
From: "Peter Eisentraut" <peter_e@gmx.net>
> On 9/9/13 4:42 PM, MauMau wrote:
>> 1. Take the approach that doesn't use bind_textdomain_codeset("libc")
>> (i.e. the second version of errno_str.patch) for 9.4 and older releases.
>>
>> 2. Use bind_textdomain_codeset("libc") (i.e. take
>> strerror_codeset.patch) for 9.4, and take the
>> non-bind_textdomain_codeset approach for older releases.
>
> I think we are not going to backpatch any of this. There is a clear
> workaround: fix your locale settings.
No, it's a hard workaround to take:
1. Recreate the database with LC_CTYPE = ja_JP.UTF-8. This changes various behaviors such as ORDER BY, index scan, and the performance of LIKE clause. This is almost impossible.
2. Change lc_messages in postgresql.conf to 'C'. This is OK for me as I can read/write English to some extent (though poor). But English is difficult for some (or many?) Japanese. So I hesitate to ask the users to do so.
Regards
MauMau
Re: Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Noah Misch
Date:
On Mon, Sep 09, 2013 at 05:49:38PM -0400, Peter Eisentraut wrote:
> On 9/9/13 2:57 PM, Noah Misch wrote:
> > Actually, GNU libiconv's iconv() decides that //translit is unimplementable
> > for some of the characters in that file, and it fails the conversion. GNU
> > libc's iconv(), on the other hand, emits the question marks.
>
> That can't be right, because the examples I produced earlier (which
> produced question marks) were produced on OS X with GNU libiconv.
Hmm. I get the "good" behavior (decline to transliterate Japanese) with these "iconv --version" strings:
iconv (GNU libiconv 1.11) [/usr/bin/iconv on Mac OS X 10.7]
iconv (GNU libiconv 1.14) [recently-updated fink]
iconv (GNU libiconv 1.14) [recently-updated Cygwin]
I also saw that on OpenBSD and NetBSD, though I'm not in an immediate position to check the libiconv versions there. I get the "bad" behavior (question marks) on these:
iconv (GNU libc) 2.12 [Centos 6.4]
iconv (GNU libc) 2.3.4 [CentOS 4.4]
iconv (Ubuntu EGLIBC 2.15-0ubuntu10.4) 2.15 [Ubuntu 12.04]
iconv (GNU libc) 2.5 [Ubuntu 7.04]
That sure looked like GNU libc vs. GNU libiconv, but I guess I'm missing some other factor. What is your GNU libiconv version that emits question marks?
Thanks,
nm
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Noah Misch
Date:
On Tue, Sep 10, 2013 at 05:42:06AM +0900, MauMau wrote: > From: "Tom Lane" <tgl@sss.pgh.pa.us> >> Noah Misch <noah@leadboat.com> writes: >>> ... I think >>> MauMau's original bind_textdomain_codeset() proposal was on the right >>> track. >> >> It might well be. My objection was to the proposal for back-patching it >> when we have little idea of the possible side-effects. Agreed. > We are using 9.1/9.2 and 9.2 is probably dominant, so I would be relieved > with either of the following choices: > > 1. Take the approach that doesn't use bind_textdomain_codeset("libc") > (i.e. the second version of errno_str.patch) for 9.4 and older releases. > > 2. Use bind_textdomain_codeset("libc") (i.e. take strerror_codeset.patch) > for 9.4, and take the non-bind_textdomain_codeset approach for older > releases. I like (2), at least at a high level. The concept of errno_str.patch is safe enough to back-patch. One can verify that it only changes behavior when strerror() returns NULL, an empty string, or something that begins with '?'. I can't see resenting the change when that has happened. Note that you can work around the problem today by linking PostgreSQL with a better iconv() implementation. Question-mark-damaged messages are not limited to strerror(). A combination like lc_messages=ja_JP, encoding=LATIN1, lc_ctype=en_US will produce question marks for PG and libc messages even with the bind_textdomain_codeset("libc") change. Is it worth doing anything about that? That one looks self-inflicted in comparison to the lc_messages=ja_JP, encoding=UTF8, lc_ctype=C case. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Greg Stark
Date:
On Mon, Sep 9, 2013 at 11:27 PM, MauMau <maumau307@gmail.com> wrote: > 1. Recreate the database with LC_CTYPE = ja_JP.UTF-8. This changes various > behaviors such as ORDER BY, index scan, and the performance of LIKE clause. > This is almost impossible. Wait, why does the ctype of the database affect the ctype of the messages? Shouldn't these be two separate things? One describes the character set being used to store data in the database and the other the character set the log file and clients are in. That said, we do interpolate a lot of database strings into messages. -- greg
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
"MauMau"
Date:
From: "Noah Misch" <noah@leadboat.com>
> I like (2), at least at a high level. The concept of errno_str.patch is safe
> enough to back-patch. One can verify that it only changes behavior when
> strerror() returns NULL, an empty string, or something that begins with '?'.
> I can't see resenting the change when that has happened.
Thanks for reviewing the patch.
> Question-mark-damaged messages are not limited to strerror(). A combination
> like lc_messages=ja_JP, encoding=LATIN1, lc_ctype=en_US will produce question
> marks for PG and libc messages even with the bind_textdomain_codeset("libc")
> change. Is it worth doing anything about that? That one looks self-inflicted
> in comparison to the lc_messages=ja_JP, encoding=UTF8, lc_ctype=C case.
Yeah, that might be a bit self-inflicted. But the problem may not happen with lc_messages=ja_JP.UTF-8 and lc_ctype=en_US.UTF-8. Anyway, I want to see this as a separate issue.
Regards
MauMau
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
"MauMau"
Date:
From: "Greg Stark" <stark@mit.edu>
> Wait, why does the ctype of the database affect the ctype of the
> messages? Shouldn't these be two separate things? One describes the
> character set being used to store data in the database and the other
> the character set the log file and clients are in.
At session start, PostgreSQL sets the ctype of the database to be the process's LC_CTYPE locale category in src/backend/utils/init/postinit.c:
    ctype = NameStr(dbform->datctype);
    ...
    if (pg_perm_setlocale(LC_CTYPE, ctype) == NULL)
The LC_CTYPE locale category determines the character encoding for messages obtained by gettext(). This is gettext()'s specification.
Regards
MauMau
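To see which encoding a given LC_CTYPE setting implies, and therefore which encoding gettext() will target for any domain that has not been given an explicit codeset, a small probe like the one below can help. It is a sketch: the helper name codeset_for_ctype is hypothetical, and the string returned for the "C" locale differs between C libraries, so none is asserted here.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Switch the process's LC_CTYPE category to the named locale and
 * report the character encoding that locale implies.
 * Returns NULL if the locale cannot be set.
 */
const char *
codeset_for_ctype(const char *ctype)
{
    if (setlocale(LC_CTYPE, ctype) == NULL)
        return NULL;

    /* nl_langinfo(CODESET) reflects the current LC_CTYPE setting. */
    return nl_langinfo(CODESET);
}
```

With LC_CTYPE set to "C" the reported codeset is an ASCII-only one, which is why UTF-8 catalog entries end up transliterated unless bind_textdomain_codeset() has been called for the domain in question.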
Re: Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Peter Eisentraut
Date:
On 9/9/13 9:54 PM, Noah Misch wrote: > On Mon, Sep 09, 2013 at 05:49:38PM -0400, Peter Eisentraut wrote: >> > On 9/9/13 2:57 PM, Noah Misch wrote: >>> > > Actually, GNU libiconv's iconv() decides that //translit is unimplementable >>> > > for some of the characters in that file, and it fails the conversion. GNU >>> > > libc's iconv(), on the other hand, emits the question marks. >> > >> > That can't be right, because the examples I produced earlier (which >> > produced question marks) were produced on OS X with GNU libiconv. > Hmm. I get the "good" behavior (decline to transliterate Japanese) with these > "iconv --version" strings: I might have messed up my testing. You are probably right.
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Tom Lane
Date:
"MauMau" <maumau307@gmail.com> writes: > I did as Tom san suggested. Please review the attached patch. I chose as > common errnos by selecting those which are used in PosttgreSQL source code > out of the error numbers defined in POSIX 2013. I've committed this with some editorialization (mostly, I used a case statement not a constant array, because that's more like the other places that switch on errnos in this file). > As I said, lack of %m string has been making troubleshooting difficult, so I > wish this to be backported at least 9.2. I'm waiting to see whether the buildfarm likes this before considering back-patching. regards, tom lane
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
"MauMau"
Date:
Hi, Tom san,
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> I've committed this with some editorialization (mostly, I used a case
> statement not a constant array, because that's more like the other places
> that switch on errnos in this file).
>
>> As I said, lack of %m string has been making troubleshooting difficult, so I
>> wish this to be backported at least 9.2.
>
> I'm waiting to see whether the buildfarm likes this before considering
> back-patching.
I'm very sorry to respond so late. Thank you so much for committing the patch. I liked your code and comments.
I'll be glad if you could back-port this. Personally, in practice, 9.1 and later will be sufficient.
Regards
MauMau
Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
From
Andres Freund
Date:
On 2013-12-02 19:36:01 +0900, MauMau wrote:
> I'll be glad if you could back-port this. Personally, in practice, 9.1 and
> later will be sufficient.
Already happened:
Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: REL9_3_STABLE [e3480438e] 2013-11-07 16:33:18 -0500
Branch: REL9_2_STABLE [64f5962fe] 2013-11-07 16:33:25 -0500
Branch: REL9_1_STABLE [8cfd4c6a1] 2013-11-07 16:33:28 -0500
Branch: REL9_0_STABLE [8103f49c1] 2013-11-07 16:33:34 -0500
Branch: REL8_4_STABLE [3eb777671] 2013-11-07 16:33:39 -0500
    Be more robust when strerror() doesn't give a useful result.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services