Thread: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII

Hello,

I've been suffering from PostgreSQL's problems related to character encoding
for some time.  I really wish to solve those problems, because they make
troubleshooting difficult.  I'm going to propose fixes for them, and I would
appreciate if you could help release the official patches as soon as
possible.

The first issue is that the messages from strerror() become "???" in a
typical locale/encoding combination.  I found this was reported in 2010, but
it was not solved.

problem with glibc strerror messages translation (was: Could not open file
pg_xlog/000000010....)
http://www.postgresql.org/message-id/87pqvezp3w.fsf@home.progtech.ru

The steps to reproduce the problem are:

$ export LANG=ja_JP.UTF-8
$ initdb -E UTF8 --no-locale --lc-messages=ja_JP
$ pg_ctl start
$ psql -d postgres -c "CREATE TABLE a (col int)"
$ psql -d postgres -c "SELECT pg_relation_filepath('a')"
... This outputs something like base/xxx/yyy
$ mv $PGDATA/base/xxx/yyy a
$ psql -d postgres -c "SELECT * FROM a"
... This outputs, in Japanese, a message meaning "could not open file
"base/xxx/yyy": ???".

The problem is that strerror() returns "???", which hides the cause of the
trouble.

The cause is that gettext() called by strerror() tries to convert UTF-8
messages obtained from libc.mo to ASCII.  This is because postgres calls
setlocale(LC_CTYPE, "C") when it connects to the database.

Thus, I have attached a patch (strerror_codeset.patch).  This simple patch
just sets the codeset for the libc catalog to match that of the postgres
catalog.  As noted in the comment, I understand this is a kludge based on an
undocumented fact
(the catalog for strerror() is libc.mo), and may not work on all
environments.  However, this will help many people who work in non-English
regions.  Please just don't reject this because of implementation cleanness.
If there is a better idea which can be implemented easily, I'd be happy to
hear that.
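
To illustrate the idea only (this is not the attached patch), a minimal
sketch of binding the libc message catalog to a chosen codeset might look
like the following.  The helper name bind_libc_codeset is hypothetical, and
the domain name "libc" is the undocumented assumption mentioned above.

```c
#include <libintl.h>
#include <stddef.h>

/*
 * Hypothetical helper: ask gettext to return messages from the "libc"
 * domain in the given codeset (for example "UTF-8") instead of the
 * codeset implied by LC_CTYPE.
 * Assumes strerror() looks up its messages in the "libc" domain, which
 * is not documented behavior.
 * Returns 0 on success, -1 if the binding could not be set.
 */
int
bind_libc_codeset(const char *codeset)
{
    if (codeset == NULL)
        return -1;
    return bind_textdomain_codeset("libc", codeset) != NULL ? 0 : -1;
}
```

A caller would presumably pass the name of the database encoding, in the
same way the server already does for its own catalog.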


I'm also attaching another patch, errno_str.patch, which adds the numeric
value of errno to %m in ereport() like:

could not open file "base/xxx/yyy": errno=2: No such file or directory

When talking with operating system experts, numeric errno values are
sometimes more useful and easy to communicate than their corresponding
strings.  This is a closely related but separate proposal.

I want the first patch to be backported at least to 9.2.

Regards
MauMau

"MauMau" <maumau307@gmail.com> writes:
> I've been suffering from PostgreSQL's problems related to character encoding 
> for some time.  I really wish to solve those problems, because they make 
> troubleshooting difficult.  I'm going to propose fixes for them, and I would 
> appreciate if you could help release the official patches as soon as 
> possible.

I don't find either of these patches to be a particularly good idea.
There is certainly no way we'd risk back-patching something with as
many potential side-effects as fooling with libc's textdomain.

I wonder though if we could attack the specific behavior you're
complaining of by testing to see if strerror() returned "???", and
substituting the numeric value for that, ie

  * Some strerror()s return an empty string for out-of-range errno. This is
  * ANSI C spec compliant, but not exactly useful.
  */
 
-    if (str == NULL || *str == '\0')
+    if (str == NULL || *str == '\0' || strcmp(str, "???") == 0)
     {
         snprintf(errorstr_buf, sizeof(errorstr_buf),
 /*------

This would only work if glibc always returns that exact string for a
codeset translation failure, but a look into the glibc sources should
quickly confirm that.

BTW: personally, I would say that what you're looking at is a glibc bug.
I always thought the contract of gettext was to return the ASCII version
if it fails to produce a translated version.  That might not be what the
end user really wants to see, but surely returning something like "???"
is completely useless to anybody.
        regards, tom lane



On 2013-09-06 10:37:16 -0400, Tom Lane wrote:
> "MauMau" <maumau307@gmail.com> writes:
> > I've been suffering from PostgreSQL's problems related to character encoding 
> > for some time.  I really wish to solve those problems, because they make 
> > troubleshooting difficult.  I'm going to propose fixes for them, and I would 
> > appreciate if you could help release the official patches as soon as 
> > possible.
> 
> I don't find either of these patches to be a particularly good idea.
> There is certainly no way we'd risk back-patching something with as
> many potential side-effects as fooling with libc's textdomain.

I have no clue about the gettext stuff but I am in favor of including
the raw errno in strerror() messages (no backpatching tho). When doing
support it's a PITA to get translated strings for those. I can look up
postgres' own translated messages in the source easily enough, but that
doesn't work all that well for OS-supplied messages.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Andres Freund <andres@2ndquadrant.com> writes:
> I have no clue about the gettext stuff but I am in favor of including
> the raw errno in strerror() messages (no backpatching tho).

I dislike that on grounds of readability and translatability; and
I'm also of the opinion that errno codes aren't really consistent
enough across platforms to be all that trustworthy for remote diagnostic
purposes.  I'm fine with printing the code if strerror fails to
produce anything useful --- but not if it succeeds.
        regards, tom lane



On 2013-09-06 10:52:03 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I have no clue about the gettext stuff but I am in favor of including
> > the raw errno in strerror() messages (no backpatching tho).
> 
> I dislike that on grounds of readability and translatability; and
> I'm also of the opinion that errno codes aren't really consistent
> enough across platforms to be all that trustworthy for remote diagnostic
> purposes.

Well, it's easier to get access to mappings between errno and meaning of
foreign systems than to get access to their translations in my
experience.

If we'd add the errno inside %m processing, I don't see how it's
a problem for translation?

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




On Fri, Sep 6, 2013 at 3:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-09-06 10:52:03 -0400, Tom Lane wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> > > I have no clue about the gettext stuff but I am in favor of including
> > > the raw errno in strerror() messages (no backpatching tho).
> >
> > I dislike that on grounds of readability and translatability; and
> > I'm also of the opinion that errno codes aren't really consistent
> > enough across platforms to be all that trustworthy for remote diagnostic
> > purposes.

Historically they weren't even the same on Linux across architectures. This was to support running native binaries from the incumbent platform (SunOS, OSF, BSD) under emulation on each architecture. I don't see any evidence of that any more but I'm not sure I'm looking in the right place.

> Well, it's easier to get access to mappings between errno and meaning of
> foreign systems than to get access to their translations in my
> experience.

That's definitely true. There are only a few possible platforms and it's not hard to convert an errno to an error string on a given platform. Converting a translated string in some language you can't read to an untranslated string is another matter.

What would be nicer would be to display the C define, EINVAL, EPERM, etc. Afaik there's no portable way to do that though. I suppose we could just have a small array or hash table of all the errors we know about and look it up.

--
greg



Greg Stark <stark@mit.edu> writes:
> What would be nicer would be to display the C define, EINVAL, EPERM, etc.
> Afaik there's no portable way to do that though. I suppose we could just
> have a small array or hash table of all the errors we know about and look
> it up.

Yeah, I was just thinking the same thing.  We could do

    switch (errno)
    {
        case EINVAL: str = "EINVAL"; break;
        case ENOENT: str = "ENOENT"; break;
        ...
#ifdef EFOOBAR
        case EFOOBAR: str = "EFOOBAR"; break;
#endif
        ...

for all the common or even less-common names, and only fall back on
printing a numeric value if it's something really unusual.
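
A self-contained sketch of such a lookup is shown below.  The function name
errno_symbol and the particular codes listed are illustrative assumptions; a
real table would cover many more values.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical lookup: map an errno value to the name of its C macro.
 * Only a handful of codes are listed here.
 * Returns NULL for codes that are not in the list, so the caller can
 * fall back to printing the number.
 */
const char *
errno_symbol(int errnum)
{
    switch (errnum)
    {
        case EACCES:  return "EACCES";
        case EEXIST:  return "EEXIST";
        case EINVAL:  return "EINVAL";
        case ENOENT:  return "ENOENT";
        case ENOMEM:  return "ENOMEM";
        case ENOSPC:  return "ENOSPC";
        case EPERM:   return "EPERM";
#ifdef ELOOP                    /* guard names that may not be defined */
        case ELOOP:   return "ELOOP";
#endif
        default:      return NULL;
    }
}
```

The #ifdef guard follows the EFOOBAR pattern above: a name that is missing
on some platform simply drops out of the table at compile time.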

But I still maintain that we should only do this if we can't get a useful
string out of strerror().  There isn't any way to cram this information
into the current usage of %m without doing damage to the readability and
translatability of the string.  Our style & translatability guidelines
specifically recommend against assembling messages out of fragments,
and also against sticking in parenthetical additions.

I suppose we could think about inventing another error field rather
than damaging the readability of the primary message string, ie teach
elog that if %m is used it should emit an additional line along the lines
of

    ERRNO:  EINVAL

However the cost of adding a new column to CSV log format might exceed its
value.
        regards, tom lane



Thank you for your opinions and ideas.

From: "Tom Lane" <tgl@sss.pgh.pa.us>
> Greg Stark <stark@mit.edu> writes:
>> What would be nicer would be to display the C define, EINVAL, EPERM, etc.
>> Afaik there's no portable way to do that though. I suppose we could just
>> have a small array or hash table of all the errors we know about and look
>> it up.
>
> Yeah, I was just thinking the same thing.  We could do
>
> switch (errno)
> {
> case EINVAL: str = "EINVAL"; break;
> case ENOENT: str = "ENOENT"; break;
> ...
> #ifdef EFOOBAR
> case EFOOBAR: str = "EFOOBAR"; break;
> #endif
> ...
>
> for all the common or even less-common names, and only fall back on
> printing a numeric value if it's something really unusual.
>
> But I still maintain that we should only do this if we can't get a useful
> string out of strerror().

OK, I'll take this approach.  That is:

str = strerror(errnum);
if (str == NULL || *str == '\0' || *str == '?')
{
    switch (errnum)
    {
        case EINVAL: str = "errno=EINVAL"; break;
        case ENOENT: str = "errno=ENOENT"; break;
        ...
#ifdef EFOOBAR
        case EFOOBAR: str = "EFOOBAR"; break;
#endif
        default:
            snprintf(errorstr_buf, sizeof(errorstr_buf),
                     _("operating system error %d"), errnum);
            str = errorstr_buf;
    }
}

The number of question marks probably depends on the original message, so I
won't strcmp() against "???".
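
The first-character test can be sketched on its own as follows.  The helper
name usable_strerror is hypothetical, and the symbolic-name lookup is left
out here so that only the check and the numeric fallback are shown.

```c
#include <stdio.h>

/*
 * Hypothetical helper: given the string strerror() produced for errnum,
 * return it unchanged when it looks usable; otherwise write a numeric
 * description into buf and return buf.  A result is treated as unusable
 * when it is NULL, empty, or starts with '?'.
 */
const char *
usable_strerror(const char *str, int errnum, char *buf, size_t buflen)
{
    if (str == NULL || *str == '\0' || *str == '?')
    {
        snprintf(buf, buflen, "operating system error %d", errnum);
        return buf;
    }
    return str;
}
```

A caller would use it as usable_strerror(strerror(errno), errno, buf,
sizeof(buf)).  Checking only the first character handles any number of
question marks, at the cost of also rejecting a genuine message that
happens to begin with '?'.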


From: "Tom Lane" <tgl@sss.pgh.pa.us>
> There is certainly no way we'd risk back-patching something with as
> many potential side-effects as fooling with libc's textdomain.

Agreed.  It would be better to avoid relying on undocumented behavior
(i.e. strerror() uses libc.mo), if we can take another approach.


> BTW: personally, I would say that what you're looking at is a glibc bug.
> I always thought the contract of gettext was to return the ASCII version
> if it fails to produce a translated version.  That might not be what the
> end user really wants to see, but surely returning something like "???"
> is completely useless to anybody.

I think so, too.  Under the same condition, PostgreSQL built with Oracle 
Studio on Solaris outputs correct Japanese for strerror(), and English is 
output on Windows.  I'll contact the glibc team to ask for an improvement.



From: "Tom Lane" <tgl@sss.pgh.pa.us>
> I dislike that on grounds of readability and translatability; and
> I'm also of the opinion that errno codes aren't really consistent
> enough across platforms to be all that trustworthy for remote diagnostic
> purposes.  I'm fine with printing the code if strerror fails to
> produce anything useful --- but not if it succeeds.

I don't think this is a concern, because we should ask trouble reporters 
about the operating system where they are running the database server.


From: "Tom Lane" <tgl@sss.pgh.pa.us>
> There isn't any way to cram this information
> into the current usage of %m without doing damage to the readability and
> translatability of the string.  Our style & translatability guidelines
> specifically recommend against assembling messages out of fragments,
> and also against sticking in parenthetical additions.

From: "Andres Freund" <andres@2ndquadrant.com>
> If we'd add the errno inside %m processing, I don't see how it's
> a problem for translation?

I'm for Andres.  I don't see any problem if we don't translate "errno=%d".


I'll submit a revised patch again next week.  However, I believe my original
approach is better, because it outputs a user-friendly Japanese message
instead of "errno=ENOENT".  Plus, outputting both the errno value and its
descriptive text is more useful, because the former is convenient for
OS/library experts and the latter is convenient for PostgreSQL users.  Any
better idea would be much appreciated.


Regards
MauMau




From: "MauMau" <maumau307@gmail.com>
> OK, I'll take this approach.  That is:


I did as Tom san suggested.  Please review the attached patch.  I chose the
common errnos by selecting those which are used in the PostgreSQL source code
out of the error numbers defined in POSIX 2013.

As I said, lack of %m string has been making troubleshooting difficult, so I
wish this to be backported at least to 9.2.

Regards
MauMau


From: Peter Eisentraut <peter_e@gmx.net>
On 9/6/13 10:37 AM, Tom Lane wrote:
> BTW: personally, I would say that what you're looking at is a glibc bug.
> I always thought the contract of gettext was to return the ASCII version
> if it fails to produce a translated version.  That might not be what the
> end user really wants to see, but surely returning something like "???"
> is completely useless to anybody.

The question marks come from iconv.  Take a look at what this prints:

iconv po/ja.po -f utf-8 -t us-ascii//translit

If you use GNU libiconv, this will print a bunch of question marks.
Other implementations will probably not understand //translit and just
fail the conversion.
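
For anyone who wants to try the same conversion from C rather than from the
command line, a minimal sketch follows.  The function name
utf8_to_ascii_translit is hypothetical, and the "//TRANSLIT" suffix is only
understood by some iconv implementations.  What comes out for non-ASCII
input varies between implementations, so no particular result is claimed
here.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert a UTF-8 string to US-ASCII with transliteration requested.
 * Returns 0 on success and -1 if the converter cannot be opened, the
 * conversion fails, or the output buffer is too small.
 * How characters without an ASCII equivalent are handled depends on
 * the iconv implementation in use.
 */
int
utf8_to_ascii_translit(const char *in, char *out, size_t outlen)
{
    iconv_t cd;
    char   *inptr = (char *) in;    /* iconv() takes a non-const pointer */
    char   *outptr = out;
    size_t  inleft = strlen(in);
    size_t  outleft;
    size_t  rc;

    if (outlen == 0)
        return -1;
    outleft = outlen - 1;           /* reserve room for the terminator */

    cd = iconv_open("US-ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t) -1)
        return -1;

    rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);
    if (rc == (size_t) -1)
        return -1;

    *outptr = '\0';
    return 0;
}
```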

I think the use of //translit by gettext is poor judgement, because my
experiments show that the quality of the results is poor and not useful
for a user interface.

My suggestion in this matter is to disable gettext processing when
LC_CTYPE is set to C.  We could log a warning when LC_MESSAGES is set to
something and LC_CTYPE is set to C.  Or just do the warning and keep
logging.  Something like that.




Peter Eisentraut <peter_e@gmx.net> writes:
> My suggestion in this matter is to disable gettext processing when
> LC_CTYPE is set to C.  We could log a warning when LC_MESSAGES is set to
> something and LC_CTYPE is set to C.  Or just do the warning and keep
> logging.  Something like that.

Meh.  Seems that would only prevent one specific instance of the general
problem that strerror can fail to translate its result.  Other locale
combinations might create the same kind of failure.

More generally, though, is strerror actually using gettext at all, or
some homegrown implementation?  As I said upthread, I would expect that
gettext("foo") returns the given ASCII string "foo" if it fails to create
a translated version.  This is evidently not what's happening in strerror.

It's way past time to look into the glibc sources and see what it's
actually doing...
        regards, tom lane



From: Peter Eisentraut <peter_e@gmx.net>
On 9/9/13 10:25 AM, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> My suggestion in this matter is to disable gettext processing when
>> LC_CTYPE is set to C.  We could log a warning when LC_MESSAGES is set to
>> something and LC_CTYPE is set to C.  Or just do the warning and keep
>> logging.  Something like that.
> 
> Meh.  Seems that would only prevent one specific instance of the general
> problem that strerror can fail to translate its result.  Other locale
> combinations might create the same kind of failure.

True.  There isn't much we can do, really.  If your LC_MESSAGES and
LC_CTYPE don't get along, you get what you asked for.  This isn't
specific to PostgreSQL:

$ LC_CTYPE=C LC_MESSAGES=ja_JP.utf8 ls --foo
ls: ???????????`--foo'??
???? `ls --help' ????????.

> More generally, though, is strerror actually using gettext at all, or
> some homegrown implementation?  As I said upthread, I would expect that
> gettext("foo") returns the given ASCII string "foo" if it fails to create
> a translated version.  This is evidently not what's happening in strerror.

That is correct.  It returns the original string if it cannot find a
translation or the character conversion of the translation fails.  But
the character conversion to "US-ASCII//TRANSLIT" does not fail.  It just
produces an undesirable result.  If you patch the gettext source to
remove the //TRANSLIT, you will get the result you want.





Peter Eisentraut <peter_e@gmx.net> writes:
> On 9/9/13 10:25 AM, Tom Lane wrote:
>> Meh.  Seems that would only prevent one specific instance of the general
>> problem that strerror can fail to translate its result.  Other locale
>> combinations might create the same kind of failure.

> True.  There isn't much we can do, really.  If your LC_MESSAGES and
> LC_CTYPE don't get along, you get what you asked for.  This isn't
> specific to PostgreSQL:

So should we just say this is pilot error?  It may be, but if we can work
around it with a reasonably small amount of effort/risk, I think it's
appropriate to do that.  The proposal to reject a strerror result that
starts with '?' sounds plausible to me.
        regards, tom lane



On Mon, Sep 09, 2013 at 08:29:58AM -0400, Peter Eisentraut wrote:
> On 9/6/13 10:37 AM, Tom Lane wrote:
> > BTW: personally, I would say that what you're looking at is a glibc bug.
> > I always thought the contract of gettext was to return the ASCII version
> > if it fails to produce a translated version.  That might not be what the
> > end user really wants to see, but surely returning something like "???"
> > is completely useless to anybody.
> 
> The question marks come from iconv.  Take a look at what this prints:
> 
> iconv po/ja.po -f utf-8 -t us-ascii//translit
> 
> If you use GNU libiconv, this will print a bunch of question marks.

Actually, GNU libiconv's iconv() decides that //translit is unimplementable
for some of the characters in that file, and it fails the conversion.  GNU
libc's iconv(), on the other hand, emits the question marks.

> I think the use of //translit by gettext is poor judgement, because my
> experiments show that the quality of the results is poor and not useful
> for a user interface.

It depends on the quality of the //translit implementation.  GNU libiconv's
seems pretty good.  It gives up for Japanese or Russian characters, so you get
the English messages.  For Polish, GNU libiconv transliterates like this:

msgstr "nie można usunąć pliku lub katalogu \"%s\": %s\n"
msgstr "nie mozna usuna'c pliku lub katalogu \"%s\": %s\n"

That's fair, considering what it has to work with.  Ideally, (a) GNU libc
should import the smarter transliteration code from GNU libiconv, and (b) GNU
gettext should check for weak //translit implementations and not use
//translit under such circumstances.

> My suggestion in this matter is to disable gettext processing when
> LC_CTYPE is set to C.  We could log a warning when LC_MESSAGES is set to
> something and LC_CTYPE is set to C.  Or just do the warning and keep
> logging.  Something like that.

In an ENCODING=UTF8, LC_CTYPE=C database, no transliteration should need to
happen, and no transliteration does happen for the PG messages.  I think
MauMau's original bind_textdomain_codeset() proposal was on the right track.
We would need to do that for every relevant 3rd-party message domain, though.
Ick.  This suggests to me that gettext really needs an API for overriding the
default codeset pertaining to message domains not subjected to
bind_textdomain_codeset().  In the meantime, adding bind_textdomain_codeset()
calls for known localized dependencies seems like a fine coping mechanism.

If we can reasonably detect when gettext is supplying useless ????? messages,
that's good, too.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Noah Misch <noah@leadboat.com> writes:
> ... I think
> MauMau's original bind_textdomain_codeset() proposal was on the right track.

It might well be.  My objection was to the proposal for back-patching it
when we have little idea of the possible side-effects.  I would be fine
with handling that as a 9.4-only patch (preferably with the usual review
process).

> We would need to do that for every relevant 3rd-party message domain, though.
> Ick.

Yeah, and another question is whether 3rd-party code might not do its own
bind_textdomain_codeset() call with what it thinks is the right setting,
thereby overriding our attempted fix.

Still, libc is certainly the source of the vast majority of
potentially-translated messages that we might be passing through to users,
so fixing it would be a step forward.
        regards, tom lane



From: "Tom Lane" <tgl@sss.pgh.pa.us>
> Noah Misch <noah@leadboat.com> writes:
>> ... I think
>> MauMau's original bind_textdomain_codeset() proposal was on the right 
>> track.
>
> It might well be.  My objection was to the proposal for back-patching it
> when we have little idea of the possible side-effects.  I would be fine
> with handling that as a 9.4-only patch (preferably with the usual review
> process).

> Still, libc is certainly the source of the vast majority of
> potentially-translated messages that we might be passing through to users,
> so fixing it would be a step forward.


We are using 9.1/9.2 and 9.2 is probably dominant, so I would be relieved 
with either of the following choices:

1. Take the approach that doesn't use bind_textdomain_codeset("libc") (i.e. 
the second version of errno_str.patch) for 9.4 and older releases.

2. Use bind_textdomain_codeset("libc") (i.e. take strerror_codeset.patch) 
for 9.4, and take the non-bind_textdomain_codeset approach for older 
releases.


Regards
MauMau




From: Peter Eisentraut <peter_e@gmx.net>
On 9/6/13 9:40 AM, MauMau wrote:
> $ psql -d postgres -c "SELECT * FROM a"
> ... This outputs, in Japanese, a message meaning "could not open file
> "base/xxx/yyy": ???".
> 
> The problem is that strerror() returns "???", which hides the cause of
> the trouble.
> 
> The cause is that gettext() called by strerror() tries to convert UTF-8
> messages obtained from libc.mo to ASCII.  This is because postgres calls
> setlocale(LC_CTYPE, "C") when it connects to the database.

Does anyone know why the PostgreSQL-supplied part of the error message
does not get messed up?



On 9/9/13 2:57 PM, Noah Misch wrote:
> Actually, GNU libiconv's iconv() decides that //translit is unimplementable
> for some of the characters in that file, and it fails the conversion.  GNU
> libc's iconv(), on the other hand, emits the question marks.

That can't be right, because the examples I produced earlier (which
produced question marks) were produced on OS X with GNU libiconv.




From: Peter Eisentraut <peter_e@gmx.net>
On 9/9/13 4:42 PM, MauMau wrote:
> We are using 9.1/9.2 and 9.2 is probably dominant, so I would be
> relieved with either of the following choices:
> 
> 1. Take the approach that doesn't use bind_textdomain_codeset("libc")
> (i.e. the second version of errno_str.patch) for 9.4 and older releases.
> 
> 2. Use bind_textdomain_codeset("libc") (i.e. take
> strerror_codeset.patch) for 9.4, and take the
> non-bind_textdomain_codeset approach for older releases.

I think we are not going to backpatch any of this.  There is a clear
workaround: fix your locale settings.




From: "Peter Eisentraut" <peter_e@gmx.net>
> Does anyone know why the PostgreSQL-supplied part of the error message
> does not get messed up?

That is because bind_textdomain_codeset() is called for postgres.mo in 
src/backend/utils/mb/mbutils.c, specifying the database encoding as the 
second argument.  This is done at session start.

Regards
MauMau




From: "Peter Eisentraut" <peter_e@gmx.net>
> On 9/9/13 4:42 PM, MauMau wrote:
>> 1. Take the approach that doesn't use bind_textdomain_codeset("libc")
>> (i.e. the second version of errno_str.patch) for 9.4 and older releases.
>>
>> 2. Use bind_textdomain_codeset("libc") (i.e. take
>> strerror_codeset.patch) for 9.4, and take the
>> non-bind_textdomain_codeset approach for older releases.
>
> I think we are not going to backpatch any of this.  There is a clear
> workaround: fix your locale settings.

No, it's a hard workaround to take:

1. Recreate the database with LC_CTYPE = ja_JP.UTF-8.  This changes various 
behaviors such as ORDER BY, index scan, and the performance of LIKE clause. 
This is almost impossible.

2. Change lc_messages in postgresql.conf to 'C'.  This is OK for me as I can
read/write English to some extent (though poorly).  But English is difficult
for some (or many?) Japanese users.

So I hesitate to ask the users to do so.

Regards
MauMau






On Mon, Sep 09, 2013 at 05:49:38PM -0400, Peter Eisentraut wrote:
> On 9/9/13 2:57 PM, Noah Misch wrote:
> > Actually, GNU libiconv's iconv() decides that //translit is unimplementable
> > for some of the characters in that file, and it fails the conversion.  GNU
> > libc's iconv(), on the other hand, emits the question marks.
> 
> That can't be right, because the examples I produced earlier (which
> produced question marks) were produced on OS X with GNU libiconv.

Hmm.  I get the "good" behavior (decline to transliterate Japanese) with these
"iconv --version" strings:

iconv (GNU libiconv 1.11) [/usr/bin/iconv on Mac OS X 10.7]
iconv (GNU libiconv 1.14) [recently-updated fink]
iconv (GNU libiconv 1.14) [recently-updated Cygwin]

I also saw that on OpenBSD and NetBSD, though I'm not in an immediate position
to check the libiconv versions there.  I get the "bad" behavior (question
marks) on these:

iconv (GNU libc) 2.12 [Centos 6.4]
iconv (GNU libc) 2.3.4 [CentOS 4.4]
iconv (Ubuntu EGLIBC 2.15-0ubuntu10.4) 2.15 [Ubuntu 12.04]
iconv (GNU libc) 2.5 [Ubuntu 7.04]

That sure looked like GNU libc vs. GNU libiconv, but I guess I'm missing some
other factor.  What is your GNU libiconv version that emits question marks?

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



On Tue, Sep 10, 2013 at 05:42:06AM +0900, MauMau wrote:
> From: "Tom Lane" <tgl@sss.pgh.pa.us>
>> Noah Misch <noah@leadboat.com> writes:
>>> ... I think
>>> MauMau's original bind_textdomain_codeset() proposal was on the right 
>>> track.
>>
>> It might well be.  My objection was to the proposal for back-patching it
>> when we have little idea of the possible side-effects.

Agreed.

> We are using 9.1/9.2 and 9.2 is probably dominant, so I would be relieved 
> with either of the following choices:
>
> 1. Take the approach that doesn't use bind_textdomain_codeset("libc") 
> (i.e. the second version of errno_str.patch) for 9.4 and older releases.
>
> 2. Use bind_textdomain_codeset("libc") (i.e. take strerror_codeset.patch) 
> for 9.4, and take the non-bind_textdomain_codeset approach for older  
> releases.

I like (2), at least at a high level.  The concept of errno_str.patch is safe
enough to back-patch.  One can verify that it only changes behavior when
strerror() returns NULL, an empty string, or something that begins with '?'.
I can't see resenting the change when that has happened.

Note that you can work around the problem today by linking PostgreSQL with a
better iconv() implementation.

Question-mark-damaged messages are not limited to strerror().  A combination
like lc_messages=ja_JP, encoding=LATIN1, lc_ctype=en_US will produce question
marks for PG and libc messages even with the bind_textdomain_codeset("libc")
change.  Is it worth doing anything about that?  That one looks self-inflicted
in comparison to the lc_messages=ja_JP, encoding=UTF8, lc_ctype=C case.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



On Mon, Sep 9, 2013 at 11:27 PM, MauMau <maumau307@gmail.com> wrote:
> 1. Recreate the database with LC_CTYPE = ja_JP.UTF-8.  This changes various
> behaviors such as ORDER BY, index scan, and the performance of LIKE clause.
> This is almost impossible.


Wait, why does the ctype of the database affect the ctype of the
messages? Shouldn't these be two separate things? One describes the
character set being used to store data in the database and the other
the character set the log file and clients are in.

That said, we do interpolate a lot of database strings into messages.

-- 
greg



From: "Noah Misch" <noah@leadboat.com>
> I like (2), at least at a high level.  The concept of errno_str.patch is 
> safe
> enough to back-patch.  One can verify that it only changes behavior when
> strerror() returns NULL, an empty string, or something that begins with 
> '?'.
> I can't see resenting the change when that has happened.

Thanks for reviewing the patch.


> Question-mark-damaged messages are not limited to strerror().  A 
> combination
> like lc_messages=ja_JP, encoding=LATIN1, lc_ctype=en_US will produce 
> question
> marks for PG and libc messages even with the 
> bind_textdomain_codeset("libc")
> change.  Is it worth doing anything about that?  That one looks 
> self-inflicted
> in comparison to the lc_messages=ja_JP, encoding=UTF8, lc_ctype=C case.

Yeah, that might be a bit self-inflicted.  But the problem may not happen 
with lc_messages=ja_JP.UTF-8 and lc_ctype=en_US.UTF-8.  Anyway, I want to 
see this as a separate issue.


Regards
MauMau




From: "Greg Stark" <stark@mit.edu>
> Wait, why does the ctype of the database affect the ctype of the
> messages? Shouldn't these be two separate things? One describes the
> character set being used to store data in the database and the other
> the character set the log file and clients are in.


At session start, PostgreSQL sets the ctype of the database to be the 
process's LC_CTYPE locale category in src/backend/utils/init/postinit.c:

    ctype = NameStr(dbform->datctype);
    ...
    if (pg_perm_setlocale(LC_CTYPE, ctype) == NULL)

The LC_CTYPE locale category determines the character encoding for messages 
obtained by gettext().  This is gettext()'s specification.
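
A small sketch of how to see which codeset a given LC_CTYPE setting implies
is shown below.  The function name codeset_for_ctype is hypothetical.  It
relies on nl_langinfo(CODESET), which reports the character encoding of the
current LC_CTYPE locale; that is the encoding gettext() targets by default
for a domain that has not been given its own codeset.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Set LC_CTYPE to the given locale name and return the name of the
 * codeset that locale implies, or NULL if the locale is not available.
 * The returned string may be overwritten by later locale calls.
 */
const char *
codeset_for_ctype(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

Calling codeset_for_ctype("C") and printing the result shows the encoding
that translated libc messages are converted to in the case discussed in
this thread.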


Regards
MauMau




On 9/9/13 9:54 PM, Noah Misch wrote:
> On Mon, Sep 09, 2013 at 05:49:38PM -0400, Peter Eisentraut wrote:
>> On 9/9/13 2:57 PM, Noah Misch wrote:
>>> Actually, GNU libiconv's iconv() decides that //translit is unimplementable
>>> for some of the characters in that file, and it fails the conversion.  GNU
>>> libc's iconv(), on the other hand, emits the question marks.
>>
>> That can't be right, because the examples I produced earlier (which
>> produced question marks) were produced on OS X with GNU libiconv.
>
> Hmm.  I get the "good" behavior (decline to transliterate Japanese) with these
> "iconv --version" strings:

I might have messed up my testing.  You are probably right.



"MauMau" <maumau307@gmail.com> writes:
> I did as Tom san suggested.  Please review the attached patch.  I chose as 
> common errnos by selecting those which are used in PostgreSQL source code 
> out of the error numbers defined in POSIX 2013.

I've committed this with some editorialization (mostly, I used a case
statement not a constant array, because that's more like the other places
that switch on errnos in this file).

> As I said, lack of %m string has been making troubleshooting difficult, so I 
> wish this to be backported at least 9.2.

I'm waiting to see whether the buildfarm likes this before considering
back-patching.
        regards, tom lane



Hi, Tom san,

From: "Tom Lane" <tgl@sss.pgh.pa.us>
> I've committed this with some editorialization (mostly, I used a case
> statement not a constant array, because that's more like the other places
> that switch on errnos in this file).
>
>> As I said, lack of %m string has been making troubleshooting difficult, 
>> so I
>> wish this to be backported at least 9.2.
>
> I'm waiting to see whether the buildfarm likes this before considering
> back-patching.

I'm very sorry to respond so late.  Thank you so much for committing the
patch.  I liked your code and comments.

I'll be glad if you could back-port this.  Personally, in practice, 9.1 and 
later will be sufficient.

Regards
MauMau




On 2013-12-02 19:36:01 +0900, MauMau wrote:
> I'll be glad if you could back-port this.  Personally, in practice, 9.1 and
> later will be sufficient.

Already happened:

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: REL9_3_STABLE [e3480438e] 2013-11-07 16:33:18 -0500
Branch: REL9_2_STABLE [64f5962fe] 2013-11-07 16:33:25 -0500
Branch: REL9_1_STABLE [8cfd4c6a1] 2013-11-07 16:33:28 -0500
Branch: REL9_0_STABLE [8103f49c1] 2013-11-07 16:33:34 -0500
Branch: REL8_4_STABLE [3eb777671] 2013-11-07 16:33:39 -0500
   Be more robust when strerror() doesn't give a useful result.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services