On Mon, Sep 09, 2013 at 08:29:58AM -0400, Peter Eisentraut wrote:
> On 9/6/13 10:37 AM, Tom Lane wrote:
> > BTW: personally, I would say that what you're looking at is a glibc bug.
> > I always thought the contract of gettext was to return the ASCII version
> > if it fails to produce a translated version. That might not be what the
> > end user really wants to see, but surely returning something like "???"
> > is completely useless to anybody.
>
> The question marks come from iconv. Take a look at what this prints:
>
> iconv po/ja.po -f utf-8 -t us-ascii//translit
>
> If you use GNU libiconv, this will print a bunch of question marks.

Actually, GNU libiconv's iconv() decides that //translit is unimplementable
for some of the characters in that file, and it fails the conversion. GNU
libc's iconv(), on the other hand, emits the question marks.

> I think the use of //translit by gettext is poor judgement, because my
> experiments show that the quality of the results is poor and not useful
> for a user interface.

It depends on the quality of the //translit implementation. GNU libiconv's
seems pretty good. It gives up for Japanese or Russian characters, so you get
the English messages. For Polish, GNU libiconv transliterates like this
(original first, result second):

  msgstr "nie można usunąć pliku lub katalogu \"%s\": %s\n"
  msgstr "nie mozna usuna'c pliku lub katalogu \"%s\": %s\n"

That's fair, considering what it has to work with. Ideally, (a) GNU libc
should import the smarter transliteration code from GNU libiconv, and (b) GNU
gettext should check for weak //translit implementations and not use
//translit under such circumstances.
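
To make (b) concrete, here is a rough sketch of the kind of probe I have in
mind. It is not code from gettext or PostgreSQL; translit_to_ascii(),
probe_translit() and the translit_quality enum are names I made up for
illustration. It asks iconv() to transliterate a single Polish letter and
classifies the outcome. I believe the result from GNU libc can also depend on
the current LC_CTYPE setting, so treat the classification as a hint only.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification; not part of iconv or gettext. */
typedef enum
{
    TRANSLIT_UNAVAILABLE,   /* iconv_open() rejected the request */
    TRANSLIT_GAVE_UP,       /* iconv() failed instead of guessing */
    TRANSLIT_PLACEHOLDER,   /* iconv() substituted '?' */
    TRANSLIT_OK             /* iconv() produced some other ASCII text */
} translit_quality;

/*
 * Convert UTF-8 "in" to US-ASCII with //TRANSLIT, storing a NUL-terminated
 * result in "out".  Returns 0 on success, -1 if the conversion failed (or
 * the buffer was too small), -2 if the converter could not be opened.
 */
static int
translit_to_ascii(const char *in, char *out, size_t outsize)
{
    iconv_t     cd;
    char       *inptr = (char *) in;    /* iconv() takes non-const char ** */
    size_t      inleft = strlen(in);
    char       *outptr = out;
    size_t      outleft;
    size_t      rc;

    if (outsize == 0)
        return -1;
    outleft = outsize - 1;      /* reserve room for the terminator */

    cd = iconv_open("US-ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t) -1)
        return -2;

    rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);

    if (rc == (size_t) -1)
        return -1;

    *outptr = '\0';
    return 0;
}

/* Probe with U+017C (LATIN SMALL LETTER Z WITH DOT ABOVE), UTF-8 C5 BC. */
static translit_quality
probe_translit(void)
{
    char        buf[16];
    int         rc = translit_to_ascii("\xc5\xbc", buf, sizeof(buf));

    if (rc == -2)
        return TRANSLIT_UNAVAILABLE;
    if (rc == -1)
        return TRANSLIT_GAVE_UP;
    if (strchr(buf, '?') != NULL)
        return TRANSLIT_PLACEHOLDER;
    return TRANSLIT_OK;
}
```

A caller would skip //translit on TRANSLIT_PLACEHOLDER or
TRANSLIT_UNAVAILABLE and fall back to the untranslated msgid. A single
probe character is obviously a crude test; a real check would want a few
samples from different scripts.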
> My suggestion in this matter is to disable gettext processing when
> LC_CTYPE is set to C. We could log a warning when LC_MESSAGES is set to
> something and LC_CTYPE is set to C. Or just do the warning and keep
> logging. Something like that.

In an ENCODING=UTF8, LC_CTYPE=C database, no transliteration should need to
happen, and no transliteration does happen for the PG messages. I think
MauMau's original bind_textdomain_codeset() proposal was on the right track.
We would need to do that for every relevant 3rd-party message domain, though.
Ick. This suggests to me that gettext really needs an API for overriding the
default codeset pertaining to message domains not subjected to
bind_textdomain_codeset(). In the meantime, adding bind_textdomain_codeset()
calls for known localized dependencies seems like a fine coping mechanism.
If we can reasonably detect when gettext is supplying useless ????? messages,
that's good, too.
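
For illustration only, a minimal sketch of both ideas. Apart from
bind_textdomain_codeset() itself, every name here is hypothetical:
bind_domains_to_codeset() and looks_like_failed_translit() are not existing
PostgreSQL or gettext functions, and the caller would have to supply the
list of third-party domains. The detection rule (at least half of the
non-whitespace characters are question marks that the msgid did not
already contain) is a heuristic I picked for the example, not something
I have tested against real catalogs.

```c
#include <ctype.h>
#include <libintl.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Bind each domain in a NULL-terminated list to the given codeset.
 * Returns the number of domains for which bind_textdomain_codeset()
 * did not report failure.
 */
static int
bind_domains_to_codeset(const char *const *domains, const char *codeset)
{
    int         nbound = 0;

    for (; *domains != NULL; domains++)
    {
        if (bind_textdomain_codeset(*domains, codeset) != NULL)
            nbound++;
    }
    return nbound;
}

/* Count '?' and other non-whitespace characters in a string. */
static void
count_marks(const char *s, size_t *marks, size_t *other)
{
    *marks = 0;
    *other = 0;
    for (; *s != '\0'; s++)
    {
        if (isspace((unsigned char) *s))
            continue;
        if (*s == '?')
            (*marks)++;
        else
            (*other)++;
    }
}

/*
 * Heuristic (an assumption, not established practice): treat msgstr as a
 * failed transliteration if it differs from msgid and at least half of its
 * non-whitespace characters are '?' beyond those already in msgid.
 */
static bool
looks_like_failed_translit(const char *msgid, const char *msgstr)
{
    size_t      id_marks, id_other, marks, other, net;

    if (strcmp(msgid, msgstr) == 0)
        return false;           /* untranslated; nothing to judge */

    count_marks(msgid, &id_marks, &id_other);
    count_marks(msgstr, &marks, &other);

    net = (marks > id_marks) ? marks - id_marks : 0;
    return net > 0 && net >= other;
}
```

The binding call would run once at startup with the database encoding's
codeset name, after the usual bindtextdomain() setup. The heuristic would
run on the string gettext returns, falling back to the msgid when it fires.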
Thanks,
nm
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com