Thread: More message encoding woes

More message encoding woes

From
Heikki Linnakangas
Date:
latin1db=# SELECT version();
                                      version
-----------------------------------------------------------------------------------
 PostgreSQL 8.3.7 on i686-pc-linux-gnu, compiled by GCC gcc (Debian 4.3.3-5) 4.3.3
(1 row)

latin1db=# SELECT name, setting FROM pg_settings where name like 'lc%' 
OR name like '%encoding';
      name       | setting
-----------------+---------
 client_encoding | utf8
 lc_collate      | C
 lc_ctype        | C
 lc_messages     | es_ES
 lc_monetary     | C
 lc_numeric      | C
 lc_time         | C
 server_encoding | LATIN1
(8 rows)

latin1db=# SELECT * FROM foo;
ERROR:  no existe la relación «foo»

The accented characters are garbled. When I try the same with a database 
that's in UTF8 in the same cluster, it works:

utf8db=# SELECT name, setting FROM pg_settings where name like 'lc%' OR 
name like '%encoding';
      name       | setting
-----------------+---------
 client_encoding | UTF8
 lc_collate      | C
 lc_ctype        | C
 lc_messages     | es_ES
 lc_monetary     | C
 lc_numeric      | C
 lc_time         | C
 server_encoding | UTF8
(8 rows)

utf8db=# SELECT * FROM foo;
ERROR:  no existe la relación «foo»

What is happening is that gettext() returns the message in the encoding 
determined by LC_CTYPE, while we expect it to return it in the database 
encoding. Starting with PG 8.3 we enforce that the encoding specified in 
LC_CTYPE matches the database encoding, but not for the C locale.
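
A minimal sketch of how to see which codeset a given LC_CTYPE value implies, assuming a POSIX system that provides nl_langinfo(). The helper name codeset_for_ctype is made up for illustration and is not PostgreSQL code. With LC_CTYPE=C the reported name is typically an ASCII one (on glibc, ANSI_X3.4-1968), which is not the database encoding in the latin1db case above.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Copy the codeset name implied by the given LC_CTYPE value into buf.
 * Returns 0 on success, -1 if the locale is rejected or buf is too small.
 * This temporarily changes the process-wide LC_CTYPE and restores it.
 */
int
codeset_for_ctype(const char *ctype, char *buf, size_t buflen)
{
    char        saved[256];
    const char *cur;
    const char *codeset;
    int         rc = -1;

    /* Remember the current setting so it can be put back afterwards. */
    cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL || strlen(cur) >= sizeof(saved))
        return -1;
    strcpy(saved, cur);

    if (setlocale(LC_CTYPE, ctype) != NULL)
    {
        codeset = nl_langinfo(CODESET);
        if (codeset != NULL && strlen(codeset) < buflen)
        {
            strcpy(buf, codeset);
            rc = 0;
        }
    }
    setlocale(LC_CTYPE, saved);
    return rc;
}
```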

In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding() 
which fixes that, but we only do it on Windows. In earlier versions we 
called it on all platforms, but only for UTF-8. It seems that we should 
call bind_textdomain_codeset on all platforms and all encodings. 
However, there seems to be a reason why we only do it for Windows on CVS 
HEAD: we need a mapping from our encoding ID to the OS codeset name, and 
the OS codeset names vary.
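
To illustrate why one encoding ID needs several OS codeset names, here is a small stand-alone sketch. The enum, the table, and demo_system_name() are hypothetical stand-ins, and the spellings listed are only examples of the kind of variation seen across platforms, not a complete or authoritative list.

```c
#include <stddef.h>

/* Hypothetical encoding IDs, standing in for PostgreSQL's pg_enc values. */
enum demo_enc { DEMO_EUC_JP, DEMO_LATIN1, DEMO_UTF8 };

struct demo_match
{
    enum demo_enc enc;
    const char *system_name;
};

/* Several OS spellings can map to one encoding; the list is illustrative. */
static const struct demo_match demo_match_list[] = {
    {DEMO_EUC_JP, "EUC-JP"},
    {DEMO_EUC_JP, "eucJP"},
    {DEMO_EUC_JP, "IBM-eucJP"},
    {DEMO_LATIN1, "ISO-8859-1"},
    {DEMO_LATIN1, "ISO8859-1"},
    {DEMO_LATIN1, "LATIN1"},
    {DEMO_UTF8, "UTF-8"},
    {DEMO_UTF8, "utf8"},
    {(enum demo_enc) 0, NULL}   /* terminator: system_name == NULL */
};

/*
 * Return the n-th (0-based) OS spelling known for enc, or NULL when
 * fewer than n+1 spellings are listed.
 */
const char *
demo_system_name(enum demo_enc enc, int n)
{
    int i;

    for (i = 0; demo_match_list[i].system_name != NULL; i++)
    {
        if (demo_match_list[i].enc != enc)
            continue;
        if (n-- == 0)
            return demo_match_list[i].system_name;
    }
    return NULL;
}
```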

How can we make this more robust?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: More message encoding woes

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding() 
> which fixes that, but we only do it on Windows. In earlier versions we 
> called it on all platforms, but only for UTF-8. It seems that we should 
> call bind_textdomain_codeset on all platforms and all encodings. 

Yes, this problem has been recognized for some time.

> However, there seems to be a reason why we only do it for Windows on CVS 
> HEAD: we need a mapping from our encoding ID to the OS codeset name, and 
> the OS codeset names vary.

> How can we make this more robust?

One possibility is to assume that the output of nl_langinfo(CODESET)
will be recognized by bind_textdomain_codeset().  Whether that actually
works can only be determined by experiment.

Another idea is to try the values listed in our encoding_match_list[]
until bind_textdomain_codeset succeeds.  The problem here is that the
GNU documentation is *exceedingly* vague about whether
bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
when given a bad encoding name.  (I guess we could look at the source
code.)
        regards, tom lane


Re: More message encoding woes

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Another idea is to try the values listed in our encoding_match_list[]
> until bind_textdomain_codeset succeeds.  The problem here is that the
> GNU documentation is *exceedingly* vague about whether
> bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
> when given a bad encoding name.  (I guess we could look at the source
> code.)

Unfortunately it doesn't give any error. The value passed to it is just
stored, and isn't used until gettext(). Quick testing shows that if you
give an invalid encoding name, gettext will simply refrain from
translating anything and revert to English.

We could exploit that to determine if the codeset name we gave
bind_textdomain_codeset was valid: pick a string that is known to be
translated in all translations, like "syntax error", and see if
gettext("syntax error") returns the original string. Something along the
lines of:

const char *teststring = "syntax error";
const struct encoding_match *m;

for (m = encoding_match_list; m->system_enc_name; m++)
{
    if (m->pg_enc_code != GetDatabaseEncoding())
        continue;
    /* try this spelling of the database encoding */
    bind_textdomain_codeset("postgres", m->system_enc_name);
    if (gettext(teststring) != teststring)
        break; /* found! */
}


This feels rather hacky, but if we only do that with the combination of 
LC_CTYPE=C and LC_MESSAGES=other than C that we have a problem with, I 
think it would be ok. The current behavior is highly unlikely to give 
correct results, so I don't think we can do much worse than that.

Another possibility is to just refrain from translating anything if 
LC_CTYPE=C. If the above loop fails to find anything that works, that's 
what we should fall back to IMHO.
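
The probing idea, including the fall-back to untranslated messages, could be sketched as a stand-alone function like the one below. It assumes the behavior reported above (an unusable codeset name makes gettext return the original string) and uses only standard <libintl.h> calls. The function name and its parameters are hypothetical.

```c
#include <libintl.h>
#include <stddef.h>

/*
 * Try each candidate codeset name in turn (the array is NULL-terminated).
 * A name is taken to work if dgettext() returns something other than the
 * untranslated probe string.  Returns the first working name, or NULL if
 * none does, in which case the caller should leave messages untranslated.
 */
const char *
find_working_codeset(const char *domain, const char *probe,
                     const char *const *candidates)
{
    int i;

    for (i = 0; candidates[i] != NULL; i++)
    {
        if (bind_textdomain_codeset(domain, candidates[i]) == NULL)
            continue;               /* the name could not even be stored */
        if (dgettext(domain, probe) != probe)
            return candidates[i];   /* a translation came back */
    }
    return NULL;
}
```

A NULL result here does not distinguish "no usable codeset name" from "no catalog installed for this domain or language", so the probe is only meaningful once LC_MESSAGES selects a language that has a translation.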

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com



Re: More message encoding woes

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Another idea is to try the values listed in our encoding_match_list[]
>> until bind_textdomain_codeset succeeds.  The problem here is that the
>> GNU documentation is *exceedingly* vague about whether
>> bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
>> when given a bad encoding name.  (I guess we could look at the source
>> code.)

> Unfortunately it doesn't give any error.

(Man, why are the APIs in this problem space so universally awful?)

Where does it get the default codeset from?  Maybe we could constrain
that to match the database encoding, the way we do for LC_COLLATE/CTYPE?
        regards, tom lane


Re: More message encoding woes

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Where does it get the default codeset from?  Maybe we could constrain
> that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

LC_CTYPE. In 8.3 and up where we constrain that to match the database 
encoding, we only have a problem with the C locale.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: More message encoding woes

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Where does it get the default codeset from?  Maybe we could constrain
>> that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

> LC_CTYPE. In 8.3 and up where we constrain that to match the database 
> encoding, we only have a problem with the C locale.

... and even if we wanted to fiddle with it, that just moves the problem
over to finding an LC_CTYPE value that matches the database encoding
:-(.

Yup, it's a mess.  We'd have done this long ago if it were easy.

Could we get away with just unconditionally calling
bind_textdomain_codeset with *our* canonical spelling of the encoding
name?  If it works, great, and if it doesn't, you get English.
        regards, tom lane


Re: More message encoding woes

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> Where does it get the default codeset from?  Maybe we could constrain
>>> that to match the database encoding, the way we do for LC_COLLATE/CTYPE?
> 
>> LC_CTYPE. In 8.3 and up where we constrain that to match the database 
>> encoding, we only have a problem with the C locale.
> 
> ... and even if we wanted to fiddle with it, that just moves the problem
> over to finding an LC_CTYPE value that matches the database encoding
> :-(.
> 
> Yup, it's a mess.  We'd have done this long ago if it were easy.
> 
> Could we get away with just unconditionally calling
> bind_textdomain_codeset with *our* canonical spelling of the encoding
> name?  If it works, great, and if it doesn't, you get English.

Yeah, that's better than nothing.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: More message encoding woes

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Could we get away with just unconditionally calling
>> bind_textdomain_codeset with *our* canonical spelling of the encoding
>> name?  If it works, great, and if it doesn't, you get English.

> Yeah, that's better than nothing.

A quick look at the output of "iconv --list" on Fedora 10 and OSX 10.5.6
says that it would not work quite well enough.  The encoding names are
similar but not identical --- in particular I notice a lot of
discrepancies about dash versus underscore vs no separator at all.
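
To make the kind of discrepancy concrete, here is a small sketch of a comparison that treats names as equal when they differ only in case or separators. It is an illustration of the problem, not a description of what any C library does, and the function name is made up. Such loose matching still would not relate aliases like LATIN1 and ISO-8859-1.

```c
#include <ctype.h>

/*
 * Compare two codeset names, ignoring case and any character that is
 * not a letter or digit.  Returns 1 if they match under that rule,
 * 0 otherwise.
 */
int
codeset_names_match(const char *a, const char *b)
{
    for (;;)
    {
        /* Skip separators such as '-', '_' and spaces. */
        while (*a != '\0' && !isalnum((unsigned char) *a))
            a++;
        while (*b != '\0' && !isalnum((unsigned char) *b))
            b++;
        if (*a == '\0' || *b == '\0')
            return *a == *b;    /* both names must end together */
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
}
```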

What we need is an API equivalent to "iconv --list", but I'm not seeing
one :-(.  Do we need to go so far as to try to run that program?
Its output format is poorly standardized, among other problems ...
        regards, tom lane


Re: More message encoding woes

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> What we need is an API equivalent to "iconv --list", but I'm not seeing
> one :-(.

There's also "locale -m". Looking at the implementation of that, it just 
lists what's in /usr/share/i18n/charmaps. Not too portable either..

>  Do we need to go so far as to try to run that program?
> Its output format is poorly standardized, among other problems ...

And doing that at every backend startup is too slow.

I would be happy to just revert to English if the OS doesn't recognize 
the name we use for the encoding. What sucks about that most is that the 
user has no way to specify the right encoding name even if he knows it. 
I don't think we want to introduce a new GUC for that.

One idea is to extract the encoding from LC_MESSAGES. Then call 
pg_get_encoding_from_locale() on that and check that it matches 
server_encoding. If it does, great, pass it to 
bind_textdomain_codeset(). If it doesn't, throw an error.

It stretches the conventional meaning of LC_MESSAGES/LC_CTYPE a bit, since 
LC_CTYPE usually specifies the codeset to use, but I think it's quite 
intuitive.
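
As a rough illustration of "extracting the encoding from LC_MESSAGES", the sketch below pulls the codeset part out of a locale name of the usual language_TERRITORY.codeset@modifier form. It is purely string-based and hypothetical. It is not how pg_get_encoding_from_locale() works, and the extracted name would still have to be mapped to a server encoding.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the codeset part of a locale name of the form
 * language[_territory][.codeset][@modifier] into buf.
 * Returns 1 if a codeset was present and fit into buf, 0 otherwise.
 */
int
locale_codeset_part(const char *locale, char *buf, size_t buflen)
{
    const char *dot = strchr(locale, '.');
    size_t      len;

    if (dot == NULL)
        return 0;               /* e.g. "es_ES": no codeset given */
    dot++;
    len = strcspn(dot, "@");    /* stop at an optional @modifier */
    if (len == 0 || len >= buflen)
        return 0;
    memcpy(buf, dot, len);
    buf[len] = '\0';
    return 1;
}
```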

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: More message encoding woes

From
Zdenek Kotala
Date:
Tom Lane wrote on Mon, 30 Mar 2009 at 14:04 -0400:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > Tom Lane wrote:
> >> Could we get away with just unconditionally calling
> >> bind_textdomain_codeset with *our* canonical spelling of the encoding
> >> name?  If it works, great, and if it doesn't, you get English.
> 
> > Yeah, that's better than nothing.
> 
> A quick look at the output of "iconv --list" on Fedora 10 and OSX 10.5.6
> says that it would not work quite well enough.  The encoding names are
> similar but not identical --- in particular I notice a lot of
> discrepancies about dash versus underscore vs no separator at all.

The same problem occurs with collation when you try to restore a database
on a different OS. :(
Zdenek 



Re: More message encoding woes

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> One idea is to extract the encoding from LC_MESSAGES. Then call 
> pg_get_encoding_from_locale() on that and check that it matches 
> server_encoding. If it does, great, pass it to 
> bind_textdomain_codeset(). If it doesn't, throw an error.

I tried to implement this but it gets complicated. First of all, we can 
only throw an error when lc_messages is set interactively. If it's set 
in postgresql.conf, it might be valid for some databases but not for 
others with different encoding. And that makes per-user lc_messages 
setting quite hard too.

Another complication is what to do if e.g. plpgsql or a 3rd party module 
have called pg_bindtextdomain, when lc_messages=C and we don't yet know 
the system name for the database encoding, and you later set 
lc_messages='fi_FI.iso8859-1', in a latin1 database. In order to 
retroactively set the codeset, we'd have to remember all the calls to 
pg_bindtextdomain. Not impossible, for sure, but more work.
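
Remembering the domains could look roughly like the sketch below: a NULL-terminated, growable list of names that a later pass can walk to call bind_textdomain_codeset() on each entry. It uses plain malloc/realloc rather than backend memory contexts, and all names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

static char **demo_domains = NULL;  /* NULL-terminated once allocated */

/*
 * Remember a text domain name.  Returns 1 if newly added, 0 if it was
 * already present, -1 on allocation failure.
 */
int
demo_register_domain(const char *name)
{
    size_t  n = 0;
    char  **grown;
    char   *copy;

    if (demo_domains != NULL)
    {
        for (; demo_domains[n] != NULL; n++)
        {
            if (strcmp(demo_domains[n], name) == 0)
                return 0;       /* already registered */
        }
    }

    copy = malloc(strlen(name) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, name);

    /* Room for the existing n entries, the new one, and the terminator. */
    grown = realloc(demo_domains, (n + 2) * sizeof(char *));
    if (grown == NULL)
    {
        free(copy);
        return -1;
    }
    grown[n] = copy;
    grown[n + 1] = NULL;
    demo_domains = grown;
    return 1;
}

/* Number of domains remembered so far. */
size_t
demo_domain_count(void)
{
    size_t n = 0;

    while (demo_domains != NULL && demo_domains[n] != NULL)
        n++;
    return n;
}
```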

I'm leaning towards the idea of trying out all the spellings of the 
database encoding we have in encoding_match_list. That gives the best 
user experience, as it just works, and it doesn't seem that complicated.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: More message encoding woes

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I'm leaning towards the idea of trying out all the spellings of the 
> database encoding we have in encoding_match_list. That gives the best 
> user experience, as it just works, and it doesn't seem that complicated.

How were you going to check --- use that idea of translating a string
that's known to have a translation?  OK, but you'd better document
somewhere where translators will read it "you must translate this string
first of all".  Maybe use a special string "Translate Me First" that
doesn't actually need to be end-user-visible, just so no one sweats over
getting it right in context.  (I can see "syntax error" being
problematic in some translations, since translators will know it is
always just a fragment of a larger message ...)
        regards, tom lane


Re: More message encoding woes

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> I'm leaning towards the idea of trying out all the spellings of the 
>> database encoding we have in encoding_match_list. That gives the best 
>> user experience, as it just works, and it doesn't seem that complicated.
> 
> How were you going to check --- use that idea of translating a string
> that's known to have a translation?  OK, but you'd better document
> somewhere where translators will read it "you must translate this string
> first of all".  Maybe use a special string "Translate Me First" that
> doesn't actually need to be end-user-visible, just so no one sweats over
> getting it right in context.

Yep, something like that. There seems to be a magic empty string 
translation at the beginning of every po file that returns the 
meta-information about the translation, like translation author and 
date. Assuming that works reliably, I'll use that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: More message encoding woes

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Maybe use a special string "Translate Me First" that
>> doesn't actually need to be end-user-visible, just so no one sweats over
>> getting it right in context.

> Yep, something like that. There seems to be a magic empty string 
> translation at the beginning of every po file that returns the 
> meta-information about the translation, like translation author and 
> date. Assuming that works reliably, I'll use that.

At first that sounded like an ideal answer, but I can see a gotcha:
suppose the translation's author's name contains some characters that
don't convert to the database encoding.  I suppose that would result in
failure, when we'd prefer it not to.  A single-purpose string could be
documented as "whatever you translate this to should be pure ASCII,
never mind if it's sensible".
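
The "pure ASCII" requirement on such a probe string is easy to check mechanically, for example in a translation sanity check. A small sketch, with a hypothetical function name:

```c
/* Return 1 if every byte of s is 7-bit ASCII, 0 otherwise. */
int
is_pure_ascii(const char *s)
{
    for (; *s != '\0'; s++)
    {
        if ((unsigned char) *s > 0x7F)
            return 0;
    }
    return 1;
}
```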
        regards, tom lane


Re: More message encoding woes

From
Alvaro Herrera
Date:
Tom Lane wrote:

> At first that sounded like an ideal answer, but I can see a gotcha:
> suppose the translation's author's name contains some characters that
> don't convert to the database encoding.  I suppose that would result in
> failure, when we'd prefer it not to.  A single-purpose string could be
> documented as "whatever you translate this to should be pure ASCII,
> never mind if it's sensible".

One problem with this idea is that it may be hard to coerce gettext into
putting a particular string at the top of the file :-(

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: More message encoding woes

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> At first that sounded like an ideal answer, but I can see a gotcha:
>> suppose the translation's author's name contains some characters that
>> don't convert to the database encoding.  I suppose that would result in
>> failure, when we'd prefer it not to.  A single-purpose string could be
>> documented as "whatever you translate this to should be pure ASCII,
>> never mind if it's sensible".

> One problem with this idea is that it may be hard to coerce gettext into
> putting a particular string at the top of the file :-(

I doubt we can, which is why the documentation needs to tell translators
about it.
        regards, tom lane


Re: More message encoding woes

From
Peter Eisentraut
Date:
On Monday 30 March 2009 21:04:00 Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > Tom Lane wrote:
> >> Could we get away with just unconditionally calling
> >> bind_textdomain_codeset with *our* canonical spelling of the encoding
> >> name?  If it works, great, and if it doesn't, you get English.
> >
> > Yeah, that's better than nothing.
>
> A quick look at the output of "iconv --list" on Fedora 10 and OSX 10.5.6
> says that it would not work quite well enough.  The encoding names are
> similar but not identical --- in particular I notice a lot of
> discrepancies about dash versus underscore vs no separator at all.

I seem to recall that the encoding names are normalized by the C library 
somewhere, but I can't find the documentation now.  It might be worth trying 
anyway -- the above might not in fact be a problem.



Re: More message encoding woes

From
Peter Eisentraut
Date:
On Monday 30 March 2009 20:06:48 Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Where does it get the default codeset from?  Maybe we could constrain
> > that to match the database encoding, the way we do for LC_COLLATE/CTYPE?
>
> LC_CTYPE. In 8.3 and up where we constrain that to match the database
> encoding, we only have a problem with the C locale.

Why don't we apply the same restriction to the C locale then?


Re: More message encoding woes

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On Monday 30 March 2009 20:06:48 Heikki Linnakangas wrote:
>> LC_CTYPE. In 8.3 and up where we constrain that to match the database
>> encoding, we only have a problem with the C locale.

> Why don't we apply the same restriction to the C locale then?

(1) what would you constrain it to?

(2) historically we've allowed C locale to be used with any encoding,
and there are a *lot* of users depending on that (particularly in the
Far East, I gather).
        regards, tom lane


Re: More message encoding woes

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> Maybe use a special string "Translate Me First" that
>>> doesn't actually need to be end-user-visible, just so no one sweats over
>>> getting it right in context.
>
>> Yep, something like that. There seems to be a magic empty string
>> translation at the beginning of every po file that returns the
>> meta-information about the translation, like translation author and
>> date. Assuming that works reliably, I'll use that.
>
> At first that sounded like an ideal answer, but I can see a gotcha:
> suppose the translation's author's name contains some characters that
> don't convert to the database encoding.  I suppose that would result in
> failure, when we'd prefer it not to.  A single-purpose string could be
> documented as "whatever you translate this to should be pure ASCII,
> never mind if it's sensible".

I just tried that, and it seems that gettext() does transliteration, so
any characters that have no counterpart in the database encoding will be
replaced with something similar, or question marks. Assuming that's
universal across platforms, and I think it is, using the empty string
should work.

It also means that you can use lc_messages='ja' with
server_encoding='latin1', but it will be unreadable because all the
non-ascii characters are replaced with question marks. For something
like lc_messages='es_ES' and server_encoding='koi8-r', it will still
look quite nice.

Attached is a patch I've been testing. Seems to work quite well. It
would be nice if someone could test it on Windows, which seems to be a
bit special in this regard.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 118a6fe..390a7cf 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -290,6 +290,7 @@ locale_messages_assign(const char *value, bool doit, GucSource source)
         if (!pg_perm_setlocale(LC_MESSAGES, value))
             if (source != PGC_S_DEFAULT)
                 return NULL;
+        pg_init_gettext_codeset();
     }
 #ifndef WIN32
     else
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 03d86ca..47ebe1b 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -1242,7 +1242,7 @@ pg_bindtextdomain(const char *domain)

         get_locale_path(my_exec_path, locale_path);
         bindtextdomain(domain, locale_path);
-        pg_bind_textdomain_codeset(domain, GetDatabaseEncoding());
+        pg_register_textdomain(domain);
     }
 #endif
 }
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index bf66321..970cb83 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -842,46 +842,6 @@ cliplen(const char *str, int len, int limit)
     return l;
 }

-#if defined(ENABLE_NLS) && defined(WIN32)
-static const struct codeset_map {
-    int    encoding;
-    const char *codeset;
-} codeset_map_array[] = {
-    {PG_UTF8, "UTF-8"},
-    {PG_LATIN1, "LATIN1"},
-    {PG_LATIN2, "LATIN2"},
-    {PG_LATIN3, "LATIN3"},
-    {PG_LATIN4, "LATIN4"},
-    {PG_ISO_8859_5, "ISO-8859-5"},
-    {PG_ISO_8859_6, "ISO_8859-6"},
-    {PG_ISO_8859_7, "ISO-8859-7"},
-    {PG_ISO_8859_8, "ISO-8859-8"},
-    {PG_LATIN5, "LATIN5"},
-    {PG_LATIN6, "LATIN6"},
-    {PG_LATIN7, "LATIN7"},
-    {PG_LATIN8, "LATIN8"},
-    {PG_LATIN9, "LATIN-9"},
-    {PG_LATIN10, "LATIN10"},
-    {PG_KOI8R, "KOI8-R"},
-    {PG_WIN1250, "CP1250"},
-    {PG_WIN1251, "CP1251"},
-    {PG_WIN1252, "CP1252"},
-    {PG_WIN1253, "CP1253"},
-    {PG_WIN1254, "CP1254"},
-    {PG_WIN1255, "CP1255"},
-    {PG_WIN1256, "CP1256"},
-    {PG_WIN1257, "CP1257"},
-    {PG_WIN1258, "CP1258"},
-    {PG_WIN866, "CP866"},
-    {PG_WIN874, "CP874"},
-    {PG_EUC_CN, "EUC-CN"},
-    {PG_EUC_JP, "EUC-JP"},
-    {PG_EUC_KR, "EUC-KR"},
-    {PG_EUC_TW, "EUC-TW"},
-    {PG_EUC_JIS_2004, "EUC-JP"}
-};
-#endif /* WIN32 */
-
 void
 SetDatabaseEncoding(int encoding)
 {
@@ -892,28 +852,132 @@ SetDatabaseEncoding(int encoding)
     Assert(DatabaseEncoding->encoding == encoding);

 #ifdef ENABLE_NLS
-    pg_bind_textdomain_codeset(textdomain(NULL), encoding);
+    pg_init_gettext_codeset();
+    pg_register_textdomain(textdomain(NULL));
 #endif
 }

+static char **registered_textdomains = NULL;
+static const char *system_codeset = "invalid";
+
 /*
- * On Windows, we need to explicitly bind gettext to the correct
- * encoding, because gettext() tends to get confused.
+ * Register a gettext textdomain with the backend. We will call
+ * bind_textdomain_codeset() for it to ensure that translated strings
+ * are returned in the right encoding.
  */
 void
-pg_bind_textdomain_codeset(const char *domainname, int encoding)
+pg_register_textdomain(const char *domainname)
 {
-#if defined(ENABLE_NLS) && defined(WIN32)
+#if defined(ENABLE_NLS)
     int     i;
+    MemoryContext old_cxt;
+
+    old_cxt = MemoryContextSwitchTo(TopMemoryContext);
+    if (registered_textdomains == NULL)
+    {
+        registered_textdomains = palloc(sizeof(char *) * 1);
+        registered_textdomains[0] = NULL;
+    }

-    for (i = 0; i < lengthof(codeset_map_array); i++)
+    for (i = 0; registered_textdomains[i] != NULL; i++)
     {
-        if (codeset_map_array[i].encoding == encoding)
+        /* Ignore if already bound */
+        if (strcmp(registered_textdomains[i], domainname) == 0)
+            return;
+    }
+    registered_textdomains = repalloc(registered_textdomains,
+                                      (i + 2) * sizeof(char *));
+    registered_textdomains[i] = pstrdup(domainname);
+    registered_textdomains[i + 1] = NULL;
+
+    MemoryContextSwitchTo(old_cxt);
+
+    if (GetDatabaseEncoding() != PG_SQL_ASCII)
+    {
+        if (bind_textdomain_codeset(domainname,    system_codeset) == NULL)
+            elog(LOG, "bind_textdomain_codeset failed");
+    }
+#endif
+}
+
+/*
+ * Set the codeset used for strings returned by gettext() to match the
+ * database encoding.
+ *
+ * In theory this should only depend on the database encoding, but because
+ * of the way we use gettext() to find the corresponding OS codeset name, we
+ * also need LC_MESSAGES to be set correctly for this to work. Because of
+ * that, pg_init_gettext_codeset() should be called after any changes to
+ * LC_MESSAGES.
+ */
+void
+pg_init_gettext_codeset(void)
+{
+#if defined(ENABLE_NLS)
+    int        i;
+
+    /*
+     * SQL_ASCII encoding is special. In that case we do nothing, and let
+     * gettext() pick the codeset from LC_CTYPE.
+     */
+    if (GetDatabaseEncoding() == PG_SQL_ASCII)
+        return;
+
+    /*
+     * Find a codeset name for the database encoding that
+     * bind_textdomain_codeset() recognizes.
+     *
+     * Unfortunately there's no handy interface to list all the codesets
+     * in the system. 'locale -m' or 'iconv --list' do that, but we don't
+     * want to call external programs here. So we try every alias for the
+     * encoding that we know until we find one that works.
+     *
+     * Unfortunately bind_textdomain_codeset() doesn't return any error code
+     * when given an invalid codeset name, so we have to work a bit harder
+     * to check if a codeset name works. We call gettext("") after
+     * bind_textdomain_codeset(), and check that it returned a translated
+     * string other than "". Empty string is a special value in .po files
+     * that is present in all translations: it translates into a string with
+     * meta-information about the translation, like author and creation date.
+     */
+    system_codeset = NULL;
+    for (i = 0; encoding_match_list[i].system_enc_name; i++)
+    {
+        if (encoding_match_list[i].pg_enc_code != GetDatabaseEncoding())
+            continue;
+
+        if (bind_textdomain_codeset(textdomain(NULL),
+                        encoding_match_list[i].system_enc_name) != NULL)
         {
-            if (bind_textdomain_codeset(domainname,
-                                        codeset_map_array[i].codeset) == NULL)
+            const char *str = gettext("");
+            if (strcmp(str, "") != 0)
+            {
+                /* great, it worked */
+                system_codeset = encoding_match_list[i].system_enc_name;
+                break;
+            }
+        }
+    }
+
+    if (system_codeset == NULL)
+    {
+        elog(DEBUG1, "failed to find a system codeset name for encoding \"%s\"",
+             GetDatabaseEncodingName());
+        system_codeset = "invalid";
+    }
+
+    /*
+     * Bind all textdomains in use to the new codeset. This is done even if
+     * no valid codeset name was found, to force gettext() to revert to
+     * ascii English.
+     */
+    if (registered_textdomains != NULL)
+    {
+        for (i = 0; registered_textdomains[i] != NULL; i++)
+        {
+            if (bind_textdomain_codeset(registered_textdomains[i],
+                                        system_codeset) == NULL)
                 elog(LOG, "bind_textdomain_codeset failed");
-            break;
         }
     }
 #endif
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 76322c9..8fcfa52 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -392,7 +392,8 @@ extern const char *pg_get_client_encoding_name(void);
 extern void SetDatabaseEncoding(int encoding);
 extern int    GetDatabaseEncoding(void);
 extern const char *GetDatabaseEncodingName(void);
-extern void pg_bind_textdomain_codeset(const char *domainname, int encoding);
+extern void pg_register_textdomain(const char *domainname);
+extern void pg_init_gettext_codeset(void);

 extern int    pg_valid_client_encoding(const char *name);
 extern int    pg_valid_server_encoding(const char *name);
diff --git a/src/include/port.h b/src/include/port.h
index 0557dd2..cbd72bd 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -422,6 +422,14 @@ extern void qsort_arg(void *base, size_t nel, size_t elsize,
           qsort_arg_comparator cmp, void *arg);

 /* port/chklocale.c */
+
+struct encoding_match
+{
+    int pg_enc_code;
+    const char *system_enc_name;
+};
+extern const struct encoding_match encoding_match_list[];
+
 extern int    pg_get_encoding_from_locale(const char *ctype);

 #endif   /* PG_PORT_H */
diff --git a/src/port/chklocale.c b/src/port/chklocale.c
index 78410df..4469e89 100644
--- a/src/port/chklocale.c
+++ b/src/port/chklocale.c
@@ -35,15 +35,12 @@
  * numbers (CPnnn).
  *
  * Note that we search the table with pg_strcasecmp(), so variant
- * capitalizations don't need their own entries.
+ * capitalizations don't need their own entries. XXX: Now that we also
+ * use this to map from pg encoding code to system name, do we need to
+ * include different capitalizations?
  */
-struct encoding_match
-{
-    enum pg_enc pg_enc_code;
-    const char *system_enc_name;
-};

-static const struct encoding_match encoding_match_list[] = {
+const struct encoding_match encoding_match_list[] = {
     {PG_EUC_JP, "EUC-JP"},
     {PG_EUC_JP, "eucJP"},
     {PG_EUC_JP, "IBM-eucJP"},

Re: More message encoding woes

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:

> > One problem with this idea is that it may be hard to coerce gettext into
> > putting a particular string at the top of the file :-(
> 
> I doubt we can, which is why the documentation needs to tell translators
> about it.

I doubt that documenting the issue will be enough (in fact I'm pretty
sure it won't).  Maybe we can just supply the string translated in our
POT files, and add a comment that the translator is not supposed to
touch it.  This doesn't seem all that difficult -- I think it just
requires that we add a msgmerge step to "make update-po" that uses a
file on which the message has already been translated.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: More message encoding woes

From
Hiroshi Inoue
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>> Tom Lane wrote:
>>>> Maybe use a special string "Translate Me First" that
>>>> doesn't actually need to be end-user-visible, just so no one sweats 
>>>> over
>>>> getting it right in context.
>>
>>> Yep, something like that. There seems to be a magic empty string 
>>> translation at the beginning of every po file that returns the 
>>> meta-information about the translation, like translation author and 
>>> date. Assuming that works reliably, I'll use that.
>>
>> At first that sounded like an ideal answer, but I can see a gotcha:
>> suppose the translation's author's name contains some characters that
>> don't convert to the database encoding.  I suppose that would result in
>> failure, when we'd prefer it not to.  A single-purpose string could be
>> documented as "whatever you translate this to should be pure ASCII,
>> never mind if it's sensible".
> 
> I just tried that, and it seems that gettext() does transliteration, so 
> any characters that have no counterpart in the database encoding will be 
> replaced with something similar, or question marks. Assuming that's
> universal across platforms, and I think it is, using the empty string 
> should work.
> 
> It also means that you can use lc_messages='ja' with 
> server_encoding='latin1', but it will be unreadable because all the 
> non-ascii characters are replaced with question marks.

It doesn't occur in the current Windows environment. As for Windows
gnu gettext which we are using, we would see the original msgid when
iconv can't convert the msgstr to the target codeset.

set client_encoding to utf_8;
SET
show server_encoding;
 server_encoding
-----------------
 LATIN1
(1 row)

show lc_messages;
    lc_messages
--------------------
 Japanese_Japan.932
(1 row)

1;
ERROR:  syntax error at or near "1"
LINE 1: 1;


OTOH when the server encoding is utf8 then

set client_encoding to utf_8;
SET
show server_encoding;
 server_encoding
-----------------
 UTF8
(1 row)

show lc_messages;
    lc_messages
--------------------
 Japanese_Japan.932
(1 row)

1;
ERROR:  "1"またはその近辺で構文エラー
LINE 1: 1;
        ^


Re: More message encoding woes

From
Tom Lane
Date:
Hiroshi Inoue <inoue@tpf.co.jp> writes:
> Heikki Linnakangas wrote:
>> I just tried that, and it seems that gettext() does transliteration, so 
>> any characters that have no counterpart in the database encoding will be 
>> replaced with something similar, or question marks.

> It doesn't occur in the current Windows environment. As for Windows
> gnu gettext which we are using, we would see the original msgid when
> iconv can't convert the msgstr to the target codeset.

Well, if iconv has no conversion to the codeset at all then there is no
point in selecting that particular codeset setting anyway.  The question
was about whether we can distinguish "no conversion available" from
"conversion available, but the test string has some unconvertible
characters".
        regards, tom lane
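
The two cases distinguished above can be told apart at the iconv level, at least in principle. The following is only a sketch under stated assumptions: the function name, the status enum and the fixed 256-byte output buffer are hypothetical, and how iconv() treats a valid but unconvertible character is implementation-defined (some implementations fail with EILSEQ, others substitute a character and report a count of non-reversible conversions).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for the probe below. */
typedef enum
{
    CONV_EXACT,         /* conversion exists and the text converted cleanly */
    CONV_UNAVAILABLE,   /* iconv_open() knows no such conversion */
    CONV_TEXT_PROBLEM   /* conversion exists, but this text did not convert cleanly */
} conv_status;

conv_status
probe_conversion(const char *from, const char *to, const char *text)
{
    iconv_t cd;
    char    outbuf[256];    /* sketch only: longer texts fail with E2BIG */
    char   *in;
    char   *out = outbuf;
    size_t  inleft;
    size_t  outleft = sizeof(outbuf);
    size_t  rc;

    assert(from != NULL && to != NULL && text != NULL);

    /* Case 1: no conversion between the two codesets at all. */
    cd = iconv_open(to, from);
    if (cd == (iconv_t) -1)
        return CONV_UNAVAILABLE;

    /* iconv() takes a char ** for the input but does not modify the bytes. */
    in = (char *) text;
    inleft = strlen(text);
    rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);

    /*
     * Case 2: the conversion exists, but iconv() either failed on this
     * text or reported non-reversible (substituted) characters.
     */
    if (rc == (size_t) -1 || rc > 0)
        return CONV_TEXT_PROBLEM;

    return CONV_EXACT;
}
```

Calling probe_conversion("UTF-8", "NO-SUCH-CODESET", "abc") should report CONV_UNAVAILABLE, while a translated string containing kanji probed against a Latin-1 target would fall into CONV_TEXT_PROBLEM on implementations that do not silently substitute.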


Re: More message encoding woes

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> Hiroshi Inoue <inoue@tpf.co.jp> writes:
>> Heikki Linnakangas wrote:
>>> I just tried that, and it seems that gettext() does transliteration, so 
>>> any characters that have no counterpart in the database encoding will be 
>>> replaced with something similar, or question marks.
> 
>> It doesn't occur in the current Windows environment. As for Windows
>> gnu gettext which we are using, we would see the original msgid when
>> iconv can't convert the msgstr to the target codeset.
> 
> Well, if iconv has no conversion to the codeset at all then there is no
> point in selecting that particular codeset setting anyway.  The question
> was about whether we can distinguish "no conversion available" from
> "conversion available, but the test string has some unconvertible
> characters".

What I meant is that we would see no '?' when we use Windows gnu gettext.
Whether a conversion is available depends on the individual msgid.
For example, when the Japanese msgstr corresponding to a msgid happens
to contain only ASCII characters, Windows gnu gettext will use the
msgstr, not the original msgid.


Re: More message encoding woes

From
Hiroshi Inoue
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>> Tom Lane wrote:
>>>> Maybe use a special string "Translate Me First" that
>>>> doesn't actually need to be end-user-visible, just so no one sweats 
>>>> over
>>>> getting it right in context.
>>
>>> Yep, something like that. There seems to be a magic empty string 
>>> translation at the beginning of every po file that returns the 
>>> meta-information about the translation, like translation author and 
>>> date. Assuming that works reliably, I'll use that.
>>
>> At first that sounded like an ideal answer, but I can see a gotcha:
>> suppose the translation's author's name contains some characters that
>> don't convert to the database encoding.  I suppose that would result in
>> failure, when we'd prefer it not to.  A single-purpose string could be
>> documented as "whatever you translate this to should be pure ASCII,
>> never mind if it's sensible".
> 
> I just tried that, and it seems that gettext() does transliteration, so 
> any characters that have no counterpart in the database encoding will be 
> replaced with something similar, or question marks. Assuming that's 
> universal across platforms, and I think it is, using the empty string 
> should work.
> 
> It also means that you can use lc_messages='ja' with 
> server_encoding='latin1', but it will be unreadable because all the 
> non-ascii characters are replaced with question marks. For something 
> like lc_messages='es_ES' and server_encoding='koi8-r', it will still 
> look quite nice.
> 
> Attached is a patch I've been testing. Seems to work quite well. It 
> would be nice if someone could test it on Windows, which seems to be a 
> bit special in this regard.

Unfortunately it doesn't seem to work on Windows.

First any combination of valid lc_messages and non-existent encoding
passes the test  strcmp(gettext(""), "") != 0 .
Second, for example, the combination of ja (lc_messages) and ISO-8859-1
passes the test, but the test fails after I changed the last_translator
part of the ja message catalog to contain Japanese kanji characters.

regards,
Hiroshi Inoue
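
For readers following along, the empty-string test discussed here can be sketched roughly as below. This is not the patch under test: the function name is hypothetical, and dgettext() is used instead of gettext() only so that the sketch does not depend on the current default text domain. It assumes a libintl that provides bind_textdomain_codeset().

```c
#include <assert.h>
#include <libintl.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical probe: returns 1 if the catalog for "domain" yields a
 * non-empty header entry after binding it to "codeset", 0 otherwise.
 */
int
catalog_header_available(const char *domain, const char *codeset)
{
    assert(domain != NULL && codeset != NULL);

    /* Ask gettext to return messages for this domain in "codeset". */
    if (bind_textdomain_codeset(domain, codeset) == NULL)
        return 0;

    /*
     * The msgid "" maps to the catalog's header entry (translator, date,
     * charset).  If no translation is found, the msgid itself, i.e. the
     * empty string, comes back.
     */
    return strcmp(dgettext(domain, ""), "") != 0;
}
```

As reported above, on Windows this kind of test passes even for a non-existent encoding, so it cannot be treated as a portable validity check.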


Re: More message encoding woes

From
Peter Eisentraut
Date:
On Monday 30 March 2009 15:52:37 Heikki Linnakangas wrote:
> What is happening is that gettext() returns the message in the encoding
> determined by LC_CTYPE, while we expect it to return it in the database
> encoding. Starting with PG 8.3 we enforce that the encoding specified in
> LC_CTYPE matches the database encoding, but not for the C locale.
>
> In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding()
> which fixes that, but we only do it on Windows. In earlier versions we
> called it on all platforms, but only for UTF-8. It seems that we should
> call bind_textdomain_codeset on all platforms and all encodings.
> However, there seems to be a reason why we only do it for Windows on CVS
> HEAD: we need a mapping from our encoding ID to the OS codeset name, and
> the OS codeset names vary.
>
> How can we make this more robust?

Another approach might be to create a new configuration parameter that 
basically tells what encoding to call bind_textdomain_codeset() with, say 
server_encoding_for_gettext.  If that is not set, you just use server_encoding 
as is and hope that gettext() takes it (which it would in most cases, I 
guess).
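
A minimal sketch of that fallback, assuming the setting is available as a C string that is NULL or empty when unset. The function name is hypothetical, and the parameter name server_encoding_for_gettext is only the one floated above, not an existing setting.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: choose the codeset name to hand to
 * bind_textdomain_codeset().  "override" stands for the proposed
 * server_encoding_for_gettext setting.
 */
const char *
codeset_for_gettext(const char *override, const char *server_encoding)
{
    assert(server_encoding != NULL);

    /* An explicit setting wins. */
    if (override != NULL && strlen(override) > 0)
        return override;

    /* Otherwise pass the server encoding name as is and hope gettext takes it. */
    return server_encoding;
}
```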


Re: More message encoding woes

From
Hiroshi Inoue
Date:
Hiroshi Inoue wrote:
> Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>>> Tom Lane wrote:
>>>>> Maybe use a special string "Translate Me First" that
>>>>> doesn't actually need to be end-user-visible, just so no one sweats 
>>>>> over
>>>>> getting it right in context.
>>>
>>>> Yep, something like that. There seems to be a magic empty string 
>>>> translation at the beginning of every po file that returns the 
>>>> meta-information about the translation, like translation author and 
>>>> date. Assuming that works reliably, I'll use that.
>>>
>>> At first that sounded like an ideal answer, but I can see a gotcha:
>>> suppose the translation's author's name contains some characters that
>>> don't convert to the database encoding.  I suppose that would result in
>>> failure, when we'd prefer it not to.  A single-purpose string could be
>>> documented as "whatever you translate this to should be pure ASCII,
>>> never mind if it's sensible".
>>
>> I just tried that, and it seems that gettext() does transliteration, 
>> so any characters that have no counterpart in the database encoding 
>> will be replaced with something similar, or question marks. Assuming 
>> that's universal across platforms, and I think it is, using the empty 
>> string should work.
>>
>> It also means that you can use lc_messages='ja' with 
>> server_encoding='latin1', but it will be unreadable because all the 
>> non-ascii characters are replaced with question marks. For something 
>> like lc_messages='es_ES' and server_encoding='koi8-r', it will still 
>> look quite nice.
>>
>> Attached is a patch I've been testing. Seems to work quite well. It 
>> would be nice if someone could test it on Windows, which seems to be a 
>> bit special in this regard.
> 
> Unfortunately it doesn't seem to work on Windows.

Is it inappropriate to call iconv_open() to check if the codeset is valid for bind_textdomain_codeset()?

regards,
Hiroshi Inoue



Re: More message encoding woes

From
Peter Eisentraut
Date:
On Monday 30 March 2009 15:52:37 Heikki Linnakangas wrote:
> In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding()
> which fixes that, but we only do it on Windows. In earlier versions we
> called it on all platforms, but only for UTF-8. It seems that we should
> call bind_textdomain_codeset on all platforms and all encodings.
> However, there seems to be a reason why we only do it for Windows on CVS
> HEAD: we need a mapping from our encoding ID to the OS codeset name, and
> the OS codeset names vary.

In practice you get either the GNU or the Solaris version of gettext, and at 
least the GNU version can cope with all the encoding names that the currently 
Windows-only code path produces.  So enabling the Windows code path for all 
platforms when ENABLE_NLS is on and LC_CTYPE is C would appear to work in 
sufficiently many cases.


Re: More message encoding woes

From
Heikki Linnakangas
Date:
Peter Eisentraut wrote:
> In practice you get either the GNU or the Solaris version of gettext, and at 
> least the GNU version can cope with all the encoding names that the currently 
> Windows-only code path produces. 

It doesn't. On my laptop running Debian testing:

hlinnaka@heikkilaptop:~$ LC_ALL=fi_FI.UTF-8 gettext
gettext: ei riittävästi argumentteja
hlinnaka@heikkilaptop:~$ LC_ALL=fi_FI.LATIN1 gettext
gettext: missing arguments
hlinnaka@heikkilaptop:~$ LC_ALL=fi_FI.ISO-8859-1 gettext
gettext: ei riitt�v�sti argumentteja

Using the name for the latin1 encoding in the currently Windows-only 
mapping table, "LATIN1", you get no translation because that name is not 
recognized by the system. Using the other name "ISO-8859-1", it works. 
"LATIN1" is not listed in the output of locale -m either.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: More message encoding woes

From
Peter Eisentraut
Date:
On Tuesday 07 April 2009 11:21:25 Heikki Linnakangas wrote:
> Peter Eisentraut wrote:
> > In practice you get either the GNU or the Solaris version of gettext, and
> > at least the GNU version can cope with all the encoding names that the
> > currently Windows-only code path produces.
>
> It doesn't. On my laptop running Debian testing:
>
> hlinnaka@heikkilaptop:~$ LC_ALL=fi_FI.UTF-8 gettext
> gettext: ei riittävästi argumentteja
> hlinnaka@heikkilaptop:~$ LC_ALL=fi_FI.LATIN1 gettext
> gettext: missing arguments

That is because no locale by the name fi_FI.LATIN1 exists.

> hlinnaka@heikkilaptop:~$ LC_ALL=fi_FI.ISO-8859-1 gettext
> gettext: ei riitt�v�sti argumentteja
>
> Using the name for the latin1 encoding in the currently Windows-only
> mapping table, "LATIN1", you get no translation because that name is not
> recognized by the system. Using the other name "ISO-8859-1", it works.
> "LATIN1" is not listed in the output of locale -m either.

You are looking in the wrong place.  What we need is for iconv to recognize
the encoding name used by PostgreSQL.  iconv --list is the primary hint for
that.

The locale names provided by the operating system are arbitrary and unrelated.


Re: More message encoding woes

From
Heikki Linnakangas
Date:
Hiroshi Inoue wrote:
> Heikki Linnakangas wrote:
>> I just tried that, and it seems that gettext() does transliteration, 
>> so any characters that have no counterpart in the database encoding 
>> will be replaced with something similar, or question marks. Assuming 
>> that's universal across platforms, and I think it is, using the empty 
>> string should work.
>>
>> It also means that you can use lc_messages='ja' with 
>> server_encoding='latin1', but it will be unreadable because all the 
>> non-ascii characters are replaced with question marks. For something 
>> like lc_messages='es_ES' and server_encoding='koi8-r', it will still 
>> look quite nice.
>>
>> Attached is a patch I've been testing. Seems to work quite well. It
>> would be nice if someone could test it on Windows, which seems to be a 
>> bit special in this regard.
> 
> Unfortunately it doesn't seem to work on Windows.
> 
> First any combination of valid lc_messages and non-existent encoding
> passes the test  strcmp(gettext(""), "") != 0 .

Now that's strange. Can you check what gettext("") returns in that case 
then?

> Second, for example, the combination of ja (lc_messages) and ISO-8859-1
> passes the test, but the test fails after I changed the last_translator
> part of the ja message catalog to contain Japanese kanji characters.

Yeah, the inconsistency is not nice. In practice, though, if you try to 
use an encoding that can't represent kanji characters with Japanese, 
you're better off falling back to English than displaying strings full 
of question marks. The same goes for all other languages as well, IMHO. 
If you're going to fall back to English for some translations (and in 
practice "some" is a pretty high percentage) because the encoding is 
missing a character and transliteration is not working, you might as 
well not bother translating at all.

If we add the dummy translations to all .po files, we could force 
fallback-to-English in situations like that by including some or all of 
the non-ASCII characters used in the language in the dummy translation.

I'm thinking of going ahead with this approach, without the dummy 
translation, after we have resolved the first issue on Windows. We can 
add the dummy translations later if needed, but I don't think anyone 
will care.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: More message encoding woes

From
Heikki Linnakangas
Date:
Peter Eisentraut wrote:
> On Tuesday 07 April 2009 11:21:25 Heikki Linnakangas wrote:
>> Using the name for the latin1 encoding in the currently Windows-only
>> mapping table, "LATIN1", you get no translation because that name is not
>> recognized by the system. Using the other name "ISO-8859-1", it works.
>> "LATIN1" is not listed in the output of locale -m either.
>
> You are looking in the wrong place.  What we need is for iconv to recognize
> the encoding name used by PostgreSQL.  iconv --list is the primary hint for
> that.
>
> The locale names provided by the operating system are arbitrary and unrelated.

Oh, ok. I guess we can do the simple fix you proposed then.

Patch attached. Instead of checking for LC_CTYPE == C, I'm checking
"pg_get_encoding_from_locale(NULL) == encoding" which is closer to
what we actually want. The downside is that
pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is
that we don't need to keep this in sync with the rules we have in CREATE
DATABASE that enforce that locale matches encoding.

This doesn't include the cleanup to make the mapping table easier to
maintain that Magnus was going to have a look at before I started this
thread.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/utils/mb/mbutils.c
--- b/src/backend/utils/mb/mbutils.c
***************
*** 890,896 **** cliplen(const char *str, int len, int limit)
      return l;
  }

! #if defined(ENABLE_NLS) && defined(WIN32)
  static const struct codeset_map {
      int    encoding;
      const char *codeset;
--- 890,896 ----
      return l;
  }

! #if defined(ENABLE_NLS)
  static const struct codeset_map {
      int    encoding;
      const char *codeset;
***************
*** 929,935 **** static const struct codeset_map {
      {PG_EUC_TW, "EUC-TW"},
      {PG_EUC_JIS_2004, "EUC-JP"}
  };
! #endif /* WIN32 */

  void
  SetDatabaseEncoding(int encoding)
--- 929,935 ----
      {PG_EUC_TW, "EUC-TW"},
      {PG_EUC_JIS_2004, "EUC-JP"}
  };
! #endif /* ENABLE_NLS */

  void
  SetDatabaseEncoding(int encoding)
***************
*** 946,960 **** SetDatabaseEncoding(int encoding)
  }

  /*
!  * On Windows, we need to explicitly bind gettext to the correct
!  * encoding, because gettext() tends to get confused.
   */
  void
  pg_bind_textdomain_codeset(const char *domainname, int encoding)
  {
! #if defined(ENABLE_NLS) && defined(WIN32)
      int     i;

      for (i = 0; i < lengthof(codeset_map_array); i++)
      {
          if (codeset_map_array[i].encoding == encoding)
--- 946,975 ----
  }

  /*
!  * Bind gettext to the correct encoding.
   */
  void
  pg_bind_textdomain_codeset(const char *domainname, int encoding)
  {
! #if defined(ENABLE_NLS)
      int     i;

+     /*
+      * gettext() uses the encoding specified by LC_CTYPE by default,
+      * so if that matches the database encoding, we don't need to do
+      * anything. This is not for performance, but because if
+      * bind_textdomain_codeset() doesn't recognize the codeset name we
+      * pass it, it will fall back to English and we don't want that to
+      * happen unnecessarily.
+      *
+      * On Windows, though, gettext() tends to get confused so we always
+      * bind it.
+      */
+ #ifndef WIN32
+     if (pg_get_encoding_from_locale(NULL) == encoding)
+         return;
+ #endif
+
      for (i = 0; i < lengthof(codeset_map_array); i++)
      {
          if (codeset_map_array[i].encoding == encoding)
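
The control flow of the patched function can be sketched in isolation as follows. This is a simplified stand-in, not the patch itself: the enum values and the demo_ names are hypothetical, the table holds only three illustrative entries (the two visible in the diff plus the "LATIN1" name mentioned earlier in the thread), and the locale's encoding is passed in as a parameter instead of being obtained from pg_get_encoding_from_locale().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for PostgreSQL's encoding IDs. */
enum demo_encoding
{
    DEMO_LATIN1,
    DEMO_EUC_TW,
    DEMO_EUC_JIS_2004,
    DEMO_UNMAPPED
};

static const struct demo_codeset_map
{
    int         encoding;
    const char *codeset;
} demo_codeset_map_array[] = {
    {DEMO_LATIN1, "LATIN1"},
    {DEMO_EUC_TW, "EUC-TW"},
    {DEMO_EUC_JIS_2004, "EUC-JP"}
};

#define demo_lengthof(array) (sizeof(array) / sizeof((array)[0]))

/*
 * Returns the codeset name that would be passed to
 * bind_textdomain_codeset(), or NULL if gettext should be left alone.
 */
const char *
demo_codeset_to_bind(int locale_encoding, int db_encoding, int on_windows)
{
    size_t i;

    /* Off Windows, do nothing if the locale already implies the db encoding. */
    if (!on_windows && locale_encoding == db_encoding)
        return NULL;

    /* Otherwise look up the codeset name for the database encoding. */
    for (i = 0; i < demo_lengthof(demo_codeset_map_array); i++)
    {
        if (demo_codeset_map_array[i].encoding == db_encoding)
            return demo_codeset_map_array[i].codeset;
    }

    return NULL;                /* no mapping known: leave gettext alone */
}
```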

Re: More message encoding woes

From
Hiroshi Inoue
Date:
Heikki Linnakangas wrote:
> Hiroshi Inoue wrote:
>> Heikki Linnakangas wrote:
>>> I just tried that, and it seems that gettext() does transliteration, 
>>> so any characters that have no counterpart in the database encoding 
>>> will be replaced with something similar, or question marks. Assuming 
>>> that's universal across platforms, and I think it is, using the empty 
>>> string should work.
>>>
>>> It also means that you can use lc_messages='ja' with 
>>> server_encoding='latin1', but it will be unreadable because all the 
>>> non-ascii characters are replaced with question marks. For something 
>>> like lc_messages='es_ES' and server_encoding='koi8-r', it will still 
>>> look quite nice.
>>>
>>> Attached is a patch I've been testing. Seems to work quite well. It
>>> would be nice if someone could test it on Windows, which seems to be 
>>> a bit special in this regard.
>>
>> Unfortunately it doesn't seem to work on Windows.
>>
>> First any combination of valid lc_messages and non-existent encoding
>> passes the test  strcmp(gettext(""), "") != 0 .
> 
> Now that's strange. Can you check what gettext("") returns in that case 
> then?

A translated but unconverted string. I'm not sure whether it's a bug or not.
I can find no description of what should be returned in such a case.

>> Second, for example, the combination of ja (lc_messages) and ISO-8859-1
>> passes the test, but the test fails after I changed the last_translator
>> part of the ja message catalog to contain Japanese kanji characters.
> 
> Yeah, the inconsistency is not nice. In practice, though, if you try to 
> use an encoding that can't represent kanji characters with Japanese, 
> you're better off falling back to English than displaying strings full 
> of question marks. The same goes for all other languages as well, IMHO. 
> If you're going to fall back to English for some translations (and in 
> practice "some" is a pretty high percentage) because the encoding is 
> missing a character and transliteration is not working, you might as 
> well not bother translating at all.

What is wrong with checking if the codeset is valid using iconv_open()?

regards,
Hiroshi Inoue



Re: More message encoding woes

From
Heikki Linnakangas
Date:
Hiroshi Inoue wrote:
> What is wrong with checking if the codeset is valid using iconv_open()?

That would probably work as well. We'd have to decide what we'd try to 
convert from with iconv_open(). Utf-8 might be a safe choice. We don't 
currently use iconv_open() anywhere in the backend, though, so I'm 
hesitant to add a dependency for this. GNU gettext() uses iconv, but I'm 
not sure if that's true for all gettext() implementations.

Peter's suggestion seems the best ATM, though.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: More message encoding woes

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Hiroshi Inoue wrote:
>> What is wrong with checking if the codeset is valid using iconv_open()?

> That would probably work as well. We'd have to decide what we'd try to 
> convert from with iconv_open().

The problem I have with that is that you are now guessing at *two*
platform-specific encoding names not one, plus hoping there is a
conversion between the two.

If we knew the encoding name embedded in the .mo file we wanted to use,
then it would be sensible to try to use that as the source codeset.

> GNU gettext() uses iconv, but I'm 
> not sure if that's true for all gettext() implementations.

Yeah, that's another problem.
        regards, tom lane


Re: More message encoding woes

From
Peter Eisentraut
Date:
On Tuesday 07 April 2009 13:09:42 Heikki Linnakangas wrote:
> Patch attached. Instead of checking for LC_CTYPE == C, I'm checking
> "pg_get_encoding_from_locale(NULL) == encoding" which is closer to
> what we actually want. The downside is that
> pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is
> that we don't need to keep this in sync with the rules we have in CREATE
> DATABASE that enforce that locale matches encoding.

I would have figured we can skip this whole thing when LC_CTYPE != C, because 
it should be guaranteed that LC_CTYPE matches the database encoding in this 
case, no?

Other than that, I think this patch is good.



Re: More message encoding woes

From
Heikki Linnakangas
Date:
Peter Eisentraut wrote:
> On Tuesday 07 April 2009 13:09:42 Heikki Linnakangas wrote:
>> Patch attached. Instead of checking for LC_CTYPE == C, I'm checking
>> "pg_get_encoding_from_locale(NULL) == encoding" which is closer to
>> what we actually want. The downside is that
>> pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is
>> that we don't need to keep this in sync with the rules we have in CREATE
>> DATABASE that enforce that locale matches encoding.
> 
> I would have figured we can skip this whole thing when LC_CTYPE != C, because 
> it should be guaranteed that LC_CTYPE matches the database encoding in this 
> case, no?

Yes, except if pg_get_encoding_from_locale() couldn't figure out what PG 
encoding LC_CTYPE corresponds to. We let CREATE DATABASE go ahead in
that case, trusting that the user knows what he's doing. I suppose we 
can extend that trust to this case too, and assume that the encoding of 
LC_CTYPE actually matches the database encoding.

Or if the encoding is UTF-8 and you're running on Windows, although on 
Windows we want to always call bind_textdomain_codeset(). Or if the 
database encoding is SQL_ASCII, although in that case we don't want to 
call bind_textdomain_codeset() either.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com
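
Putting the cases above together, the decision could be sketched as a pure function. This is an illustration of the rule as discussed, not the committed code: the function name and parameters are hypothetical, checking SQL_ASCII before the Windows case is an assumption, and so is treating "POSIX" the same as "C". In real code the locale name would presumably come from setlocale(LC_CTYPE, NULL).

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical decision helper: should bind_textdomain_codeset() be
 * called for this database?
 */
int
should_bind_codeset(const char *lc_ctype, int db_is_sql_ascii, int on_windows)
{
    assert(lc_ctype != NULL);

    /* SQL_ASCII: there is no meaningful codeset to bind to. */
    if (db_is_sql_ascii)
        return 0;

    /* On Windows, always bind. */
    if (on_windows)
        return 1;

    /*
     * Elsewhere, a non-C LC_CTYPE is trusted to match the database
     * encoding already, so only the C locale needs an explicit binding.
     */
    return strcmp(lc_ctype, "C") == 0 || strcmp(lc_ctype, "POSIX") == 0;
}
```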


Re: More message encoding woes

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Hiroshi Inoue wrote:
>>> What is wrong with checking if the codeset is valid using iconv_open()?
> 
>> That would probably work as well. We'd have to decide what we'd try to 
>> convert from with iconv_open().
> 
> The problem I have with that is that you are now guessing at *two*
> platform-specific encoding names not one, plus hoping there is a
> conversion between the two.

AFAIK iconv_open() supports all combinations of the valid encoding
values. Or we may be able to check it using the same encoding for
both from and to.

regards,
Hiroshi Inoue
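
A rough sketch of the second idea, using the same name for both source and target so that only one platform-specific name is involved. The function name is hypothetical, and whether every iconv implementation accepts an identity conversion is an assumption.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical check: returns 1 if iconv_open() accepts "codeset" as both
 * source and target, 0 otherwise.
 */
int
codeset_known_to_iconv(const char *codeset)
{
    iconv_t cd;

    assert(codeset != NULL);

    cd = iconv_open(codeset, codeset);
    if (cd == (iconv_t) -1)
        return 0;               /* name not recognized (or out of resources) */

    iconv_close(cd);
    return 1;
}
```

Note that this only says something about iconv; as pointed out elsewhere in the thread, not every gettext() implementation necessarily uses iconv.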



Re: More message encoding woes

From
Heikki Linnakangas
Date:
Peter Eisentraut wrote:
> On Tuesday 07 April 2009 13:09:42 Heikki Linnakangas wrote:
>> Patch attached. Instead of checking for LC_CTYPE == C, I'm checking
>> "pg_get_encoding_from_locale(NULL) == encoding" which is closer to
>> what we actually want. The downside is that
>> pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is
>> that we don't need to keep this in sync with the rules we have in CREATE
>> DATABASE that enforce that locale matches encoding.
> 
> I would have figured we can skip this whole thing when LC_CTYPE != C, because 
> it should be guaranteed that LC_CTYPE matches the database encoding in this 
> case, no?

Ok, committed it like that after all.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com