Thread: More message encoding woes
latin1db=# SELECT version();
                                      version
-----------------------------------------------------------------------------------
 PostgreSQL 8.3.7 on i686-pc-linux-gnu,compiled by GCC gcc (Debian 4.3.3-5) 4.3.3
(1 row)

latin1db=# SELECT name, setting FROM pg_settings where name like 'lc%' OR name like '%encoding';
      name       | setting
-----------------+---------
 client_encoding | utf8
 lc_collate      | C
 lc_ctype        | C
 lc_messages     | es_ES
 lc_monetary     | C
 lc_numeric      | C
 lc_time         | C
 server_encoding | LATIN1
(8 rows)

latin1db=# SELECT * FROM foo;
ERROR: no existe la relación «foo»

The accented characters are garbled. When I try the same with a database
that's in UTF8 in the same cluster, it works:

utf8db=# SELECT name, setting FROM pg_settings where name like 'lc%' OR name like '%encoding';
      name       | setting
-----------------+---------
 client_encoding | UTF8
 lc_collate      | C
 lc_ctype        | C
 lc_messages     | es_ES
 lc_monetary     | C
 lc_numeric      | C
 lc_time         | C
 server_encoding | UTF8
(8 rows)

utf8db=# SELECT * FROM foo;
ERROR: no existe la relación «foo»

What is happening is that gettext() returns the message in the encoding
determined by LC_CTYPE, while we expect it to return it in the database
encoding. Starting with PG 8.3 we enforce that the encoding specified in
LC_CTYPE matches the database encoding, but not for the C locale.

In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding()
which fixes that, but we only do it on Windows. In earlier versions we
called it on all platforms, but only for UTF-8. It seems that we should
call bind_textdomain_codeset on all platforms and all encodings.
However, there seems to be a reason why we only do it for Windows on CVS
HEAD: we need a mapping from our encoding ID to the OS codeset name, and
the OS codeset names vary.

How can we make this more robust?

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding()
> which fixes that, but we only do it on Windows. In earlier versions we
> called it on all platforms, but only for UTF-8. It seems that we should
> call bind_textdomain_codeset on all platforms and all encodings.

Yes, this problem has been recognized for some time.

> However, there seems to be a reason why we only do it for Windows on CVS
> HEAD: we need a mapping from our encoding ID to the OS codeset name, and
> the OS codeset names vary.

> How can we make this more robust?

One possibility is to assume that the output of nl_langinfo(CODESET)
will be recognized by bind_textdomain_codeset(). Whether that actually
works can only be determined by experiment.

Another idea is to try the values listed in our encoding_match_list[]
until bind_textdomain_codeset succeeds. The problem here is that the
GNU documentation is *exceedingly* vague about whether
bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
when given a bad encoding name. (I guess we could look at the source
code.)

			regards, tom lane

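For readers unfamiliar with nl_langinfo(CODESET), here is a minimal sketch of how the first suggestion could be explored. The helper name codeset_for_ctype is made up for illustration and is not part of PostgreSQL; setlocale() and nl_langinfo() are the standard POSIX calls. Whether the name it returns is one that bind_textdomain_codeset() understands is exactly the open question above, so this only shows how to obtain the name.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the codeset name the C library reports for
 * the given LC_CTYPE locale, or NULL if that locale cannot be selected.
 *
 * The returned pointer belongs to the C library and may be overwritten by
 * later setlocale() or nl_langinfo() calls.  As a side effect this changes
 * the process-wide LC_CTYPE setting.
 */
const char *
codeset_for_ctype(const char *locale_name)
{
	/* setlocale(cat, NULL) only queries, so insist on a real name. */
	assert(locale_name != NULL);

	if (setlocale(LC_CTYPE, locale_name) == NULL)
		return NULL;
	return nl_langinfo(CODESET);
}
```

Calling codeset_for_ctype("C") yields whatever the platform calls its default codeset; the spelling differs between C libraries, which is the same portability problem in another guise.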
Tom Lane wrote:
> Another idea is to try the values listed in our encoding_match_list[]
> until bind_textdomain_codeset succeeds. The problem here is that the
> GNU documentation is *exceedingly* vague about whether
> bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
> when given a bad encoding name. (I guess we could look at the source
> code.)

Unfortunately it doesn't give any error. The value passed to it is just
stored, and isn't used until gettext(). Quick testing shows that if you
give an invalid encoding name, gettext will simply refrain from
translating anything and revert to English.

We could exploit that to determine if the codeset name we gave
bind_textdomain_codeset was valid: pick a string that is known to be
translated in all translations, like "syntax error", and see if
gettext("syntax error") returns the original string. Something along the
lines of:

const char *teststring = "syntax error";
const struct encoding_match *m;

for (m = encoding_match_list; m->system_enc_name; m++)
{
	if (m->pg_enc_code != GetDatabaseEncoding())
		continue;
	/* try this spelling of the database encoding */
	bind_textdomain_codeset("postgres", m->system_enc_name);
	/* gettext() hands back the msgid pointer itself if untranslated */
	if (gettext(teststring) != teststring)
		break; /* found! */
}

This feels rather hacky, but if we only do that with the combination of
LC_CTYPE=C and LC_MESSAGES=other than C that we have a problem with, I
think it would be ok. The current behavior is highly unlikely to give
correct results, so I don't think we can do much worse than that.

Another possibility is to just refrain from translating anything if
LC_CTYPE=C. If the above loop fails to find anything that works, that's
what we should fall back to IMHO.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

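The observation that the name is "just stored" can be checked with a few lines of C. This is a sketch under the assumption of GNU gettext as shipped with glibc; other libintl implementations may behave differently. The helper name codeset_was_stored and the domain names used with it are invented for illustration.

```c
#include <assert.h>
#include <libintl.h>
#include <string.h>

/*
 * Hypothetical helper: bind `domain` to `codeset` and report whether the
 * library stored the name.  Returns 1 if it did, 0 otherwise.  A result of
 * 1 says nothing about whether the codeset actually exists.
 */
int
codeset_was_stored(const char *domain, const char *codeset)
{
	const char *current;

	assert(domain != NULL && codeset != NULL);

	if (bind_textdomain_codeset(domain, codeset) == NULL)
		return 0;

	/* Passing NULL queries the codeset currently bound to the domain. */
	current = bind_textdomain_codeset(domain, NULL);
	return current != NULL && strcmp(current, codeset) == 0;
}
```

With glibc, codeset_was_stored("demo-domain", "no-such-codeset") reports success just as a valid name does, which is why the return value of bind_textdomain_codeset() cannot be used to validate a spelling.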
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Another idea is to try the values listed in our encoding_match_list[]
>> until bind_textdomain_codeset succeeds. The problem here is that the
>> GNU documentation is *exceedingly* vague about whether
>> bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
>> when given a bad encoding name. (I guess we could look at the source
>> code.)

> Unfortunately it doesn't give any error.

(Man, why are the APIs in this problem space so universally awful?)

Where does it get the default codeset from? Maybe we could constrain
that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

			regards, tom lane

Tom Lane wrote:
> Where does it get the default codeset from? Maybe we could constrain
> that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

LC_CTYPE. In 8.3 and up where we constrain that to match the database
encoding, we only have a problem with the C locale.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Where does it get the default codeset from? Maybe we could constrain
>> that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

> LC_CTYPE. In 8.3 and up where we constrain that to match the database
> encoding, we only have a problem with the C locale.

... and even if we wanted to fiddle with it, that just moves the problem
over to finding an LC_CTYPE value that matches the database encoding
:-(.

Yup, it's a mess. We'd have done this long ago if it were easy.

Could we get away with just unconditionally calling
bind_textdomain_codeset with *our* canonical spelling of the encoding
name? If it works, great, and if it doesn't, you get English.

			regards, tom lane

Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> Where does it get the default codeset from? Maybe we could constrain
>>> that to match the database encoding, the way we do for LC_COLLATE/CTYPE?
>
>> LC_CTYPE. In 8.3 and up where we constrain that to match the database
>> encoding, we only have a problem with the C locale.
>
> ... and even if we wanted to fiddle with it, that just moves the problem
> over to finding an LC_CTYPE value that matches the database encoding
> :-(.
>
> Yup, it's a mess. We'd have done this long ago if it were easy.
>
> Could we get away with just unconditionally calling
> bind_textdomain_codeset with *our* canonical spelling of the encoding
> name? If it works, great, and if it doesn't, you get English.

Yeah, that's better than nothing.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Could we get away with just unconditionally calling
>> bind_textdomain_codeset with *our* canonical spelling of the encoding
>> name? If it works, great, and if it doesn't, you get English.

> Yeah, that's better than nothing.

A quick look at the output of "iconv --list" on Fedora 10 and OSX 10.5.6
says that it would not work quite well enough. The encoding names are
similar but not identical --- in particular I notice a lot of
discrepancies about dash versus underscore vs no separator at all.

What we need is an API equivalent to "iconv --list", but I'm not seeing
one :-(. Do we need to go so far as to try to run that program?
Its output format is poorly standardized, among other problems ...

			regards, tom lane

Tom Lane wrote:
> What we need is an API equivalent to "iconv --list", but I'm not seeing
> one :-(.

There's also "locale -m". Looking at the implementation of that, it just
lists what's in /usr/share/i18n/charmaps. Not too portable either.

> Do we need to go so far as to try to run that program?
> Its output format is poorly standardized, among other problems ...

And doing that at every backend startup is too slow.

I would be happy to just revert to English if the OS doesn't recognize
the name we use for the encoding. What sucks about that most is that the
user has no way to specify the right encoding name even if he knows it.
I don't think we want to introduce a new GUC for that.

One idea is to extract the encoding from LC_MESSAGES. Then call
pg_get_encoding_from_locale() on that and check that it matches
server_encoding. If it does, great, pass it to
bind_textdomain_codeset(). If it doesn't, throw an error. It stretches
the conventional meaning of LC_MESSAGES/LC_CTYPE a bit, since LC_CTYPE
usually specifies the codeset to use, but I think it's quite intuitive.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

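The "extract the encoding from LC_MESSAGES" step could look roughly like the sketch below, assuming the usual language[_territory][.codeset][@modifier] shape of locale names. The helper codeset_from_locale_name is invented for illustration; it only splits the string and does not validate the codeset or consult pg_get_encoding_from_locale().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the codeset part of a locale name such as
 * "fi_FI.iso8859-1" or "de_DE.UTF-8@euro" into buf.
 *
 * Returns buf, or NULL if the name has no codeset part or buf is too small.
 */
char *
codeset_from_locale_name(const char *locale_name, char *buf, size_t buflen)
{
	const char *dot;
	size_t		len;

	assert(locale_name != NULL && buf != NULL);

	dot = strchr(locale_name, '.');
	if (dot == NULL)
		return NULL;			/* e.g. "es_ES": no codeset given */
	dot++;

	/* The codeset ends at an optional "@modifier" or the end of string. */
	len = strcspn(dot, "@");
	if (len == 0 || len >= buflen)
		return NULL;

	memcpy(buf, dot, len);
	buf[len] = '\0';
	return buf;
}
```

A setting like lc_messages='es_ES' from the original report carries no codeset at all, so under this scheme the user would have to spell it out, for example as 'es_ES.iso8859-1'.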
Tom Lane wrote on Mon, 30 Mar 2009 at 14:04 -0400:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > Tom Lane wrote:
> >> Could we get away with just unconditionally calling
> >> bind_textdomain_codeset with *our* canonical spelling of the encoding
> >> name? If it works, great, and if it doesn't, you get English.
>
> > Yeah, that's better than nothing.
>
> A quick look at the output of "iconv --list" on Fedora 10 and OSX 10.5.6
> says that it would not work quite well enough. The encoding names are
> similar but not identical --- in particular I notice a lot of
> discrepancies about dash versus underscore vs no separator at all.

The same problem exists with collation when you try to restore a
database on a different OS. :(

	Zdenek

Heikki Linnakangas wrote:
> One idea is to extract the encoding from LC_MESSAGES. Then call
> pg_get_encoding_from_locale() on that and check that it matches
> server_encoding. If it does, great, pass it to
> bind_textdomain_codeset(). If it doesn't, throw an error.

I tried to implement this but it gets complicated. First of all, we can
only throw an error when lc_messages is set interactively. If it's set
in postgresql.conf, it might be valid for some databases but not for
others with a different encoding. And that makes per-user lc_messages
setting quite hard too.

Another complication is what to do if e.g. plpgsql or a 3rd party module
has called pg_bindtextdomain, when lc_messages=C and we don't yet know
the system name for the database encoding, and you later set
lc_messages='fi_FI.iso8859-1', in a latin1 database. In order to
retroactively set the codeset, we'd have to remember all the calls to
pg_bindtextdomain. Not impossible, for sure, but more work.

I'm leaning towards the idea of trying out all the spellings of the
database encoding we have in encoding_match_list. That gives the best
user experience, as it just works, and it doesn't seem that complicated.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I'm leaning towards the idea of trying out all the spellings of the
> database encoding we have in encoding_match_list. That gives the best
> user experience, as it just works, and it doesn't seem that complicated.

How were you going to check --- use that idea of translating a string
that's known to have a translation? OK, but you'd better document
somewhere where translators will read it "you must translate this string
first of all". Maybe use a special string "Translate Me First" that
doesn't actually need to be end-user-visible, just so no one sweats over
getting it right in context. (I can see "syntax error" being
problematic in some translations, since translators will know it is
always just a fragment of a larger message ...)

			regards, tom lane

Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> I'm leaning towards the idea of trying out all the spellings of the
>> database encoding we have in encoding_match_list. That gives the best
>> user experience, as it just works, and it doesn't seem that complicated.
>
> How were you going to check --- use that idea of translating a string
> that's known to have a translation? OK, but you'd better document
> somewhere where translators will read it "you must translate this string
> first of all". Maybe use a special string "Translate Me First" that
> doesn't actually need to be end-user-visible, just so no one sweats over
> getting it right in context.

Yep, something like that. There seems to be a magic empty string
translation at the beginning of every po file that returns the
meta-information about the translation, like translation author and
date. Assuming that works reliably, I'll use that.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

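For context on what the "magic empty string" carries: in a GNU gettext catalog the entry for msgid "" is a block of header lines, normally including a Content-Type line with a charset= parameter. The sketch below pulls that value out of such a header. The helper charset_from_po_header is hypothetical and only an illustration of the header's shape; the approach discussed in this thread merely checks whether gettext("") returns something other than "", it does not parse the header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: extract the value following "charset=" from a PO
 * header entry (the text returned by gettext("") when a catalog is found).
 *
 * Returns buf, or NULL if no charset is present or buf is too small.
 */
char *
charset_from_po_header(const char *header, char *buf, size_t buflen)
{
	static const char key[] = "charset=";
	const char *p;
	size_t		len;

	assert(header != NULL && buf != NULL);

	p = strstr(header, key);
	if (p == NULL)
		return NULL;
	p += sizeof(key) - 1;

	/* The value runs up to whitespace, a semicolon or the end of line. */
	len = strcspn(p, " \t\r\n;");
	if (len == 0 || len >= buflen)
		return NULL;

	memcpy(buf, p, len);
	buf[len] = '\0';
	return buf;
}
```

An untranslated gettext("") returns the empty string, for which this helper returns NULL, so "header present" and "translation active" coincide in the simple case.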
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Maybe use a special string "Translate Me First" that
>> doesn't actually need to be end-user-visible, just so no one sweats over
>> getting it right in context.

> Yep, something like that. There seems to be a magic empty string
> translation at the beginning of every po file that returns the
> meta-information about the translation, like translation author and
> date. Assuming that works reliably, I'll use that.

At first that sounded like an ideal answer, but I can see a gotcha:
suppose the translation's author's name contains some characters that
don't convert to the database encoding. I suppose that would result in
failure, when we'd prefer it not to. A single-purpose string could be
documented as "whatever you translate this to should be pure ASCII,
never mind if it's sensible".

			regards, tom lane

Tom Lane wrote:
> At first that sounded like an ideal answer, but I can see a gotcha:
> suppose the translation's author's name contains some characters that
> don't convert to the database encoding. I suppose that would result in
> failure, when we'd prefer it not to. A single-purpose string could be
> documented as "whatever you translate this to should be pure ASCII,
> never mind if it's sensible".

One problem with this idea is that it may be hard to coerce gettext into
putting a particular string at the top of the file :-(

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> At first that sounded like an ideal answer, but I can see a gotcha:
>> suppose the translation's author's name contains some characters that
>> don't convert to the database encoding. I suppose that would result in
>> failure, when we'd prefer it not to. A single-purpose string could be
>> documented as "whatever you translate this to should be pure ASCII,
>> never mind if it's sensible".

> One problem with this idea is that it may be hard to coerce gettext into
> putting a particular string at the top of the file :-(

I doubt we can, which is why the documentation needs to tell translators
about it.

			regards, tom lane

On Monday 30 March 2009 21:04:00 Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > Tom Lane wrote:
> >> Could we get away with just unconditionally calling
> >> bind_textdomain_codeset with *our* canonical spelling of the encoding
> >> name? If it works, great, and if it doesn't, you get English.
> >
> > Yeah, that's better than nothing.
>
> A quick look at the output of "iconv --list" on Fedora 10 and OSX 10.5.6
> says that it would not work quite well enough. The encoding names are
> similar but not identical --- in particular I notice a lot of
> discrepancies about dash versus underscore vs no separator at all.

I seem to recall that the encoding names are normalized by the C library
somewhere, but I can't find the documentation now. It might be worth
trying anyway -- the above might not in fact be a problem.

On Monday 30 March 2009 20:06:48 Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Where does it get the default codeset from? Maybe we could constrain
> > that to match the database encoding, the way we do for LC_COLLATE/CTYPE?
>
> LC_CTYPE. In 8.3 and up where we constrain that to match the database
> encoding, we only have a problem with the C locale.

Why don't we apply the same restriction to the C locale then?

Peter Eisentraut <peter_e@gmx.net> writes:
> On Monday 30 March 2009 20:06:48 Heikki Linnakangas wrote:
>> LC_CTYPE. In 8.3 and up where we constrain that to match the database
>> encoding, we only have a problem with the C locale.

> Why don't we apply the same restriction to the C locale then?

(1) what would you constrain it to?

(2) historically we've allowed C locale to be used with any encoding,
and there are a *lot* of users depending on that (particularly in the
Far East, I gather).

			regards, tom lane

Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> Maybe use a special string "Translate Me First" that
>>> doesn't actually need to be end-user-visible, just so no one sweats over
>>> getting it right in context.
>
>> Yep, something like that. There seems to be a magic empty string
>> translation at the beginning of every po file that returns the
>> meta-information about the translation, like translation author and
>> date. Assuming that works reliably, I'll use that.
>
> At first that sounded like an ideal answer, but I can see a gotcha:
> suppose the translation's author's name contains some characters that
> don't convert to the database encoding. I suppose that would result in
> failure, when we'd prefer it not to. A single-purpose string could be
> documented as "whatever you translate this to should be pure ASCII,
> never mind if it's sensible".

I just tried that, and it seems that gettext() does transliteration, so
any characters that have no counterpart in the database encoding will be
replaced with something similar, or question marks. Assuming that's
universal across platforms, and I think it is, using the empty string
should work.

It also means that you can use lc_messages='ja' with
server_encoding='latin1', but it will be unreadable because all the
non-ascii characters are replaced with question marks. For something
like lc_messages='es_ES' and server_encoding='koi8-r', it will still
look quite nice.

Attached is a patch I've been testing. Seems to work quite well. It
would be nice if someone could test it on Windows, which seems to be a
bit special in this regard.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 118a6fe..390a7cf 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -290,6 +290,7 @@ locale_messages_assign(const char *value, bool doit, GucSource source)
 			if (!pg_perm_setlocale(LC_MESSAGES, value))
 				if (source != PGC_S_DEFAULT)
 					return NULL;
+			pg_init_gettext_codeset();
 		}
 #ifndef WIN32
 		else
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 03d86ca..47ebe1b 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -1242,7 +1242,7 @@ pg_bindtextdomain(const char *domain)
 
 		get_locale_path(my_exec_path, locale_path);
 		bindtextdomain(domain, locale_path);
-		pg_bind_textdomain_codeset(domain, GetDatabaseEncoding());
+		pg_register_textdomain(domain);
 	}
 #endif
 }
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index bf66321..970cb83 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -842,46 +842,6 @@ cliplen(const char *str, int len, int limit)
 	return l;
 }
 
-#if defined(ENABLE_NLS) && defined(WIN32)
-static const struct codeset_map {
-	int	encoding;
-	const char *codeset;
-} codeset_map_array[] = {
-	{PG_UTF8, "UTF-8"},
-	{PG_LATIN1, "LATIN1"},
-	{PG_LATIN2, "LATIN2"},
-	{PG_LATIN3, "LATIN3"},
-	{PG_LATIN4, "LATIN4"},
-	{PG_ISO_8859_5, "ISO-8859-5"},
-	{PG_ISO_8859_6, "ISO_8859-6"},
-	{PG_ISO_8859_7, "ISO-8859-7"},
-	{PG_ISO_8859_8, "ISO-8859-8"},
-	{PG_LATIN5, "LATIN5"},
-	{PG_LATIN6, "LATIN6"},
-	{PG_LATIN7, "LATIN7"},
-	{PG_LATIN8, "LATIN8"},
-	{PG_LATIN9, "LATIN-9"},
-	{PG_LATIN10, "LATIN10"},
-	{PG_KOI8R, "KOI8-R"},
-	{PG_WIN1250, "CP1250"},
-	{PG_WIN1251, "CP1251"},
-	{PG_WIN1252, "CP1252"},
-	{PG_WIN1253, "CP1253"},
-	{PG_WIN1254, "CP1254"},
-	{PG_WIN1255, "CP1255"},
-	{PG_WIN1256, "CP1256"},
-	{PG_WIN1257, "CP1257"},
-	{PG_WIN1258, "CP1258"},
-	{PG_WIN866, "CP866"},
-	{PG_WIN874, "CP874"},
-	{PG_EUC_CN, "EUC-CN"},
-	{PG_EUC_JP, "EUC-JP"},
-	{PG_EUC_KR, "EUC-KR"},
-	{PG_EUC_TW, "EUC-TW"},
-	{PG_EUC_JIS_2004, "EUC-JP"}
-};
-#endif /* WIN32 */
-
 void
 SetDatabaseEncoding(int encoding)
 {
@@ -892,28 +852,132 @@ SetDatabaseEncoding(int encoding)
 	Assert(DatabaseEncoding->encoding == encoding);
 
 #ifdef ENABLE_NLS
-	pg_bind_textdomain_codeset(textdomain(NULL), encoding);
+	pg_init_gettext_codeset();
+	pg_register_textdomain(textdomain(NULL));
 #endif
 }
 
+static char **registered_textdomains = NULL;
+static const char *system_codeset = "invalid";
+
 /*
- * On Windows, we need to explicitly bind gettext to the correct
- * encoding, because gettext() tends to get confused.
+ * Register a gettext textdomain with the backend. We will call
+ * bind_textdomain_codeset() for it to ensure that translated strings
+ * are returned in the right encoding.
  */
 void
-pg_bind_textdomain_codeset(const char *domainname, int encoding)
+pg_register_textdomain(const char *domainname)
 {
-#if defined(ENABLE_NLS) && defined(WIN32)
+#if defined(ENABLE_NLS)
 	int			i;
+	MemoryContext old_cxt;
+
+	old_cxt = MemoryContextSwitchTo(TopMemoryContext);
+	if (registered_textdomains == NULL)
+	{
+		registered_textdomains = palloc(sizeof(char *) * 1);
+		registered_textdomains[0] = NULL;
+	}
 
-	for (i = 0; i < lengthof(codeset_map_array); i++)
+	for (i = 0; registered_textdomains[i] != NULL; i++)
 	{
-		if (codeset_map_array[i].encoding == encoding)
+		/* Ignore if already bound */
+		if (strcmp(registered_textdomains[i], domainname) == 0)
+			return;
+	}
+	registered_textdomains = repalloc(registered_textdomains,
+									  (i + 2) * sizeof(char *));
+	registered_textdomains[i] = pstrdup(domainname);
+	registered_textdomains[i + 1] = NULL;
+
+	MemoryContextSwitchTo(old_cxt);
+
+	if (GetDatabaseEncoding() != PG_SQL_ASCII)
+	{
+		if (bind_textdomain_codeset(domainname, system_codeset) == NULL)
+			elog(LOG, "bind_textdomain_codeset failed");
+	}
+#endif
+}
+
+/*
+ * Set the codeset used for strings returned by gettext() to match the
+ * database encoding.
+ *
+ * In theory this should only depend on the database encoding, but because
+ * of the way we use gettext() to find the corresponding OS codeset name, we
+ * also need LC_MESSAGES to be set correctly for this to work. Because of
+ * that, pg_init_gettext_codeset() should be called after any changes to
+ * LC_MESSAGES.
+ */
+void
+pg_init_gettext_codeset(void)
+{
+#if defined(ENABLE_NLS)
+	int			i;
+
+	/*
+	 * SQL_ASCII encoding is special. In that case we do nothing, and let
+	 * gettext() pick the codeset from LC_CTYPE.
+	 */
+	if (GetDatabaseEncoding() == PG_SQL_ASCII)
+		return;
+
+	/*
+	 * Find a codeset name for the database encoding that
+	 * bind_textdomain_codeset() recognizes.
+	 *
+	 * Unfortunately there's no handy interface to list all the codesets
+	 * in the system. 'locale -m' or 'iconv --list' do that, but we don't
+	 * want to call external programs here. So we try every alias for the
+	 * encoding that we know until we find one that works.
+	 *
+	 * Unfortunately bind_textdomain_codeset() doesn't return any error code
+	 * when given an invalid codeset name, so we have to work a bit harder
+	 * to check if a codeset name works. We call gettext("") after
+	 * bind_textdomain_codeset(), and check that it returned a translated
+	 * string other than "". Empty string is a special value in .po files
+	 * that is present in all translations: it translates into a string with
+	 * meta-information about the translation, like author and creation date.
+	 */
+	system_codeset = NULL;
+	for (i = 0; encoding_match_list[i].system_enc_name; i++)
+	{
+		if (encoding_match_list[i].pg_enc_code != GetDatabaseEncoding())
+			continue;
+
+		if (bind_textdomain_codeset(textdomain(NULL),
+									encoding_match_list[i].system_enc_name) != NULL)
 		{
-			if (bind_textdomain_codeset(domainname,
-										codeset_map_array[i].codeset) == NULL)
+			const char *str = gettext("");
+			if (strcmp(str, "") != 0)
+			{
+				/* great, it worked */
+				system_codeset = encoding_match_list[i].system_enc_name;
+				break;
+			}
+		}
+	}
+
+	if (system_codeset == NULL)
+	{
+		elog(DEBUG1, "failed to find a system codeset name for encoding \"%s\"",
+			 GetDatabaseEncodingName());
+		system_codeset = "invalid";
+	}
+
+	/*
+	 * Bind all textdomains in use to the new codeset. This is done even if
+	 * no valid codeset name was found, to force gettext() to revert to
+	 * ascii English.
+	 */
+	if (registered_textdomains != NULL)
+	{
+		for (i = 0; registered_textdomains[i] != NULL; i++)
+		{
+			if (bind_textdomain_codeset(registered_textdomains[i],
+										system_codeset) == NULL)
 				elog(LOG, "bind_textdomain_codeset failed");
-			break;
 		}
 	}
 #endif
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 76322c9..8fcfa52 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -392,7 +392,8 @@ extern const char *pg_get_client_encoding_name(void);
 extern void SetDatabaseEncoding(int encoding);
 extern int	GetDatabaseEncoding(void);
 extern const char *GetDatabaseEncodingName(void);
-extern void pg_bind_textdomain_codeset(const char *domainname, int encoding);
+extern void pg_register_textdomain(const char *domainname);
+extern void pg_init_gettext_codeset(void);
 
 extern int	pg_valid_client_encoding(const char *name);
 extern int	pg_valid_server_encoding(const char *name);
diff --git a/src/include/port.h b/src/include/port.h
index 0557dd2..cbd72bd 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -422,6 +422,14 @@ extern void qsort_arg(void *base, size_t nel, size_t elsize,
 		  qsort_arg_comparator cmp, void *arg);
 
 /* port/chklocale.c */
+
+struct encoding_match
+{
+	int			pg_enc_code;
+	const char *system_enc_name;
+};
+extern const struct encoding_match encoding_match_list[];
+
 extern int	pg_get_encoding_from_locale(const char *ctype);
 
 #endif   /* PG_PORT_H */
diff --git a/src/port/chklocale.c b/src/port/chklocale.c
index 78410df..4469e89 100644
--- a/src/port/chklocale.c
+++ b/src/port/chklocale.c
@@ -35,15 +35,12 @@
  * numbers (CPnnn).
  *
  * Note that we search the table with pg_strcasecmp(), so variant
- * capitalizations don't need their own entries.
+ * capitalizations don't need their own entries. XXX: Now that we also
+ * use this to map from pg encoding code to system name, do we need to
+ * include different capitalizations?
  */
-struct encoding_match
-{
-	enum pg_enc pg_enc_code;
-	const char *system_enc_name;
-};
 
-static const struct encoding_match encoding_match_list[] = {
+const struct encoding_match encoding_match_list[] = {
 	{PG_EUC_JP, "EUC-JP"},
 	{PG_EUC_JP, "eucJP"},
 	{PG_EUC_JP, "IBM-eucJP"},

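Outside the backend, the probing idea in pg_init_gettext_codeset() can be tried with a few lines of plain C. This is only a sketch of the technique, not the patch itself: probe_codeset and its NULL-terminated candidate array are made-up names, and it uses dgettext() on an explicit domain where the patch uses gettext() on the default domain. Reports later in this thread suggest the check is not reliable with the Windows gettext build, so treat the result as a hint rather than proof.

```c
#include <assert.h>
#include <libintl.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: try each candidate codeset name for `domain` and
 * return the first one under which gettext still finds the catalog's
 * header entry (msgid "").  Returns NULL if none works, which is also the
 * result when no catalog is installed for the domain at all.
 */
const char *
probe_codeset(const char *domain, const char *const *candidates)
{
	size_t		i;

	assert(domain != NULL && candidates != NULL);

	for (i = 0; candidates[i] != NULL; i++)
	{
		/* NULL here means the call itself failed, not a bad name. */
		if (bind_textdomain_codeset(domain, candidates[i]) == NULL)
			continue;

		/* An untranslated "" comes back as "": translation is off. */
		if (strcmp(dgettext(domain, ""), "") != 0)
			return candidates[i];
	}
	return NULL;
}
```

A caller would first select a message locale with setlocale(LC_MESSAGES, ...) and point the domain at its catalogs with bindtextdomain(), then pass the known spellings of one encoding, for example "ISO-8859-1", "LATIN1" and "ISO8859-1".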
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > One problem with this idea is that it may be hard to coerce gettext into
> > putting a particular string at the top of the file :-(
>
> I doubt we can, which is why the documentation needs to tell translators
> about it.

I doubt that documenting the issue will be enough (in fact I'm pretty
sure it won't). Maybe we can just supply the string translated in our
POT files, and add a comment that the translator is not supposed to
touch it. This doesn't seem all that difficult -- I think it just
requires that we add a msgmerge step to "make update-po" that uses a
file on which the message has already been translated.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>> Tom Lane wrote:
>>>> Maybe use a special string "Translate Me First" that
>>>> doesn't actually need to be end-user-visible, just so no one sweats
>>>> over
>>>> getting it right in context.
>>
>>> Yep, something like that. There seems to be a magic empty string
>>> translation at the beginning of every po file that returns the
>>> meta-information about the translation, like translation author and
>>> date. Assuming that works reliably, I'll use that.
>>
>> At first that sounded like an ideal answer, but I can see a gotcha:
>> suppose the translation's author's name contains some characters that
>> don't convert to the database encoding. I suppose that would result in
>> failure, when we'd prefer it not to. A single-purpose string could be
>> documented as "whatever you translate this to should be pure ASCII,
>> never mind if it's sensible".
>
> I just tried that, and it seems that gettext() does transliteration, so
> any characters that have no counterpart in the database encoding will be
> replaced with something similar, or question marks.
> Assuming that's
> universal across platforms, and I think it is, using the empty string
> should work.
>
> It also means that you can use lc_messages='ja' with
> server_encoding='latin1', but it will be unreadable because all the
> non-ascii characters are replaced with question marks.

It doesn't occur in the current Windows environment. As for Windows
gnu gettext which we are using, we would see the original msgid when
iconv can't convert the msgstr to the target codeset.

set client_encoding to utf_8;
SET
show server_encoding;
 server_encoding
-----------------
 LATIN1
(1 row)

show lc_messages;
    lc_messages
--------------------
 Japanese_Japan.932
(1 row)

1;
ERROR: syntax error at or near "1"
LINE 1: 1;

OTOH when the server encoding is utf8 then

set client_encoding to utf_8;
SET
show server_encoding;
 server_encoding
-----------------
 UTF8
(1 row)

show lc_messages;
    lc_messages
--------------------
 Japanese_Japan.932
(1 row)

1;
ERROR: "1"またはその近辺で構文エラー
LINE 1: 1;
        ^

Hiroshi Inoue <inoue@tpf.co.jp> writes:
> Heikki Linnakangas wrote:
>> I just tried that, and it seems that gettext() does transliteration, so
>> any characters that have no counterpart in the database encoding will be
>> replaced with something similar, or question marks.

> It doesn't occur in the current Windows environment. As for Windows
> gnu gettext which we are using, we would see the original msgid when
> iconv can't convert the msgstr to the target codeset.

Well, if iconv has no conversion to the codeset at all then there is no
point in selecting that particular codeset setting anyway. The question
was about whether we can distinguish "no conversion available" from
"conversion available, but the test string has some unconvertible
characters".

			regards, tom lane

Tom Lane wrote:
> Hiroshi Inoue <inoue@tpf.co.jp> writes:
>> Heikki Linnakangas wrote:
>>> I just tried that, and it seems that gettext() does transliteration, so
>>> any characters that have no counterpart in the database encoding will be
>>> replaced with something similar, or question marks.
>
>> It doesn't occur in the current Windows environment. As for Windows
>> gnu gettext which we are using, we would see the original msgid when
>> iconv can't convert the msgstr to the target codeset.
>
> Well, if iconv has no conversion to the codeset at all then there is no
> point in selecting that particular codeset setting anyway. The question
> was about whether we can distinguish "no conversion available" from
> "conversion available, but the test string has some unconvertible
> characters".

What I meant is that we would see no '?' when we use Windows gnu
gettext. Whether a conversion is available or not depends on the
individual msgid. For example, when the Japanese msgstr corresponding to
a msgid happens to contain only ASCII characters, Windows gnu gettext
will use the msgstr, not the original msgid.

Heikki Linnakangas wrote: > Tom Lane wrote: >> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >>> Tom Lane wrote: >>>> Maybe use a special string "Translate Me First" that >>>> doesn't actually need to be end-user-visible, just so no one sweats >>>> over >>>> getting it right in context. >> >>> Yep, something like that. There seems to be a magic empty string >>> translation at the beginning of every po file that returns the >>> meta-information about the translation, like translation author and >>> date. Assuming that works reliably, I'll use that. >> >> At first that sounded like an ideal answer, but I can see a gotcha: >> suppose the translation's author's name contains some characters that >> don't convert to the database encoding. I suppose that would result in >> failure, when we'd prefer it not to. A single-purpose string could be >> documented as "whatever you translate this to should be pure ASCII, >> never mind if it's sensible". > > I just tried that, and it seems that gettext() does transliteration, so > any characters that have no counterpart in the database encoding will be > replaced with something similar, or question marks. Assuming that's > universal across platforms, and I think it is, using the empty string > should work. > > It also means that you can use lc_messages='ja' with > server_encoding='latin1', but it will be unreadable because all the > non-ascii characters are replaced with question marks. For something > like lc_messages='es_ES' and server_encoding='koi8-r', it will still > look quite nice. > > Attached is a patch I've been testing. Seems to work quite well. It > would be nice if someone could test it on Windows, which seems to be a > bit special in this regard. Unfortunately it doesn't seem to work on Windows. First any combination of valid lc_messages and non-existent encoding passes the test strcmp(gettext(""), "") != 0 . 
Second, for example, the combination of ja (lc_messages) and ISO-8859-1 passes the test, but the test fails after I changed the last_translator part of the ja message catalog to contain Japanese kanji characters. regards, Hiroshi Inoue
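For readers following along, the probe being discussed can be sketched roughly as below. This is only an illustration under assumptions, not the patch that was attached: the helper name catalog_header_visible is hypothetical, and only the strcmp(gettext(""), "") != 0 test comes from the thread. As reported above, the result cannot be trusted on Windows.

```c
#include <libintl.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): returns 1 if the catalog
 * header for `domain` is visible after binding the domain to `codeset`,
 * 0 otherwise.
 */
int
catalog_header_visible(const char *domain, const char *codeset)
{
	const char *header;

	/* Ask gettext to deliver this domain's messages in `codeset`. */
	bind_textdomain_codeset(domain, codeset);

	/* The empty msgid maps to the catalog's metadata entry. */
	header = dgettext(domain, "");

	/* With no usable translation, the msgid (still "") comes back. */
	return strcmp(header, "") != 0;
}
```

A nonzero result only says that some non-empty header came back; it does not prove that the codeset name was recognized.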
On Monday 30 March 2009 15:52:37 Heikki Linnakangas wrote: > What is happening is that gettext() returns the message in the encoding > determined by LC_CTYPE, while we expect it to return it in the database > encoding. Starting with PG 8.3 we enforce that the encoding specified in > LC_CTYPE matches the database encoding, but not for the C locale. > > In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding() > which fixes that, but we only do it on Windows. In earlier versions we > called it on all platforms, but only for UTF-8. It seems that we should > call bind_textdomain_codeset on all platforms and all encodings. > However, there seems to be a reason why we only do it for Windows on CVS > HEAD: we need a mapping from our encoding ID to the OS codeset name, and > the OS codeset names vary. > > How can we make this more robust? Another approach might be to create a new configuration parameter that basically tells what encoding to call bind_textdomain_codeset() with, say server_encoding_for_gettext. If that is not set, you just use server_encoding as is and hope that gettext() takes it (which it would in most cases, I guess).
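A minimal sketch of that fallback, assuming the proposed setting arrives as a plain string that is NULL or empty when unset. The function name gettext_codeset_name is hypothetical; server_encoding_for_gettext is only the name suggested above and does not exist as a parameter.

```c
#include <stddef.h>

/*
 * Pick the codeset name to hand to bind_textdomain_codeset().
 * `override` stands for the proposed server_encoding_for_gettext
 * setting (NULL or "" when unset); `server_encoding` is the database
 * encoding name, used as is when there is no override.
 */
const char *
gettext_codeset_name(const char *override, const char *server_encoding)
{
	if (override != NULL && override[0] != '\0')
		return override;
	return server_encoding;
}
```

For example, gettext_codeset_name("ISO-8859-1", "LATIN1") would select "ISO-8859-1", while passing NULL as the override leaves "LATIN1" to be tried directly.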
Hiroshi Inoue wrote: > Heikki Linnakangas wrote: >> Tom Lane wrote: >>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >>>> Tom Lane wrote: >>>>> Maybe use a special string "Translate Me First" that >>>>> doesn't actually need to be end-user-visible, just so no one sweats >>>>> over >>>>> getting it right in context. >>> >>>> Yep, something like that. There seems to be a magic empty string >>>> translation at the beginning of every po file that returns the >>>> meta-information about the translation, like translation author and >>>> date. Assuming that works reliably, I'll use that. >>> >>> At first that sounded like an ideal answer, but I can see a gotcha: >>> suppose the translation's author's name contains some characters that >>> don't convert to the database encoding. I suppose that would result in >>> failure, when we'd prefer it not to. A single-purpose string could be >>> documented as "whatever you translate this to should be pure ASCII, >>> never mind if it's sensible". >> >> I just tried that, and it seems that gettext() does transliteration, >> so any characters that have no counterpart in the database encoding >> will be replaced with something similar, or question marks. Assuming >> that's universal across platforms, and I think it is, using the empty >> string should work. >> >> It also means that you can use lc_messages='ja' with >> server_encoding='latin1', but it will be unreadable because all the >> non-ascii characters are replaced with question marks. For something >> like lc_messages='es_ES' and server_encoding='koi8-r', it will still >> look quite nice. >> >> Attached is a patch I've been testing. Seems to work quite well. It >> would be nice if someone could test it on Windows, which seems to be a >> bit special in this regard. > > Unfortunately it doesn't seem to work on Windows. Is it inappropriate to call iconv_open() to check if the codeset is valid for bind_textdomain_codeset()? regards, Hiroshi Inoue
On Monday 30 March 2009 15:52:37 Heikki Linnakangas wrote: > In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding() > which fixes that, but we only do it on Windows. In earlier versions we > called it on all platforms, but only for UTF-8. It seems that we should > call bind_textdomain_codeset on all platforms and all encodings. > However, there seems to be a reason why we only do it for Windows on CVS > HEAD: we need a mapping from our encoding ID to the OS codeset name, and > the OS codeset names vary. In practice you get either the GNU or the Solaris version of gettext, and at least the GNU version can cope with all the encoding names that the currently Windows-only code path produces. So enabling the Windows code path for all platforms when ENABLE_NLS is on and LC_CTYPE is C would appear to work in sufficiently many cases.
Peter Eisentraut wrote: > In practice you get either the GNU or the Solaris version of gettext, and at > least the GNU version can cope with all the encoding names that the currently > Windows-only code path produces. It doesn't. On my laptop running Debian testing: hlinnaka@heikkilaptop:~$ LC_ALL=fi_FI.UTF-8 gettext gettext: ei riittävästi argumentteja hlinnaka@heikkilaptop:~$ LC_ALL=fi_FI.LATIN1 gettext gettext: missing arguments hlinnaka@heikkilaptop:~$ LC_ALL=fi_FI.ISO-8859-1 gettext gettext: ei riitt�v�sti argumentteja Using the name for the latin1 encoding in the currently Windows-only mapping table, "LATIN1", you get no translation because that name is not recognized by the system. Using the other name "ISO-8859-1", it works. "LATIN1" is not listed in the output of locale -m either. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tuesday 07 April 2009 11:21:25 Heikki Linnakangas wrote: > Peter Eisentraut wrote: > > In practice you get either the GNU or the Solaris version of gettext, and > > at least the GNU version can cope with all the encoding names that the > > currently Windows-only code path produces. > > It doesn't. On my laptop running Debian testing: > > hlinnaka@heikkilaptop:~$ LC_ALL=fi_FI.UTF-8 gettext > gettext: ei riittävästi argumentteja > hlinnaka@heikkilaptop:~$ LC_ALL=fi_FI.LATIN1 gettext > gettext: missing arguments That is because no locale by the name fi_FI.LATIN1 exists. > hlinnaka@heikkilaptop:~$ LC_ALL=fi_FI.ISO-8859-1 gettext > gettext: ei riitt�v�sti argumentteja > > Using the name for the latin1 encoding in the currently Windows-only > mapping table, "LATIN1", you get no translation because that name is not > recognized by the system. Using the other name "ISO-8859-1", it works. > "LATIN1" is not listed in the output of locale -m either. You are looking in the wrong place. What we need is for iconv to recognize the encoding name used by PostgreSQL. iconv --list is the primary hint for that. The locale names provided by the operating system are arbitrary and unrelated.
Hiroshi Inoue wrote: > Heikki Linnakangas wrote: >> I just tried that, and it seems that gettext() does transliteration, >> so any characters that have no counterpart in the database encoding >> will be replaced with something similar, or question marks. Assuming >> that's universal across platforms, and I think it is, using the empty >> string should work. >> >> It also means that you can use lc_messages='ja' with >> server_encoding='latin1', but it will be unreadable because all the >> non-ascii characters are replaced with question marks. For something >> like lc_messages='es_ES' and server_encoding='koi8-r', it will still >> look quite nice. >> >> Attached is a patch I've been testing. Seems to work quite well. It >> would be nice if someone could test it on Windows, which seems to be a >> bit special in this regard. > > Unfortunately it doesn't seem to work on Windows. > > First any combination of valid lc_messages and non-existent encoding > passes the test strcmp(gettext(""), "") != 0 . Now that's strange. Can you check what gettext("") returns in that case then? > Second for example the combination of ja(lc_messages) and ISO-8859-1 > passes the the test but the test fails after I changed the last_trans > lator part of ja message catalog to contain Japanese kanji characters. Yeah, the inconsistency is not nice. In practice, though, if you try to use an encoding that can't represent kanji characters with Japanese, you're better off falling back to English than displaying strings full of question marks. The same goes for all other languages as well, IMHO. If you're going to fall back to English for some translations (and in practice "some" is a pretty high percentage) because the encoding is missing a character and transliteration is not working, you might as well not bother translating at all. 
If we add the dummy translations to all .po files, we could force fallback-to-English in situations like that by including some or all of the non-ASCII characters used in the language in the dummy translation. I'm thinking of going ahead with this approach, without the dummy translation, after we have resolved the first issue on Windows. We can add the dummy translations later if needed, but I don't think anyone will care. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Peter Eisentraut wrote:
> On Tuesday 07 April 2009 11:21:25 Heikki Linnakangas wrote:
>> Using the name for the latin1 encoding in the currently Windows-only
>> mapping table, "LATIN1", you get no translation because that name is not
>> recognized by the system. Using the other name "ISO-8859-1", it works.
>> "LATIN1" is not listed in the output of locale -m either.
>
> You are looking in the wrong place. What we need is for iconv to recognize
> the encoding name used by PostgreSQL. iconv --list is the primary hint for
> that.
>
> The locale names provided by the operating system are arbitrary and unrelated.

Oh, ok. I guess we can do the simple fix you proposed then.

Patch attached. Instead of checking for LC_CTYPE == C, I'm checking
"pg_get_encoding_from_locale(NULL) == encoding" which is more close to
what we actually want. The downside is that
pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is
that we don't need to keep this in sync with the rules we have in CREATE
DATABASE that enforce that locale matches encoding.

This doesn't include the cleanup to make the mapping table easier to
maintain that Magnus was going to have a look at before I started this
thread.

--
Heikki Linnakangas
EnterpriseDB
http://www.enterprisedb.com

*** a/src/backend/utils/mb/mbutils.c
--- b/src/backend/utils/mb/mbutils.c
***************
*** 890,896 **** cliplen(const char *str, int len, int limit)
  	return l;
  }
  
! #if defined(ENABLE_NLS) && defined(WIN32)
  static const struct codeset_map {
  	int encoding;
  	const char *codeset;
--- 890,896 ----
  	return l;
  }
  
! #if defined(ENABLE_NLS)
  static const struct codeset_map {
  	int encoding;
  	const char *codeset;
***************
*** 929,935 **** static const struct codeset_map {
  	{PG_EUC_TW, "EUC-TW"},
  	{PG_EUC_JIS_2004, "EUC-JP"}
  };
! #endif /* WIN32 */
  
  void
  SetDatabaseEncoding(int encoding)
--- 929,935 ----
  	{PG_EUC_TW, "EUC-TW"},
  	{PG_EUC_JIS_2004, "EUC-JP"}
  };
! #endif /* ENABLE_NLS */
  
  void
  SetDatabaseEncoding(int encoding)
***************
*** 946,960 **** SetDatabaseEncoding(int encoding)
  }
  
  /*
!  * On Windows, we need to explicitly bind gettext to the correct
!  * encoding, because gettext() tends to get confused.
   */
  void
  pg_bind_textdomain_codeset(const char *domainname, int encoding)
  {
! #if defined(ENABLE_NLS) && defined(WIN32)
  	int i;
  
  	for (i = 0; i < lengthof(codeset_map_array); i++)
  	{
  		if (codeset_map_array[i].encoding == encoding)
--- 946,975 ----
  }
  
  /*
!  * Bind gettext to the correct encoding.
   */
  void
  pg_bind_textdomain_codeset(const char *domainname, int encoding)
  {
! #if defined(ENABLE_NLS)
  	int i;
  
+ 	/*
+ 	 * gettext() uses the encoding specified by LC_CTYPE by default,
+ 	 * so if that matches the database encoding, we don't need to do
+ 	 * anything. This is not for performance, but because if
+ 	 * bind_textdomain_codeset() doesn't recognize the codeset name we
+ 	 * pass it, it will fall back to English and we don't want that to
+ 	 * happen unnecessarily.
+ 	 *
+ 	 * On Windows, though, gettext() tends to get confused so we always
+ 	 * bind it.
+ 	 */
+ #ifndef WIN32
+ 	if (pg_get_encoding_from_locale(NULL) == encoding)
+ 		return;
+ #endif
+ 
  	for (i = 0; i < lengthof(codeset_map_array); i++)
  	{
  		if (codeset_map_array[i].encoding == encoding)
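The patch is cut off just as the lookup loop begins, so here is a standalone sketch of how such a table lookup might look. It is not the PostgreSQL source: the enum values, struct name, and function name are hypothetical stand-ins, and only the two codeset strings visible in the hunk above are taken from the patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for PostgreSQL's encoding IDs. */
enum sketch_encoding
{
	SK_EUC_TW,
	SK_EUC_JIS_2004,
	SK_SQL_ASCII
};

struct codeset_entry
{
	int			encoding;
	const char *codeset;
};

/* Only the two entries visible in the patch; the real table is longer. */
static const struct codeset_entry codeset_entries[] = {
	{SK_EUC_TW, "EUC-TW"},
	{SK_EUC_JIS_2004, "EUC-JP"}
};

/* Return the codeset name for an encoding ID, or NULL if unmapped. */
const char *
codeset_name_for(int encoding)
{
	size_t		i;

	for (i = 0; i < sizeof(codeset_entries) / sizeof(codeset_entries[0]); i++)
	{
		if (codeset_entries[i].encoding == encoding)
			return codeset_entries[i].codeset;
	}
	return NULL;
}
```

Returning NULL for an unmapped encoding lets the caller skip the bind_textdomain_codeset() call rather than pass it a name it may not recognize.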
Heikki Linnakangas wrote: > Hiroshi Inoue wrote: >> Heikki Linnakangas wrote: >>> I just tried that, and it seems that gettext() does transliteration, >>> so any characters that have no counterpart in the database encoding >>> will be replaced with something similar, or question marks. Assuming >>> that's universal across platforms, and I think it is, using the empty >>> string should work. >>> >>> It also means that you can use lc_messages='ja' with >>> server_encoding='latin1', but it will be unreadable because all the >>> non-ascii characters are replaced with question marks. For something >>> like lc_messages='es_ES' and server_encoding='koi8-r', it will still >>> look quite nice. >>> >>> Attached is a patch I've been testing. Seems to work quite well. It >>> would be nice if someone could test it on Windows, which seems to be >>> a bit special in this regard. >> >> Unfortunately it doesn't seem to work on Windows. >> >> First any combination of valid lc_messages and non-existent encoding >> passes the test strcmp(gettext(""), "") != 0 . > > Now that's strange. Can you check what gettext("") returns in that case > then? Translated but not converted string. I'm not sure if it's a bug or not. I can see no description what should be returned in such case. >> Second for example the combination of ja(lc_messages) and ISO-8859-1 >> passes the the test but the test fails after I changed the last_trans >> lator part of ja message catalog to contain Japanese kanji characters. > > Yeah, the inconsistency is not nice. In practice, though, if you try to > use an encoding that can't represent kanji characters with Japanese, > you're better off falling back to English than displaying strings full > of question marks. The same goes for all other languages as well, IMHO. 
> If you're going to fall back to English for some translations (and in > practice "some" is a pretty high percentage) because the encoding is > missing a character and transliteration is not working, you might as > well not bother translating at all. What is wrong with checking if the codeset is valid using iconv_open()? regards, Hiroshi Inoue
Hiroshi Inoue wrote: > What is wrong with checking if the codeset is valid using iconv_open()? That would probably work as well. We'd have to decide what we'd try to convert from with iconv_open(). UTF-8 might be a safe choice. We don't currently use iconv_open() anywhere in the backend, though, so I'm hesitant to add a dependency for this. GNU gettext() uses iconv, but I'm not sure if that's true for all gettext() implementations. Peter's suggestion seems the best ATM, though. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
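The iconv_open() check under discussion could be sketched as below, assuming UTF-8 as the source encoding as floated above. The function name codeset_known_to_iconv is hypothetical, and nothing like this was committed. It also only tells you what iconv knows, which need not match what a non-GNU gettext() accepts.

```c
#include <iconv.h>

/*
 * Hypothetical check: returns 1 if iconv can open a conversion from
 * UTF-8 to `codeset`, 0 otherwise.
 */
int
codeset_known_to_iconv(const char *codeset)
{
	/* iconv_open() takes the target codeset first, then the source. */
	iconv_t		cd = iconv_open(codeset, "UTF-8");

	if (cd == (iconv_t) -1)
		return 0;			/* unknown name or no such conversion */

	iconv_close(cd);
	return 1;
}
```

A zero result does not distinguish "name not recognized" from "no conversion from UTF-8 to that codeset".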
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Hiroshi Inoue wrote: >> What is wrong with checking if the codeset is valid using iconv_open()? > That would probably work as well. We'd have to decide what we'd try to > convert from with iconv_open(). The problem I have with that is that you are now guessing at *two* platform-specific encoding names not one, plus hoping there is a conversion between the two. If we knew the encoding name embedded in the .mo file we wanted to use, then it would be sensible to try to use that as the source codeset. > GNU gettext() uses iconv, but I'm > not sure if that's true for all gettext() implementations. Yeah, that's another problem. regards, tom lane
On Tuesday 07 April 2009 13:09:42 Heikki Linnakangas wrote: > Patch attached. Instead of checking for LC_CTYPE == C, I'm checking > "pg_get_encoding_from_locale(NULL) == encoding" which is more close to > what we actually want. The downside is that > pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is > that we don't need to keep this in sync with the rules we have in CREATE > DATABASE that enforce that locale matches encoding. I would have figured we can skip this whole thing when LC_CTYPE != C, because it should be guaranteed that LC_CTYPE matches the database encoding in this case, no? Other than that, I think this patch is good.
Peter Eisentraut wrote: > On Tuesday 07 April 2009 13:09:42 Heikki Linnakangas wrote: >> Patch attached. Instead of checking for LC_CTYPE == C, I'm checking >> "pg_get_encoding_from_locale(NULL) == encoding" which is more close to >> what we actually want. The downside is that >> pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is >> that we don't need to keep this in sync with the rules we have in CREATE >> DATABASE that enforce that locale matches encoding. > > I would have figured we can skip this whole thing when LC_CTYPE != C, because > it should be guaranteed that LC_CTYPE matches the database encoding in this > case, no? Yes, except if pg_get_encoding_from_locale() couldn't figure out what PG encoding LC_CTYPE corresponds to. We let CREATE DATABASE go ahead in that case, trusting that the user knows what he's doing. I suppose we can extend that trust to this case too, and assume that the encoding of LC_CTYPE actually matches the database encoding. Or if the encoding is UTF-8 and you're running on Windows, although on Windows we want to always call bind_textdomain_codeset(). Or if the database encoding is SQL_ASCII, although in that case we don't want to call bind_textdomain_codeset() either. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
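The cases listed above can be condensed into a small decision function. This is a sketch under assumptions, not committed code: the function name and its boolean parameters are hypothetical, and letting SQL_ASCII take precedence over the Windows rule is a guess, since the thread does not say which wins.

```c
#include <stdbool.h>

/*
 * Hypothetical summary of when to call bind_textdomain_codeset():
 *   - never for SQL_ASCII databases (assumed to win over Windows),
 *   - always on Windows,
 *   - elsewhere only when LC_CTYPE is C, because any other LC_CTYPE
 *     is trusted to match the database encoding already.
 */
bool
should_bind_codeset(bool is_windows, bool encoding_is_sql_ascii,
					bool lc_ctype_is_c)
{
	if (encoding_is_sql_ascii)
		return false;
	if (is_windows)
		return true;
	return lc_ctype_is_c;
}
```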
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Hiroshi Inoue wrote: >>> What is wrong with checking if the codeset is valid using iconv_open()? > >> That would probably work as well. We'd have to decide what we'd try to >> convert from with iconv_open(). > > The problem I have with that is that you are now guessing at *two* > platform-specific encoding names not one, plus hoping there is a > conversion between the two. AFAIK iconv_open() supports all combinations of the valid encoding values. Or we may be able to check it using the same encoding for both from and to. regards, Hiroshi Inoue
Peter Eisentraut wrote: > On Tuesday 07 April 2009 13:09:42 Heikki Linnakangas wrote: >> Patch attached. Instead of checking for LC_CTYPE == C, I'm checking >> "pg_get_encoding_from_locale(NULL) == encoding" which is more close to >> what we actually want. The downside is that >> pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is >> that we don't need to keep this in sync with the rules we have in CREATE >> DATABASE that enforce that locale matches encoding. > > I would have figured we can skip this whole thing when LC_CTYPE != C, because > it should be guaranteed that LC_CTYPE matches the database encoding in this > case, no? Ok, committed it like that after all. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com