Thread: Internationalisation of database content (text columns)

Internationalisation of database content (text columns)

From
Wolfgang Keller
Date:
(Since I got no replies on the novice list, I repost this question here)

Hello,

I'm working on a database schema which contains lots of "type code lookup" tables. The entries of these tables are also
hierarchicallyrelated among themselves (subtype/supertype), to store rather large and quite complex taxonomies. 

Currently the schema does not provide for translation of the entries in these tables, although it is meant to be used
forinter-company data exchange, i.e. for use by many different people having all different native languages. When
searchingfor "best practice" guides about the topic of transparently translating database content, I came across about
thisexample: 

http://rwec.co.uk/pg-atom/i18n.html

The basic method looks pretty intuitive to me, except that I don't understand why it wouldn't be more obvious to use a
compositetype, with language codes as column names, instead of an array for the custom "translated text" data type. 

Any opinions from more experienced "mahouts"? Any obvious drawbacks of using such a composite type? Other than the fact
thatthe schema has to be modified to add languages, which wouldn't be that desastrous in this case. 

Are there any other (better?) "best practice" methods for the internationalisation of database content? Maybe methods
thatuse capabilities (e.g. text search/thesaurus?) already built-into PostgreSQL?  

BTW: Methods that use a single table to hold all translations for all strings in all tables of the entire schema are
notvery useful in this case, since it can't be excluded that depending on the context (i.e. the specific semantics of
thespecific "type code lookup" table) the translation of one and the same string in one language will be different in
otherlanguages. 

TIA,

Sincerely,

Wolfgang Keller

Re: Internationalisation of database content (text columns)

From
Karsten Hilbert
Date:
On Fri, Oct 08, 2010 at 12:12:54PM +0200, Wolfgang Keller wrote:

> I'm working on a database schema which contains lots of
> "type code lookup" tables. The entries of these tables are
> also hierarchically related among themselves
> (subtype/supertype), to store rather large and quite complex
> taxonomies.

.From my experience it depends. Either you've got a "coding
system" which you need to provide localized "translations"
for or else you've got arbitrary type code lookups.

With coding systems it is typically not really a translation
of the coded term but rather *another* term people attach to
the same code - incidentally when using another language.
Terms in one language change while they don't change in
another. Think of the code as defining a class with all the
local language terms being ever-so-slightly different things
all belonging into that class (eg. while "back pain" and
"Kreuzschmerz" aren't considered translations of each other
*medically* they can well be considered to group under the
same ICD-10 code). Thus I've found this general scheme to
work well:

create table coded_term (
    pk serial primary key,
    code text,
    term text,
    lang text,
    fk_coding_system integer
        references coding_system(pk),
    unique(code, term, lang, fk_coding_system)
);

(it can be argued whether lang should fold into coding_system)

If it's about arbitrary lookup values for codes I am using a
gettext version rewritten in pgsql similar to this:

create table lut_colors (
    pk serial primary key
    color text
);

create view v_lut_colors as
select
    pk
        as pk_lut_color,
    color
        as color,
    _(color)
        as l10n_color
from
    lut_colors;

(you don't need the view or you don't need it in this way but
 it's useful)

Now you guessed it: _() is a plpgsql function which does a
translation table lookup based on the database account (or a
passed in user name) and a pre-configured (or passed in)
language per said account/user name. It falls back from,
say, "de_DE" to, say, "de" to returning the original string.

The translation table is filled this way:

    select i18n_upd_tx('de_DE', 'blue', 'blau');
    select i18n_upd_tx('de_DE', 'grey', 'grau');

It doesn't really matter which language is used as the
"original" lookup language as long as a translation exists
for the desired target language:

    select i18n_upd_tx('en', '1ö34kjafg8', 'yellow');

will properly make

    select _('1ö34kjafg8', 'en');

return "yellow".

All the code for this is to be found in the git repository
for GNUmed at gitorious:

    http://gitorious.org/gnumed

> BTW: Methods that use a single table to hold all
> translations for all strings in all tables of the entire
> schema are not very useful in this case, since it can't be
> excluded that depending on the context (i.e. the specific
> semantics of the specific "type code lookup" table) the
> translation of one and the same string in one language will
> be different in other languages.

Well, either add in a context field to the _()/i18n_upd_tx()
approach or consider using the coding system approach. You
might even figure out a way to use the tableoid in the
translation function:

create view v_lut_colors as
select
    pk
        as pk_lut_color,
    color
        as color,
    _(color, lut_colors.tableoid)
        as l10n_color
from
    lut_colors;


This would require applying the tableoid when adding
translations though.


Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346