Thread: advice on schema for multilingual text

advice on schema for multilingual text

From
"Daniel McBrearty"
Date:
Hi

I have a website that has multilingual text stored in the database. Currently I just have a flat table (lets called it "translations"), one row per text item, one column per language. This works OK, for now, but I am looking at a redesign. Mostly I want to keep information about the languages in teh db as well, so that look like an extra table, one row per lang.

The problem that now arises is that there is an expected correlation between the "languages" and "translations" tables - there should be a row in languages for each column of translations. AFAIK (could well be wrong, I am no expert in db theory) there is no real way to express in the ddl. Not ideal.

An alternative layout would now be to lose the "translations" table, and have two tables in place; one called "base_text" containing the text to be translated, and another called, say, "tx_text" which contains the translations. Each row of "tx_text" references both "base_text" and also "languages".

This looks like a nice layout, as there is an abstract rep of the languages, and we lose the "translations" table which can get very wide. It's nice that the schema doesn't actually change to add a new language.

BUT there are certain invariants that need to be enforced. The main one is this:

There must only be one row in "site_text" for any given language referencing a given row of "base_text".

How can I enforce this? Also have to bear in mind that there COULD easily be two translators working on the same language. It is imperative that they are not able to simultaneously save a translation of the same base text.

I would also like to have a column in "languages", type boolean, called "is_base" - this says what the base language is. Here, only ONE row can have a true value. (Obviously it has default value of false and is not null).

Another invariant now comes in - the language referenced by every row of "site_text" MUST have "is_base" set to FALSE.

If anyone can tell me how best to express to handle this stuff in postgresql, I'd be grateful. Also general comments on whether this is a good schema or not are welcome.

regards, and thanks in advance

Daniel



--
Daniel McBrearty
email : danielmcbrearty at gmail.com
www.engoi.com : the multi - language vocab trainer
BTW : 0873928131

Re: advice on schema for multilingual text

From
Michael Glaesemann
Date:
On Apr 9, 2006, at 0:31 , Daniel McBrearty wrote:

> Hi
>
> I have a website that has multilingual text stored in the database.
> Currently I just have a flat table (lets called it "translations"),
> one row per text item, one column per language. This works OK, for
> now, but I am looking at a redesign. Mostly I want to keep
> information about the languages in teh db as well, so that look
> like an extra table, one row per lang.
>
> The problem that now arises is that there is an expected
> correlation between the "languages" and "translations" tables -
> there should be a row in languages for each column of translations.
> AFAIK (could well be wrong, I am no expert in db theory) there is
> no real way to express in the ddl. Not ideal.
>
> An alternative layout would now be to lose the "translations"
> table, and have two tables in place; one called "base_text"
> containing the text to be translated, and another called, say,
> "tx_text" which contains the translations. Each row of "tx_text"
> references both "base_text" and also "languages".
>
> This looks like a nice layout, as there is an abstract rep of the
> languages, and we lose the "translations" table which can get very
> wide. It's nice that the schema doesn't actually change to add a
> new language.
>
> BUT there are certain invariants that need to be enforced. The main
> one is this:
>
> There must only be one row in "site_text" for any given language
> referencing a given row of "base_text".

You'd want a unique index on (base_text, language), like:

create table tx_text
(
    base_text text not null
        references base_text(base_text)
    , language text not null
        references languages (language)
    , primary key (base_text, language)
    , tx_text text not null
);

> I would also like to have a column in "languages", type boolean,
> called "is_base" - this says what the base language is. Here, only
> ONE row can have a true value. (Obviously it has default value of
> false and is not null).

Here you want a partial unique index on languages where is_base is true
create table languages
(
    language text primary key
    , is_base boolean not null
);

create unique index languages_only_one_true_base_idx
on languages (is_base)
where is_base;


> Another invariant now comes in - the language referenced by every
> row of "site_text" MUST have "is_base" set to FALSE.

I can think of two ways to do this: one is to write a trigger to
enforce this, something like:

create function non_base_language_translations_check
returns trigger()
language plpgsql as $$
begin
    if exists (
        select *
        from tx_text
        natural join languages
        where not is_base
        )
    then
        raise exception
        'Language of translation text must not be a base language.';
    end if;
end;
$$;

Then use this function on triggers that fire on insert and update on
tx_text and on update on languages.

Another is to include the is_base column in tx_text (with both
language and is_base referencing languages) and use a check
constraint to make sure is_base is false:

create table tx_text
(
    base_text text not null
        references base_text(base_text)
    , language text not null
    , is_base boolean not null check not is_base
    , foreign key (language, is_base) references languages (language,
is_base)
        on update cascade
    , primary key (base_text, language)
    , tx_text text not null
);

The latter method is denormalized, which is not something I usually
recommend. I don't know how the constraint checking overhead of using
a trigger compares with using the foreign key and check constraint.

A third idea would be to have two languages tables: one with a single
row for the base language and another for the target languages (You
might even do this using table inheritance, though I haven't thought
this completely through). tx_text would reference the
target_languages table (or child table, as the case may be).

I'm sure others have opinions on this as well.

Hope this helps.

Michael Glaesemann
grzm myrealbox com




Re: advice on schema for multilingual text

From
Tino Wildenhain
Date:
Daniel McBrearty wrote:
> Hi
>
> I have a website that has multilingual text stored in the database.
> Currently I just have a flat table (lets called it "translations"), one
> row per text item, one column per language. This works OK, for now, but
> I am looking at a redesign. Mostly I want to keep information about the
> languages in teh db as well, so that look like an extra table, one row
> per lang.

Not sure about your base and site text ;) But I can tell you how I
solved a similar problem:

I used tables:

languages (language_id serial, description_id int4) references
descriptions (description_id)

descriptions (description_id serial,acronym text)

descriptiontexts (description_id int4, language_id int4, description
text) references descriptions (description_id), languages (language_id)

(pseudocode)

so the languages table keeps all the languages I use in the database,
(as ids) for example:

languages
language_id | description_id
          1                1
          2                2
          3                3

descriptions
description_id | acronym
             1   lang_en
             2   lang_de
             3   lang_ru

discriptiontexts
description_id | language_id | description
             1             1   english
             1             2   Englisch
             2             1   german
             2             2   Deutsch

thats for the self documentation - now you can
translate sentences (give them an acronym for
use, like 'customer_basket_close' and let your
translaters translate them with respective language_id
into descriptiontexts table. This way for a given
expression (description_id) there can only be one
entry per language.

HTH
Tino Wildenhain

Re: advice on schema for multilingual text

From
Karsten Hilbert
Date:
On Sun, May 21, 2006 at 06:55:18PM +0200, Tino Wildenhain wrote:

> > I have a website that has multilingual text stored in the database.
> > Currently I just have a flat table (lets called it "translations"), one
> > row per text item, one column per language. This works OK, for now, but
> > I am looking at a redesign. Mostly I want to keep information about the
> > languages in teh db as well, so that look like an extra table, one row
> > per lang.

I have re-implemented gettext in (pl/pg(SQL)) if you are
interested. The language for a particular db user can be
switched on the fly (but it's bound to the user so is the
same for all concurrent connections of this user which may
not be useful for web apps).

It roughly works like so:

 select set_curr_lang('de_DE');

to set the language for the current db user to de_DE
(German). Then

 select i18n.i18n('some text');

to insert a string to be translated. Then

 select i18n_upd_tx('some text', 'etwas Text', 'de_DE');

to insert the translation (for locale de_DE). Then in your
queries do

 select _(some_column) from ... where ...;

and it will return the text from some_column either
translated into German or in the language it was inserted
into the database in if there's no translation available.

I routinely use this in views where I do

create view ... as
select
    ...
    label,
    _(label) as l10n_label,
    ...
from
    ...
;

All this is isolated into a schema "i18n":

 http://cvs.savannah.gnu.org/viewcvs/gnumed/gnumed/server/sql/gmI18N.sql?rev=1.25&root=gnumed&view=log
 http://cvs.savannah.gnu.org/viewcvs/gnumed/gnumed/server/sql/gmI18N-dynamic.sql?rev=1.3&root=gnumed&view=log

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346