Thread: Use arrays to store multilanguage texts

Use arrays to store multilanguage texts

From: Michal Táborský
I am wondering whether it's efficient to use text arrays to store
multilanguage information.

We used to do it like this:

CREATE TABLE product (
    id serial NOT NULL,
    price float4,
    ...
)

CREATE TABLE product_txt (
    product_id integer,
    language_id integer, -- 1=English, 2=Czech, ...
    name text,
    description text,
    ...
    PRIMARY KEY (product_id, language_id)
)

Then in queries for the English version we used joins:

SELECT product.*, product_txt.*
FROM product JOIN product_txt ON product.id=product_txt.product_id
WHERE product_txt.language_id=1

It works as it is supposed to, but of course there is some database
overhead, and mainly it is much more complicated to handle data this way.
Since 7.4 improved array support, I was thinking about using arrays for
storing multilanguage strings. Like this:

CREATE TABLE product (
    id serial NOT NULL,
    price float4,
    name text[],
    description text[]
)

Then I'd just do:

SELECT id, price, name[1], description[1] FROM product

Much simpler and IMHO faster (but I'm not a pg-hacker). I never had time
to test it much, but now we are going to build a new database with
multilanguage strings and I am seriously considering using arrays.

The only question which remains unanswered is how this works with indexes. I
mean, most of the queries on such a table are ordered by name, for
example. Therefore I'd do this:

SELECT id, price, name[1], description[1] FROM product ORDER BY name[1]

Is it possible to build an index which will be used in such a query? I
had no luck with CREATE INDEX product_name1 ON product (name[1]), but maybe
the syntax is just somehow different.

Are there any more drawbacks or limitations that maybe I am not aware
of which would discourage you from doing multilanguage support this
way? Should we consider this or stick to the old ways?

Thanks for your attention.

--
Michal Taborsky
http://www.taborsky.cz



Re: Use arrays to store multilanguage texts

From: Greg Stark
Michal Táborský <michal@taborsky.cz> writes:

> I am wondering whether it's efficient to use text arrays to store
> multilanguage information.
>
> We used to do it like this:

I went through almost exactly the same design evolution. My initial design had
dozens of "text_id" fields littered throughout the database and had a
"localized_text" table with the expansion of every text_id for every language.

It was a total headache.

>     name text[],
>     description text[]

That's exactly what I have now.

> Then I'd just do:
>
> SELECT id, price, name[1], description[1] FROM product

Well, I defined functions lang_?? for each language code, so that the index for
each language wasn't hard-coded into the application.
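
Roughly like this (a from-memory sketch, not my exact code; which
language sits in which array slot is whatever you pick):

CREATE OR REPLACE FUNCTION lang_en(text[]) RETURNS text AS '
    SELECT $1[1]  -- English assumed in slot 1
' LANGUAGE sql STRICT IMMUTABLE;

CREATE OR REPLACE FUNCTION lang_fr(text[]) RETURNS text AS '
    SELECT $1[2]  -- French assumed in slot 2
' LANGUAGE sql STRICT IMMUTABLE;

-- so the application says lang_en(name) instead of name[1]:
SELECT id, price, lang_en(name), lang_en(description) FROM product;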

> The only question which remains unanswered is how this works with indexes. I
> mean, most of the queries on such a table are ordered by name, for
> example. Therefore I'd do this:

Well, in my case I was never using an index on the localized text columns
anyway.

> SELECT id, price, name[1], description[1] FROM product ORDER BY name[1]
>
> Is it possible to build an index which will be used in such a query? I
> had no luck with CREATE INDEX product_name1 ON product (name[1]), but maybe
> the syntax is just somehow different.

It is in 7.4. You might need another set of parentheses to make it parse, like
"ON product ((name[1]))". In 7.3 you would need to make a function for each
language and index on lang_en(name). As I said, I think that's not a bad idea
anyway.
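
Concretely, something like this (untested as typed; lang_en being the
sort of per-language function I mentioned above):

-- 7.4: expression index, with the extra parentheses
CREATE INDEX product_name1 ON product ((name[1]));

-- 7.3: functional index on a per-language function
CREATE INDEX product_name_en ON product (lang_en(name));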

> Are there any more drawbacks or limitations that maybe I am not aware
> of which would discourage you from doing multilanguage support this
> way? Should we consider this or stick to the old ways?

My main worry is what happens if I have to add a new language. I'll have to
update every record of every table to add a new element to the arrays. There's
no centralized place that "knows" about all the translated texts; I have to
know all the tables and columns that will have to be fixed.
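
For example, adding a third language would mean something like this for
every table with translated columns (a sketch; I pad with the existing
first-language string as a placeholder, since arrays can't hold NULL
elements):

UPDATE product
   SET name = name || ARRAY[name[1]],
       description = description || ARRAY[description[1]];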

Similarly it's impossible to query the database for "tell me all the strings
that haven't been translated yet". I would have to do an individual query for
every column of this type.
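
I.e. one query of roughly this shape per translated column (French
assumed in slot 2; an out-of-range subscript just yields NULL):

SELECT id, name[1] FROM product WHERE name[2] IS NULL;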

Note also that in my case I'm entering the data for both languages (we only
have two so far) at the same time in the same form. Most forms in the
application just have two columns and the user has to fill out the same
information in each language.

If you wanted a more traditional l10n interface, where the user has to enter
translations for every string in the application, then again the inability to
pull out all the strings might be a problem.

But in my model the conveniences of updating both languages in the
same statement, of not having to worry about the order in which inserts
are done, and of not having to keep track of ids to maintain
relationships are huge. I can't imagine having continued on the other track.

The benefit of having the translation available immediately without having to
do an extra join depends a lot on the application. In my case I had dozens of
fields to look up and often in the midst of fairly complex queries. I was
worried about performance and also about it raising the likelihood of the
optimizer getting confused. Also it made utter messes of the queries.

--
greg

Re: Use arrays to store multilanguage texts

From: Joe Conway
Michal Táborský wrote:
> I am wondering whether it's efficient to use text arrays to store
> multilanguage information.

[...snip...]

> SELECT id, price, name[1], description[1] FROM product ORDER BY name[1]
>
> Is it possible to build an index which will be used in such a query? I
> had no luck with CREATE INDEX product_name1 ON product (name[1]), but maybe
> the syntax is just somehow different.

Maybe something like this:

CREATE TABLE product (
    id serial NOT NULL,
    price float4,
    name text[],
    description text[]
);


insert into product (price, name, description)
  values (10,
          array['apples-english','apples-spanish','apples-german'],
          array['big bunch of apples-english','...-spanish','...-german']
  );


insert into product (price, name, description)
  values (42,
          array['pears-english'],
          array['big bunch of pears-english']
  );


create or replace function get_lang(text) returns int as '
  select case
    when $1 = ''english'' then 1
    when $1 = ''spanish'' then 2
    when $1 = ''german'' then 3
    else 1
  end
' language sql strict immutable;


create or replace function get_lang_str(text[], int)
  returns text as '
    select coalesce($1[$2], $1[1])
' language sql strict immutable;


create index product_idx1_english on product (get_lang_str(name, get_lang('english')));
create index product_idx1_spanish on product (get_lang_str(name, get_lang('spanish')));
create index product_idx1_german on product (get_lang_str(name, get_lang('german')));

set enable_seqscan to off;
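-- (disabled here only so the planner picks the index on this tiny test table)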


explain analyze
  select
    id,
    price,
    get_lang_str(name,get_lang('spanish')) as name,
    get_lang_str(description,get_lang('spanish')) as description
  from
    product
  order by
    get_lang_str(name,get_lang('spanish'));
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using product_idx1_spanish on product  (cost=0.00..6.03 rows=2 width=72) (actual time=0.147..0.193 rows=2 loops=1)
 Total runtime: 0.246 ms
(2 rows)

select
   id,
   price,
   get_lang_str(name,get_lang('foo')) as name,
   get_lang_str(description,get_lang('foo')) as description
from
   product
order by
   get_lang_str(name,get_lang('foo'));
  id | price |      name      |         description
----+-------+----------------+-----------------------------
   1 |    10 | apples-english | big bunch of apples-english
   2 |    42 | pears-english  | big bunch of pears-english
(2 rows)

select
   id,
   price,
   get_lang_str(name,get_lang('spanish')) as name,
   get_lang_str(description,get_lang('spanish')) as description
from
   product
order by
   get_lang_str(name,get_lang('spanish'));
  id | price |      name      |        description
----+-------+----------------+----------------------------
   1 |    10 | apples-spanish | ...-spanish
   2 |    42 | pears-english  | big bunch of pears-english
(2 rows)


> Are there any more drawbacks or limitations that maybe I am not aware
> of which would discourage you from doing multilanguage support this
> way? Should we consider this or stick to the old ways?

Notice I was trying to be careful about cases where a bad language
string is used, or the array is missing languages other than English.
I.e., I picked English as the default language and would need to ensure
every array had at least that string.

I'm not sure how maintainable it will be. You'd have to play with it for
a while and decide for yourself.

It would be interesting to see a speed comparison in a real-life
application -- so if you give it a try, please let us know how it turns out.

HTH,

Joe


Re: Use arrays to store multilanguage texts

From: Karsten Hilbert
> > I am wondering whether it's efficient to use text arrays to store
> > multilanguage information.
We solved it like this:

http://savannah.gnu.org/cgi-bin/viewcvs/*checkout*/gnumed/gnumed/gnumed/server/sql/gmI18N.sql?rev=1.15

- i18n_curr_lang holds the (ISO) language string per user
- i18n_keys holds the strings to be translated
- i18n_translations holds the translations
- function i18n() inserts a text value into i18n_keys, thus
  marking it for translation; we use this during insertion of
  data in the "primary language" (think gettext, think English)
- function _() returns a translated text value (or its
  primary-language equivalent); we use that in queries and view
  definitions
- v_missing_translations is a view of, well, ... (see the sketch
  below)
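
In outline it looks something like this (a condensed sketch, not the
exact definitions; see the linked file for those):

create table i18n_curr_lang (
    owner name default CURRENT_USER primary key,
    lang text not null              -- ISO language code, e.g. 'de'
);

create table i18n_keys (
    orig text primary key           -- string in the primary language
);

create table i18n_translations (
    lang text not null,
    orig text not null,
    trans text not null,
    primary key (lang, orig)
);

-- _() returns the translation for the calling user's language,
-- falling back to the primary-language string itself
create or replace function _(text) returns text as '
    select coalesce(
        (select trans from i18n_translations
          where orig = $1
            and lang = (select lang from i18n_curr_lang
                         where owner = CURRENT_USER)),
        $1)
' language sql stable;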

Then here:

http://savannah.gnu.org/cgi-bin/viewcvs/*checkout*/gnumed/gnumed/gnumed/server/locale/dump-missing-db_translations.py?rev=1.3

we've got a Python tool that will connect to your database and
create schema files ready for re-insertion via psql after
adding the missing translations. Kind of like a gettext po
file.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: Use arrays to store multilanguage texts

From: Greg Stark
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

> > > I am wondering whether it's efficient to use text arrays to store
> > > multilanguage information.
> We solved it like this:
>
> http://savannah.gnu.org/cgi-bin/viewcvs/*checkout*/gnumed/gnumed/gnumed/server/sql/gmI18N.sql?rev=1.15

This doesn't work for me. I get:

/cgi-bin/viewcvs/gnumed/gnumed/gnumed/serve: unknown location

> - i18n_curr_lang holds the (ISO) language string per user

I'm interested in this part. Is there a facility for having variables on the
postgres side that you can set once at the beginning of a session and then
refer to in functions?

I instead made my application insert lang_en, lang_fr, etc. into the queries as
appropriate. This made the application code a little uglier than necessary,
but it's not too bad; I put the function name in a variable and call
$xlate(column) in a double-quoted string.

I think having a global postgres variable would be a nice shortcut, though I
fear it would make it harder to do things like index lang_en() and lang_fr().

Do you cheat and make your _() IMMUTABLE? It's only immutable if you guarantee
that i18n_curr_lang can never change in the middle of a query. But even then
you're lying to the server, and indexing _() would do strange things, for
example. In fact I think it would produce corrupted indexes if it worked at
all.

> - v_missing_translations is a view of, well, ...

This is the one thing that the array solution can't do. But in my case it's
dynamic text and both languages are being provided at the same time by the
same people. I'm not going to have a translator going through batches of them
like gettext.

Actually I also have a separate solution for the static application text that
is going to be translated in the gettext style. In that case there is a
translations table much like gettext. However all the logic for lookups is
handled on the application side in a Smarty prefilter.

--
greg

Re: Use arrays to store multilanguage texts

From: Karsten Hilbert
http://savannah.gnu.org/cgi-bin/viewcvs/*checkout*/gnumed/gnumed/gnumed/server/sql/gmI18N.sql?rev=1.15

Works for me. You'll have to make sure you get the entire
string for the URL. It may have been wrapped by your mail
handling agent.

> > - i18n_curr_lang holds the (ISO) language string per user
>
> I'm interested in this part. Is there a facility for having variables on the
> postgres side that you can set once at the beginning of a session and then
> refer to in functions?
We set them the first time a user connects to a particular
database. Unless they are changed later (which is possible at any time,
with immediate, transparent switching of languages in queries and
views), the language will stay the same across
sessions/clients/connections.
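
With that layout, switching is just an update of the user's row
(simplified; the real definitions are in the linked file):

update i18n_curr_lang set lang = 'de' where owner = CURRENT_USER;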

> I instead made my application insert lang_en,lang_fr etc into the queries as
> appropriate. This made the application code a little uglier than necessary,
> but it's not too bad, I put the function name in a variable and call
> $xlate(column) in a double-quoted string.

We then use select _(fruits.name) style queries. The _()
function uses the language in i18n_curr_lang (for
CURRENT_USER).

> I think having a global postgres variable would be a nice shortcut, though I
> fear it would make it harder to do things like index lang_en() and lang_fr().
I'd think one could index i18n_translations just fine:

create unique index i18n_trans_xx on i18n_translations (trans) where lang = 'xx';

> Do you cheat and make your _() IMMUTABLE? It's only immutable if you guarantee
> that i18n_curr_lang can never change in the middle of a query.
Given that changing it takes a writing query, which should run in a
transaction, it shouldn't change within a read query. However,
I'm not too sure about views; we use translated columns
in views like this:

create view a_view as
select
    bla as bla,
    _(bla) as l10n_bla
...

> But even then
> you're lying to the server and indexing _() would do strange things, for
> example. In fact I think it would produce corrupted indexes if it worked at
> all.
I haven't experienced any of that, but it may well happen -
dunno.

> > - v_missing_translations is a view of, well, ...
>
> This is the one thing that the array solution can't do. But in my case it's
> dynamic text and both languages are being provided at the same time by the
> same people. I'm not going to have a translator going through batches of them
> like gettext.
Well, you wouldn't *have* to use that view; it's just for
convenience. You could just as well insert the dynamic text
into i18n_translations (lang, orig, trans) whenever dynamic
text is inserted. We mainly use that (+ i18n()) for gettext-style
batch translation of static application text, except
that all the lookup is handled by _() on the server right
inside the query :-)
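
I.e. at insert time you'd do something like this (hypothetical values,
table layout as sketched earlier):

insert into i18n_keys (orig) values ('big bunch of apples');
insert into i18n_translations (lang, orig, trans)
    values ('de', 'big bunch of apples', 'großer Haufen Äpfel');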

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346