Thread: Use arrays to store multilanguage texts
I am wondering if it's effective to use text arrays to store multilanguage information.

We used to do it like this:

CREATE TABLE product (
  id serial NOT NULL,
  price float4,
  ...
);

CREATE TABLE product_txt (
  product_id integer,
  language_id integer,  -- 1=English, 2=Czech, ...
  name text,
  description text,
  ...
  PRIMARY KEY (product_id, language_id)
);

Then in queries for the English version we used joins:

SELECT product.*, product_txt.*
  FROM product
  JOIN product_txt ON product.id = product_txt.product_id
 WHERE product_txt.language_id = 1;

It works as it's supposed to, but of course there is some database overhead, and mainly it is much more complicated to handle data this way.

Since 7.4 introduced arrays, I was thinking about using them for storing multilanguage strings. Like this:

CREATE TABLE product (
  id serial NOT NULL,
  price float4,
  name text[],
  description text[]
);

Then I'd just do:

SELECT id, price, name[1], description[1] FROM product;

Much simpler and IMHO faster (but I'm not a pg-hacker). I never had time to test it much, but now we are going to build a new database with multilanguage strings and I am seriously considering using arrays.

The only question that remains unanswered is how it is with indexes. I mean--most of the queries on such a table are ordered by name, for example. Therefore I'd do this:

SELECT id, price, name[1], description[1] FROM product ORDER BY name[1];

Is it possible to build an index that will be used in such a query? I had no luck with CREATE INDEX product_name1 ON product (r[1]), but maybe the syntax is just somehow different.

Are there any more drawbacks or limitations that maybe I am not aware of, which would discourage you from doing the multilanguage support this way? Should we consider this or stick to the old ways?

Thanks for your attention.

-- 
Michal Taborsky
http://www.taborsky.cz
Michal Táborský <michal@taborsky.cz> writes:

> I am wondering if it's effective to use text arrays to store
> multilanguage information.
>
> We used to do it like this:

I went through almost exactly the same design evolution. My initial design had dozens of "text_id" fields littered throughout the database and had a "localized_text" table with the expansion of every text_id for every language. It was a total headache.

> name text[],
> description text[]

That's exactly what I have now.

> Then I'd just do:
>
> SELECT id, price, name[1], description[1] FROM product

Well, I defined functions lang_?? for each language code, so that the index of each language wasn't hard-coded into the application.

> The only question that remains unanswered is how it is with indexes. I
> mean--most of the queries on such a table are ordered by name, for
> example. Therefore I'd do this:

Well, in my case I was never using an index on the localized text columns anyway.

> SELECT id, price, name[1], description[1] FROM product ORDER BY name[1]
>
> Is it possible to build an index that will be used in such a query? I
> had no luck with CREATE INDEX product_name1 ON product (r[1]), but maybe
> the syntax is just somehow different.

It is in 7.4. You might need another set of parentheses to make it parse, like "ON product ((r[1]))". In 7.3 you would need to make a function for each language and index on lang_en(r). As I said, I think that's not a bad idea anyway.

> Are there any more drawbacks or limitations that maybe I am not aware
> of, which would discourage you from doing the multilanguage support this
> way? Should we consider this or stick to the old ways?

My main worry is what happens if I have to add a new language. I'll have to update every record of every table to add a new element to the arrays. There's no centralized place that "knows" about all the translated texts; I have to know all the tables and columns that will have to be fixed.
Similarly, it's impossible to query the database for "tell me all the strings that haven't been translated yet". I would have to do an individual query for every column of this type.

Note also that in my case I'm entering the data for both languages (we only have two so far) at the same time, in the same form. Most forms in the application just have two columns, and the user has to fill out the same information in each language. If you wanted a more traditional l10n interface where the user has to enter translations for every string in the application, then again the inability to pull out all the strings might be a problem.

But in my model the convenience of being able to update both languages in the same update, and of not having to worry about the order in which inserts are done or about keeping track of ids to ensure relationships, is huge. I can't imagine having continued on the other track.

The benefit of having the translation available immediately, without having to do an extra join, depends a lot on the application. In my case I had dozens of fields to look up, often in the midst of fairly complex queries. I was worried about performance, and also about raising the likelihood of the optimizer getting confused. Also, it made utter messes of the queries.

-- 
greg
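[Editor's note: the lang_?? approach Greg describes, together with the 7.4 doubled-parentheses syntax he mentions, might be sketched like this. The function bodies, index names, and language-to-position mapping (1 = English, 2 = French) are illustrative assumptions, not taken from his actual schema:]

```sql
-- Hypothetical sketch: one accessor function per language, so the
-- array position is not hard-coded into the application.
CREATE FUNCTION lang_en(text[]) RETURNS text AS
  'SELECT $1[1]' LANGUAGE sql IMMUTABLE;

CREATE FUNCTION lang_fr(text[]) RETURNS text AS
  'SELECT $1[2]' LANGUAGE sql IMMUTABLE;

-- 7.3-style functional index on the accessor:
CREATE INDEX product_name_en ON product (lang_en(name));

-- 7.4-style expression index (note the extra parentheses):
CREATE INDEX product_name1 ON product ((name[1]));

-- Queries then order by the accessor, which can use the index:
SELECT id, price, lang_en(name) FROM product ORDER BY lang_en(name);
```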
Michal Táborský wrote:

> I am wondering if it's effective to use text arrays to store
> multilanguage information.

[...snip...]

> SELECT id, price, name[1], description[1] FROM product ORDER BY name[1]
>
> Is it possible to build an index that will be used in such a query? I
> had no luck with CREATE INDEX product_name1 ON product (r[1]), but maybe
> the syntax is just somehow different.

Maybe something like this:

CREATE TABLE product (
  id serial NOT NULL,
  price float4,
  name text[],
  description text[]
);

insert into product (price, name, description)
values (10,
        array['apples-english','apples-spanish','apples-german'],
        array['big bunch of apples-english','...-spanish','...-german']);

insert into product (price, name, description)
values (42,
        array['pears-english'],
        array['big bunch of pears-english']);

create or replace function get_lang(text) returns int as '
  select case when $1 = ''english'' then 1
              when $1 = ''spanish'' then 2
              when $1 = ''german''  then 3
              else 1
         end
' language sql strict immutable;

create or replace function get_lang_str(text[], int) returns text as '
  select coalesce($1[$2], $1[1])
' language sql strict immutable;

create index product_idx1_english on product (get_lang_str(name, get_lang('english')));
create index product_idx1_spanish on product (get_lang_str(name, get_lang('spanish')));
create index product_idx1_german  on product (get_lang_str(name, get_lang('german')));

set enable_seqscan to off;

explain analyze
select id, price,
       get_lang_str(name, get_lang('spanish')) as name,
       get_lang_str(description, get_lang('spanish')) as description
  from product
 order by get_lang_str(name, get_lang('spanish'));

                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Index Scan using product_idx1_spanish on product  (cost=0.00..6.03 rows=2 width=72) (actual time=0.147..0.193 rows=2 loops=1)
 Total runtime: 0.246 ms
(2 rows)

select id, price,
       get_lang_str(name, get_lang('foo')) as name,
       get_lang_str(description, get_lang('foo')) as description
  from product
 order by get_lang_str(name, get_lang('foo'));

 id | price |      name      |         description
----+-------+----------------+-----------------------------
  1 |    10 | apples-english | big bunch of apples-english
  2 |    42 | pears-english  | big bunch of pears-english
(2 rows)

select id, price,
       get_lang_str(name, get_lang('spanish')) as name,
       get_lang_str(description, get_lang('spanish')) as description
  from product
 order by get_lang_str(name, get_lang('spanish'));

 id | price |      name      |         description
----+-------+----------------+----------------------------
  1 |    10 | apples-spanish | ...-spanish
  2 |    42 | pears-english  | big bunch of pears-english
(2 rows)

> Are there any more drawbacks or limitations that maybe I am not aware
> of, which would discourage you from doing the multilanguage support this
> way? Should we consider this or stick to the old ways?

Notice I was trying to be careful about cases where a bad language string is used, or the array is missing languages other than English. I.e. I picked English as the default language and would need to ensure every array had at least that string.

I'm not sure how maintainable it will be. You'd have to play with it for a while and decide for yourself. It would be interesting to see a speed comparison in a real-life application -- so if you give it a try, please let us know how it turns out.

HTH,

Joe
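[Editor's note: one maintainability cost worth spelling out (Joe hints at it, and Greg raised it earlier in the thread) is what adding a language later looks like under this design: the lookup function changes and every existing row grows. A hypothetical sketch against Joe's example schema; the 'french' slot and the back-fill values are assumptions:]

```sql
-- Hypothetical: adding French as language 4 means redefining get_lang()
-- and back-filling array columns in every affected table.
create or replace function get_lang(text) returns int as '
  select case when $1 = ''english'' then 1
              when $1 = ''spanish'' then 2
              when $1 = ''german''  then 3
              when $1 = ''french''  then 4
              else 1
         end
' language sql strict immutable;

-- Back-fill: append the new translation to each row (7.4 array
-- concatenation); rows left untouched fall back to English through
-- get_lang_str()'s coalesce($1[$2], $1[1]).
update product set name = name || array['apples-french'] where id = 1;
```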
> > I am wondering if it's effective to use text arrays to store
> > multilanguage information.

We solved it like this:

http://savannah.gnu.org/cgi-bin/viewcvs/*checkout*/gnumed/gnumed/gnumed/server/sql/gmI18N.sql?rev=1.15

- i18n_curr_lang holds the (ISO) language string per user
- i18n_keys holds the strings to be translated
- i18n_translations holds the translations
- function i18n() inserts a text value into i18n_keys, thus marking it
  for translation; this we use during insertion of data in the "primary
  language" (think gettext, think English)
- function _() returns a translated text value (or its primary-language
  equivalent); we use that in queries and view definitions
- v_missing_translations is a view of, well, ...

Then here:

http://savannah.gnu.org/cgi-bin/viewcvs/*checkout*/gnumed/gnumed/gnumed/server/locale/dump-missing-db_translations.py?rev=1.3

we've got a Python tool that will connect to your database and create schema files ready for re-insertion via psql after adding the missing translations. Kind of like a gettext po file.

Karsten
-- 
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
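[Editor's note: the linked gmI18N.sql is authoritative; the sketch below only follows the table and function names Karsten lists, and all column names and function bodies are guesses:]

```sql
-- Hypothetical reconstruction of the gettext-style schema described
-- above; column names and bodies are assumptions, not gmI18N.sql.
create table i18n_curr_lang (
  owner name primary key default CURRENT_USER,
  lang  text not null               -- ISO language code, e.g. 'de'
);

create table i18n_keys (
  orig text primary key             -- string in the primary language
);

create table i18n_translations (
  lang  text not null,
  orig  text not null references i18n_keys(orig),
  trans text not null,
  primary key (lang, orig)
);

-- i18n(): mark a string for translation while inserting
-- primary-language data, and return it unchanged.
create or replace function i18n(text) returns text as '
  insert into i18n_keys (orig)
    select $1 where not exists (select 1 from i18n_keys where orig = $1);
  select $1;
' language sql;

-- _(): look up the current user''s language and return the
-- translation, falling back to the primary-language string.
create or replace function _(text) returns text as '
  select coalesce(
    (select t.trans
       from i18n_translations t, i18n_curr_lang c
      where c.owner = CURRENT_USER
        and t.lang  = c.lang
        and t.orig  = $1),
    $1);
' language sql;
```

Queries and views can then select translated columns directly, e.g. "select _(fruits.name) from fruits".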
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

> > > I am wondering if it's effective to use text arrays to store
> > > multilanguage information.
>
> We solved it like this:
>
> http://savannah.gnu.org/cgi-bin/viewcvs/*checkout*/gnumed/gnumed/gnumed/server/sql/gmI18N.sql?rev=1.15

This doesn't work for me. I get:

  /cgi-bin/viewcvs/gnumed/gnumed/gnumed/serve: unknown location

> - i18n_curr_lang holds the (ISO) language string per user

I'm interested in this part. Is there a facility for having variables on the postgres side that you can set once at the beginning of a session and then refer to in functions?

I instead made my application insert lang_en, lang_fr etc. into the queries as appropriate. This made the application code a little uglier than necessary, but it's not too bad: I put the function name in a variable and call $xlate(column) in a double-quoted string.

I think having a global postgres variable would be a nice shortcut, though I fear it would make it harder to do things like index lang_en() and lang_fr().

Do you cheat and make your _() IMMUTABLE? It's only immutable if you guarantee that i18n_curr_lang can never change in the middle of a query. But even then you're lying to the server, and indexing _() would do strange things, for example. In fact I think it would produce corrupted indexes if it worked at all.

> - v_missing_translations is a view of, well, ...

This is the one thing that the array solution can't do. But in my case it's dynamic text, and both languages are being provided at the same time by the same people. I'm not going to have a translator going through batches of them like gettext.

Actually, I also have a separate solution for the static application text that is going to be translated in the gettext style. In that case there is a translations table much like gettext. However, all the logic for lookups is handled on the application side in a Smarty prefilter.

-- 
greg
> http://savannah.gnu.org/cgi-bin/viewcvs/*checkout*/gnumed/gnumed/gnumed/server/sql/gmI18N.sql?rev=1.15

Works for me. You'll have to make sure you get the entire string for the URL. It may have been wrapped by your mail handling agent.

> > - i18n_curr_lang holds the (ISO) language string per user
>
> I'm interested in this part. Is there a facility for having variables on the
> postgres side that you can set once at the beginning of a session and then
> refer to in functions?

We set them the first time a user connects to a particular database. Unless they are changed later (possible anytime, with immediate transparent switching of languages in queries and views), the language will stay the same across sessions/clients/connections.

> I instead made my application insert lang_en, lang_fr etc. into the queries as
> appropriate. This made the application code a little uglier than necessary,
> but it's not too bad: I put the function name in a variable and call
> $xlate(column) in a double-quoted string.

We then use "select _(fruits.name)" style queries. The _() function uses the language in i18n_curr_lang (for CURRENT_USER).

> I think having a global postgres variable would be a nice shortcut, though I
> fear it would make it harder to do things like index lang_en() and lang_fr().

I'd think one could index i18n_translations just fine:

  create unique index on i18n_translations(trans) where lang='xx';

> Do you cheat and make your _() IMMUTABLE? It's only immutable if you guarantee
> that i18n_curr_lang can never change in the middle of a query.

Given that changing it takes a writing query, which should run in a transaction, it shouldn't change within a read query. However, I'm not too sure about views; we use translated columns in views like this:

  create view a_view as select bla as bla, _(bla) as l10n_bla ...

> But even then
> you're lying to the server, and indexing _() would do strange things, for
> example. In fact I think it would produce corrupted indexes if it worked at
> all.
I haven't experienced any of that, but it may well happen -- dunno.

> > - v_missing_translations is a view of, well, ...
>
> This is the one thing that the array solution can't do. But in my case it's
> dynamic text, and both languages are being provided at the same time by the
> same people. I'm not going to have a translator going through batches of them
> like gettext.

Well, you wouldn't *have* to use that view; it's just for convenience. You could just as well insert the dynamic text into i18n_translations (lang, orig, trans) whenever dynamic text is inserted. We mainly use that (+ i18n()) for gettext-style batch translation of static application text. Except that all the lookup is handled by _() on the server, right inside the query :-)

Karsten
-- 
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
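[Editor's note: a "missing translations" view of the kind Karsten describes could be sketched as below. This is a hypothetical reconstruction, not the actual gmI18N.sql definition; it assumes the table names he lists and the i18n_translations (lang, orig, trans) columns he mentions:]

```sql
-- Hypothetical sketch: for every language any user has configured,
-- list the strings marked for translation that have no entry yet.
create view v_missing_translations as
select l.lang, k.orig
  from i18n_keys k
 cross join (select distinct lang from i18n_curr_lang) l
  left join i18n_translations t
    on t.lang = l.lang and t.orig = k.orig
 where t.orig is null;
```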