Thread: Multilingual application, ORDER BY w/ different locales?
Hi! We are using postgres as RDBMS in an web application that is translated to about a dozen different languages. Some users get puzzled about the sorting order in lists, since we have to choose only one locale for all ORDER BY queries. I am dreaming of a SET LC_COLLATE or simliar command that will only affect my session, not all other users. I know this is not implemented in postgres. How impossible is it to add this feature, and what implications would pg suffer? All discussions regarding locale problems in postgres are about LIKE indexing. For us, collating is more important. Can we help? /Palle
Palle Girgensohn <girgen@partitur.se> writes: > I am dreaming of a SET LC_COLLATE or simliar command that will > only affect my session, not all other users. > I know this is not implemented in postgres. How impossible is it to add > this feature, and what implications would pg suffer? Actually, what the SQL spec suggests is that LOCALE be attached to individual table columns. A SET command to cause LOCALE to change on the fly within a session is quite impractical: that would mean that the sort ordering of existing columns changes, which would mean that any indexes on those columns are broken. Per-column LOCALE is on the to-do list. In my mind the main difficulty with it is that the standard C library doesn't really support concurrent use of multiple locales: it's built around the assumption that you set your locale once at program startup. setlocale() is, typically, not a fast operation. To get around this it seems we'd need to write our own set of locale library routines, which is a daunting amount of work. I think the last time this came up, someone mentioned that there's an open BSD-license locale library being worked on, which possibly we could adapt instead of reinventing this wheel for ourselves. But I don't recall more than that. Check the archives. regards, tom lane
--On Saturday, November 17, 2001 13:39:36 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote: > Palle Girgensohn <girgen@partitur.se> writes: >> I am dreaming of a SET LC_COLLATE or simliar command that will >> only affect my session, not all other users. >> I know this is not implemented in postgres. How impossible is it to add >> this feature, and what implications would pg suffer? > > Actually, what the SQL spec suggests is that LOCALE be attached to > individual table columns. A SET command to cause LOCALE to change > on the fly within a session is quite impractical: that would mean > that the sort ordering of existing columns changes, which would mean > that any indexes on those columns are broken. OK, indexes and sort ordering are coupled, and must be? In that case, I see the problem. > Per-column LOCALE is on the to-do list. My need is really to get different sorting on *the same* column, depending on which locale the present user prefers. Collation can be quite different in Swedish, English, German or Frencn, for example. Our users can chose the language they prefer from a list, and since it is a web app, all languages are used simultaneously on the same system, and since we use a database session pool, different langs can be preferred att different times in the same database session. So, in this case there is no need for per-column locale; we really need to be able to shift sorting order (ORDER BY only) "on-the-fly". I guess this is not even supported by the SQL standard, or any other RDBMS for that matter, right? > In my mind the main difficulty > with it is that the standard C library doesn't really support concurrent > use of multiple locales: it's built around the assumption that you set > your locale once at program startup. setlocale() is, typically, not > a fast operation. To get around this it seems we'd need to write our > own set of locale library routines, which is a daunting amount of work. > > I think the last time this came up, someone mentioned that there's an > open BSD-license locale library being worked on, which possibly we could > adapt instead of reinventing this wheel for ourselves. But I don't > recall more than that. Check the archives. Thanks, I will. Cheers, Palle
Palle Girgensohn <girgen@partitur.se> writes: >> Actually, what the SQL spec suggests is that LOCALE be attached to >> individual table columns. A SET command to cause LOCALE to change >> on the fly within a session is quite impractical: that would mean >> that the sort ordering of existing columns changes, which would mean >> that any indexes on those columns are broken. > OK, indexes and sort ordering are coupled, and must be? Well, the sort ordering of any particular index has to be well-defined, which means that there has to be a fixed locale associated with it. > My need is really to get different sorting on *the same* column, depending > on which locale the present user prefers. > ... I guess this is not even supported by the SQL standard, or > any other RDBMS for that matter, right? I believe SQL regards the locale as essentially a property of a datatype, which means that in theory you should be able to cast a column value to type text-with-locale-X and then ORDER BY that. It'd be an on-the-fly sort, not able to exploit any indexes, but it sounds like that's acceptable to you. Looking at the SQL92 spec, the name they actually give to this notion is COLLATE, not locale, but it does look like you can label a string expression with the collation type you want it to be sorted by. regards, tom lane
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes: > Would it be possible to make a function in plpgsql or whatever that > wrapped the collate changes and then order by that and make functional > indexes? Would the system use it? IIRC, we were debating whether we should consider collation to be an attribute of the datatype (think typmod) or an attribute of individual values (think field added to values of textual types). In the former case, a function like this would only work if we allowed its result to be declared as having the right collate attribute. Which is not impossible, but we don't currently associate any typmod with function arguments or results, and so I'm not sure how painful it would be. With the field-in-data-value approach it's easy to see how it would work. But another byte or word per text value might be a high price to pay ... regards, tom lane
On Sat, 17 Nov 2001, Tom Lane wrote: > Palle Girgensohn <girgen@partitur.se> writes: > > My need is really to get different sorting on *the same* column, depending > > on which locale the present user prefers. > > ... I guess this is not even supported by the SQL standard, or > > any other RDBMS for that matter, right? > > I believe SQL regards the locale as essentially a property of a > datatype, which means that in theory you should be able to cast a column > value to type text-with-locale-X and then ORDER BY that. It'd be an > on-the-fly sort, not able to exploit any indexes, but it sounds like > that's acceptable to you. Would it be possible to make a function in plpgsql or whatever that wrapped the collate changes and then order by that and make functional indexes? Would the system use it?
On Sat, 17 Nov 2001, Tom Lane wrote: > Stephan Szabo <sszabo@megazone23.bigpanda.com> writes: > > Would it be possible to make a function in plpgsql or whatever that > > wrapped the collate changes and then order by that and make functional > > indexes? Would the system use it? > > IIRC, we were debating whether we should consider collation to be an > attribute of the datatype (think typmod) or an attribute of individual > values (think field added to values of textual types). In the former > case, a function like this would only work if we allowed its result to > be declared as having the right collate attribute. Which is not > impossible, but we don't currently associate any typmod with function > arguments or results, and so I'm not sure how painful it would be. > With the field-in-data-value approach it's easy to see how it would > work. But another byte or word per text value might be a high price > to pay ... True. Although I wonder how things like substring would work in the model with typmods if the collation isn't attached in any fashion to the return values since I think the substring collation is supposed to be the same as the input string's, whereas for something like convert it's a different collation based on a parameter. I wonder if as a temporary thing, you could use a function that did something similar to strxfrm as long as you only used that for sorting purposes.
> IIRC, we were debating whether we should consider collation to be an > attribute of the datatype (think typmod) or an attribute of individual > values (think field added to values of textual types). In the former > case, a function like this would only work if we allowed its result to > be declared as having the right collate attribute. Which is not > impossible, but we don't currently associate any typmod with function > arguments or results, and so I'm not sure how painful it would be. > With the field-in-data-value approach it's easy to see how it would > work. But another byte or word per text value might be a high price > to pay ... I think the price is not so high. To give the collation info to text data types, it's enough to store the info in the pg_attribute. ie. only additional several bytes per column are required, not per instance. Of course we would need to add some extra bytes to the in-memory string data, it's just a temporary data anyway. -- Tatsuo Ishii
> --On Saturday, November 17, 2001 13:39:36 -0500 Tom Lane > <tgl@sss.pgh.pa.us> wrote: > >> In my mind the main difficulty >> with it is that the standard C library doesn't really support concurrent >> use of multiple locales: it's built around the assumption that you set >> your locale once at program startup. setlocale() is, typically, not >> a fast operation. To get around this it seems we'd need to write our >> own set of locale library routines, which is a daunting amount of work. >> >> I think the last time this came up, someone mentioned that there's an >> open BSD-license locale library being worked on, which possibly we could >> adapt instead of reinventing this wheel for ourselves. But I don't >> recall more than that. Check the archives. > I guess it must have been IBM's International Classes for Unicode at http://oss.software.ibm.com/icu/ It is quite big: Download File Size Description icu-1.8.1.zip <http://oss.software.ibm.com/icu/download/1.8.1/icu-1.8.1.zip> 7.3 MB ZIP file for Windows platforms icu-1.8.1.tgz <http://oss.software.ibm.com/icu/download/1.8.1/icu-1.8.1.tgz> 6.4 MB gzipped tar archive for Unix and other platforms icu-1.8.1-docs.zip <http://oss.software.ibm.com/icu/download/1.8.1/icu-1.8.1-docs.zip> 1.1 MB ZIP file with the API documentation icu-1.8.1-docs.tgz <http://oss.software.ibm.com/icu/download/1.8.1/icu-1.8.1-docs.tgz> 0.9 MB gzipped tar archive with the API documentation but I suspect that it would otherways be the easiest way to get a good internationalisation support. --------------- Hannu
Stephan Szabo wrote: >On Sat, 17 Nov 2001, Tom Lane wrote: > >>Stephan Szabo <sszabo@megazone23.bigpanda.com> writes: >> >>>Would it be possible to make a function in plpgsql or whatever that >>>wrapped the collate changes and then order by that and make functional >>>indexes? Would the system use it? >>> >>IIRC, we were debating whether we should consider collation to be an >>attribute of the datatype (think typmod) or an attribute of individual >>values (think field added to values of textual types). In the former >>case, a function like this would only work if we allowed its result to >>be declared as having the right collate attribute. Which is not >>impossible, but we don't currently associate any typmod with function >>arguments or results, and so I'm not sure how painful it would be. >>With the field-in-data-value approach it's easy to see how it would >>work. But another byte or word per text value might be a high price >>to pay ... >> > >True. Although I wonder how things like substring would work in the >model with typmods if the collation isn't attached in any fashion to >the return values since I think the substring collation is supposed >to be the same as the input string's, whereas for something like >convert it's a different collation based on a parameter. I wonder if >as a temporary thing, you could use a function that did something >similar to strxfrm as long as you only used that for sorting purposes. > That would mean a new datatype that such function returns CREATE FUNCTION text_with_collation(text,collation) RETURNS text_with_collation That would be sorted using the rules of that collation. This can currently be added in contrib, but should eventually go into core. The function itself is quite easy, but the collation is the part that can either be done by a) writing our own library b) using system locale (i think that locale switching is slow in default glibc , so the following can be slow too ORDER BY text_with_collation(t1,'et_EE'), text_with_collation(t1,'fr_CA') but Idoubt anybody uses it. c) using a third party library - at least IBM has one which is almost as big as whole postgreSQL ;) assuming that one backend needs mostl one locale at a time, I think that b) will be the easiest to implement, but this will clash with current locale support if it is compiled in so you have to be rapidly swithcing LC_COLLATE between the default and that of the current datum. so what we actually need is a system that will _not_ use locale-aware functions unless specifically told to do so by feeding it with text_with_locale values. --------------- Hannu ---------------- Hannu