Thread: Multilingual application, ORDER BY w/ different locales?

Multilingual application, ORDER BY w/ different locales?

From
Palle Girgensohn
Date:
Hi!

We are using postgres as RDBMS in an web application that is translated to 
about a dozen different languages. Some users get puzzled about the sorting 
order in lists, since we have to choose only one locale for all ORDER BY 
queries. I am dreaming of a SET LC_COLLATE or simliar command that will 
only affect my session, not all other users.

I know this is not implemented in postgres. How impossible is it to add 
this feature, and what implications would pg suffer? All discussions 
regarding locale problems in postgres are about LIKE indexing. For us, 
collating is more important. Can we help?

/Palle




Re: Multilingual application, ORDER BY w/ different locales?

From
Tom Lane
Date:
Palle Girgensohn <girgen@partitur.se> writes:
> I am dreaming of a SET LC_COLLATE or simliar command that will 
> only affect my session, not all other users.
> I know this is not implemented in postgres. How impossible is it to add 
> this feature, and what implications would pg suffer?

Actually, what the SQL spec suggests is that LOCALE be attached to
individual table columns.  A SET command to cause LOCALE to change
on the fly within a session is quite impractical: that would mean
that the sort ordering of existing columns changes, which would mean
that any indexes on those columns are broken.

Per-column LOCALE is on the to-do list.  In my mind the main difficulty
with it is that the standard C library doesn't really support concurrent
use of multiple locales: it's built around the assumption that you set
your locale once at program startup.  setlocale() is, typically, not
a fast operation.  To get around this it seems we'd need to write our
own set of locale library routines, which is a daunting amount of work.

I think the last time this came up, someone mentioned that there's an
open BSD-license locale library being worked on, which possibly we could
adapt instead of reinventing this wheel for ourselves.  But I don't
recall more than that.  Check the archives.
        regards, tom lane


Re: Multilingual application, ORDER BY w/ different

From
Palle Girgensohn
Date:
--On Saturday, November 17, 2001 13:39:36 -0500 Tom Lane 
<tgl@sss.pgh.pa.us> wrote:

> Palle Girgensohn <girgen@partitur.se> writes:
>> I am dreaming of a SET LC_COLLATE or simliar command that will
>> only affect my session, not all other users.
>> I know this is not implemented in postgres. How impossible is it to add
>> this feature, and what implications would pg suffer?
>
> Actually, what the SQL spec suggests is that LOCALE be attached to
> individual table columns.  A SET command to cause LOCALE to change
> on the fly within a session is quite impractical: that would mean
> that the sort ordering of existing columns changes, which would mean
> that any indexes on those columns are broken.

OK, indexes and sort ordering are coupled, and must be? In that case, I see 
the problem.

> Per-column LOCALE is on the to-do list.

My need is really to get different sorting on *the same* column, depending 
on which locale the present user prefers. Collation can be quite different 
in Swedish, English, German or Frencn, for example. Our users can chose the 
language they prefer from a list, and since it is a web app, all languages 
are used simultaneously on the same system, and since we use a database 
session pool, different langs can be preferred att different times in the 
same database session. So, in this case there is no need for per-column 
locale; we really need to be able to shift sorting order (ORDER BY only) 
"on-the-fly". I guess this is not even supported by the SQL standard, or 
any other RDBMS for that matter, right?

> In my mind the main difficulty
> with it is that the standard C library doesn't really support concurrent
> use of multiple locales: it's built around the assumption that you set
> your locale once at program startup.  setlocale() is, typically, not
> a fast operation.  To get around this it seems we'd need to write our
> own set of locale library routines, which is a daunting amount of work.
>
> I think the last time this came up, someone mentioned that there's an
> open BSD-license locale library being worked on, which possibly we could
> adapt instead of reinventing this wheel for ourselves.  But I don't
> recall more than that.  Check the archives.

Thanks, I will.

Cheers,
Palle



Re: Multilingual application, ORDER BY w/ different locales?

From
Tom Lane
Date:
Palle Girgensohn <girgen@partitur.se> writes:
>> Actually, what the SQL spec suggests is that LOCALE be attached to
>> individual table columns.  A SET command to cause LOCALE to change
>> on the fly within a session is quite impractical: that would mean
>> that the sort ordering of existing columns changes, which would mean
>> that any indexes on those columns are broken.

> OK, indexes and sort ordering are coupled, and must be?

Well, the sort ordering of any particular index has to be well-defined,
which means that there has to be a fixed locale associated with it.

> My need is really to get different sorting on *the same* column, depending 
> on which locale the present user prefers.
> ... I guess this is not even supported by the SQL standard, or 
> any other RDBMS for that matter, right?

I believe SQL regards the locale as essentially a property of a
datatype, which means that in theory you should be able to cast a column
value to type text-with-locale-X and then ORDER BY that.  It'd be an
on-the-fly sort, not able to exploit any indexes, but it sounds like
that's acceptable to you.

Looking at the SQL92 spec, the name they actually give to this notion
is COLLATE, not locale, but it does look like you can label a string
expression with the collation type you want it to be sorted by.
        regards, tom lane


Re: Multilingual application, ORDER BY w/ different locales?

From
Tom Lane
Date:
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> Would it be possible to make a function in plpgsql or whatever that
> wrapped the collate changes and then order by that and make functional
> indexes?  Would the system use it?

IIRC, we were debating whether we should consider collation to be an
attribute of the datatype (think typmod) or an attribute of individual
values (think field added to values of textual types).  In the former
case, a function like this would only work if we allowed its result to
be declared as having the right collate attribute.  Which is not
impossible, but we don't currently associate any typmod with function
arguments or results, and so I'm not sure how painful it would be.
With the field-in-data-value approach it's easy to see how it would
work.  But another byte or word per text value might be a high price
to pay ...
        regards, tom lane


Re: Multilingual application, ORDER BY w/ different locales?

From
Stephan Szabo
Date:
On Sat, 17 Nov 2001, Tom Lane wrote:

> Palle Girgensohn <girgen@partitur.se> writes:
> > My need is really to get different sorting on *the same* column, depending
> > on which locale the present user prefers.
> > ... I guess this is not even supported by the SQL standard, or
> > any other RDBMS for that matter, right?
>
> I believe SQL regards the locale as essentially a property of a
> datatype, which means that in theory you should be able to cast a column
> value to type text-with-locale-X and then ORDER BY that.  It'd be an
> on-the-fly sort, not able to exploit any indexes, but it sounds like
> that's acceptable to you.

Would it be possible to make a function in plpgsql or whatever that
wrapped the collate changes and then order by that and make functional
indexes?  Would the system use it?



Re: Multilingual application, ORDER BY w/ different locales?

From
Stephan Szabo
Date:
On Sat, 17 Nov 2001, Tom Lane wrote:

> Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> > Would it be possible to make a function in plpgsql or whatever that
> > wrapped the collate changes and then order by that and make functional
> > indexes?  Would the system use it?
>
> IIRC, we were debating whether we should consider collation to be an
> attribute of the datatype (think typmod) or an attribute of individual
> values (think field added to values of textual types).  In the former
> case, a function like this would only work if we allowed its result to
> be declared as having the right collate attribute.  Which is not
> impossible, but we don't currently associate any typmod with function
> arguments or results, and so I'm not sure how painful it would be.
> With the field-in-data-value approach it's easy to see how it would
> work.  But another byte or word per text value might be a high price
> to pay ...

True.  Although I wonder how things like substring would work in the
model with typmods if the collation isn't attached in any fashion to
the return values since I think the substring collation is supposed
to be the same as the input string's, whereas for something like
convert it's a different collation based on a parameter. I wonder if
as a temporary thing, you could use a function that did something
similar to strxfrm as long as you only used that for sorting purposes.




Re: Multilingual application, ORDER BY w/ different

From
Tatsuo Ishii
Date:
> IIRC, we were debating whether we should consider collation to be an
> attribute of the datatype (think typmod) or an attribute of individual
> values (think field added to values of textual types).  In the former
> case, a function like this would only work if we allowed its result to
> be declared as having the right collate attribute.  Which is not
> impossible, but we don't currently associate any typmod with function
> arguments or results, and so I'm not sure how painful it would be.
> With the field-in-data-value approach it's easy to see how it would
> work.  But another byte or word per text value might be a high price
> to pay ...

I think the price is not so high. To give the collation info to text
data types, it's enough to store the info in the
pg_attribute. ie. only additional several bytes per column are
required, not per instance. Of course we would need to add some extra
bytes to the in-memory string data, it's just a temporary data anyway.
--
Tatsuo Ishii



Re: Multilingual application, ORDER BY w/ different

From
Hannu Krosing
Date:
> --On Saturday, November 17, 2001 13:39:36 -0500 Tom Lane 
> <tgl@sss.pgh.pa.us> wrote:
>
>> In my mind the main difficulty
>> with it is that the standard C library doesn't really support concurrent
>> use of multiple locales: it's built around the assumption that you set
>> your locale once at program startup.  setlocale() is, typically, not
>> a fast operation.  To get around this it seems we'd need to write our
>> own set of locale library routines, which is a daunting amount of work.
>>
>> I think the last time this came up, someone mentioned that there's an
>> open BSD-license locale library being worked on, which possibly we could
>> adapt instead of reinventing this wheel for ourselves.  But I don't
>> recall more than that.  Check the archives. 
>
I guess it must have been IBM's International Classes for Unicode at
http://oss.software.ibm.com/icu/

It is quite big:

       Download

File   Size   Description
icu-1.8.1.zip 
<http://oss.software.ibm.com/icu/download/1.8.1/icu-1.8.1.zip>   7.3 MB  ZIP file for Windows platforms
icu-1.8.1.tgz 
<http://oss.software.ibm.com/icu/download/1.8.1/icu-1.8.1.tgz>   6.4 MB  gzipped tar archive for Unix and other
platforms
icu-1.8.1-docs.zip 
<http://oss.software.ibm.com/icu/download/1.8.1/icu-1.8.1-docs.zip>   
1.1 MB   ZIP file with the API documentation
icu-1.8.1-docs.tgz 
<http://oss.software.ibm.com/icu/download/1.8.1/icu-1.8.1-docs.tgz>   
0.9 MB   gzipped tar archive with the API documentation


but I suspect that it would otherways be the easiest way to get a good 
internationalisation support.

---------------
Hannu





Re: Multilingual application, ORDER BY w/ different locales?

From
Hannu Krosing
Date:

Stephan Szabo wrote:

>On Sat, 17 Nov 2001, Tom Lane wrote:
>
>>Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
>>
>>>Would it be possible to make a function in plpgsql or whatever that
>>>wrapped the collate changes and then order by that and make functional
>>>indexes?  Would the system use it?
>>>
>>IIRC, we were debating whether we should consider collation to be an
>>attribute of the datatype (think typmod) or an attribute of individual
>>values (think field added to values of textual types).  In the former
>>case, a function like this would only work if we allowed its result to
>>be declared as having the right collate attribute.  Which is not
>>impossible, but we don't currently associate any typmod with function
>>arguments or results, and so I'm not sure how painful it would be.
>>With the field-in-data-value approach it's easy to see how it would
>>work.  But another byte or word per text value might be a high price
>>to pay ...
>>
>
>True.  Although I wonder how things like substring would work in the
>model with typmods if the collation isn't attached in any fashion to
>the return values since I think the substring collation is supposed
>to be the same as the input string's, whereas for something like
>convert it's a different collation based on a parameter. I wonder if
>as a temporary thing, you could use a function that did something
>similar to strxfrm as long as you only used that for sorting purposes.
>
That would mean a new datatype that such function returns

CREATE FUNCTION text_with_collation(text,collation) RETURNS 
text_with_collation

That would be sorted using the rules of that collation.

This can currently be added in contrib, but should eventually go into core.

The function itself is quite easy, but the collation is the part that 
can either be done by
a) writing our own library

b) using system locale (i think that locale switching is slow in default 
glibc , so the following can be slow too ORDER BY text_with_collation(t1,'et_EE'), text_with_collation(t1,'fr_CA') but
Idoubt anybody uses it.
 

c) using a third party library - at least IBM has one which is almost as 
big as whole postgreSQL ;)

assuming that one backend needs mostl one locale at a time, I think that 
b) will be the easiest to
implement, but this will clash with current locale support if it is 
compiled in  so you have to be
rapidly swithcing LC_COLLATE between the default and that of the current 
datum.

so what we actually need is a system that will _not_ use locale-aware 
functions unless specifically
told to do so by feeding it with text_with_locale values.

---------------
Hannu










----------------
Hannu