Thread: How to add locale support for each column?

How to add locale support for each column?

From
Mahmoud Taghizadeh
Date:
I have already sent this mail to the PATCHES mailing list and nobody replied!
I hope the hackers and developers will send their ideas.

Question: is this approach suitable for solving the need for a locale per
column in PostgreSQL?

There is a function attached to this mail; it is similar to nls_sort. It was
written by Karel Zak; I don't know whether he has tried to submit this code
or not.
 
In this distribution you will find the file nls_string.c. It contains the
definition of the function nls_string(text, text), which takes a string
parameter and a locale name and returns a string describing the ordering.
So you can run

  select * from table order by nls_string(name, 'en_US.UTF-8')

or

  select * from table order by nls_string(name, 'cs_CZ.UTF-8')

or

  select * from table order by nls_string(name, 'C')
 
The latest version is available at:
http://www.fi.muni.cz/~adelton/l10n/postgresql-nls-string/

M. Taghizadeh


Attachment

Re: How to add locale support for each column?

From
Peter Eisentraut
Date:
Mahmoud Taghizadeh wrote:
> I have already sent this mail to the PATCHES mailing list and nobody
> replied! I hope the hackers and developers will send their ideas.
>
> Question: is this approach suitable for solving the need for a locale
> per column in PostgreSQL?

I doubt that, because 1) it's nonstandard, and 2) switching the locale 
at run time is too expensive when using the system library.  It's a 
good workaround, though, if it satisfies the requirements of your 
application.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: How to add locale support for each column?

From
Greg Stark
Date:
Peter Eisentraut <peter_e@gmx.net> writes:

> 2) switching the locale at run time is too expensive when using the system
> library.

Fwiw I did some experiments with this and found it wasn't true. At least with
glibc I could switch the locale to what I wanted and back for every record in
a million record table with basically no noticeable effect on the query
execution time.

Of course not every system will have an equally speedy implementation. But I
don't think that's an argument for postgres to reimplement portions of the OS.
If the OS locale handling is slow on some OS's then postgres should just warn
its users that using locales on those OS's will be slow.

In any case I suspect more than just the glibc implementation cache locales in
memory at this point. It seems like just too obvious a feature and more and
more applications require locale switching to be fast anyways.

-- 
greg



Re: How to add locale support for each column?

From
Dennis Bjorklund
Date:
On 19 Sep 2004, Greg Stark wrote:

> don't think that's an argument for postgres to reimplement portions of the OS.
> If the OS locale handling is slow on some OS's then postgres should just warn
> its users that using locales on those OS's will be slow.
> 
> In any case I suspect more than just the glibc implementation cache locales in
> memory at this point. It seems like just too obvious a feature and more and
> more applications require locale switching to be fast anyways.

Still, we want the final solution to be what the SQL standard specifies. A
function like the one proposed is useful, however, until a standard SQL
solution is implemented. I've used a similar function for some projects and
it has worked well for me too.

I think such a function fits perfectly in contrib.

-- 
/Dennis Björklund



Re: How to add locale support for each column?

From
Greg Stark
Date:
Dennis Bjorklund <db@zigo.dhs.org> writes:

> Still, we want the final solution to be what the SQL standard specifies. A
> function like the one proposed is useful, however, until a standard SQL
> solution is implemented. I've used a similar function for some projects and
> it has worked well for me too.

I haven't read the standard in this area (and the sections I have read don't
lead me to think reading this section would be a trivial task), but from what
I've seen people describe here I think the standard behaviour will be nigh
useless anyway. From what I understand the standard has you declare a locale
per column. That would mean the entire column can only store strings from a
single locale.

So in my application I would need to declare a new set of columns in every
table every time I localize it for a new language. And worse, for
user-provided input where each input could be in the user's locale language I
wouldn't be able to store it in a properly normalized table where one column
stores the user's input in whatever locale he prefers.

I actually would prefer an interface like he describes. And I think it's
enlightening that everyone that tries their hand at this seems to come up with
an interface much like this. You say you did for example, and I did also when
I needed a strxfrm interface. (I discussed it on the list but strangely the
list archives search isn't finding it)

> I think such a function fits perfectly in contrib.

I would like to see all the locale handling functions people have proposed
over time gathered up and merged into a single coherent source base. There is
probably a lot of duplicated code, and some implementations could probably
benefit from copying code from others.

I don't think this is just a stop-gap solution though. I expect it would live
on in contrib and continue to be useful even if the standard locale handling
is ever implemented -- which will only happen if someone somewhere finds it
would actually be useful for their needs.

Possibly it would be useful to consider this as a low level interface. Then
define a localized data type that uses it to implement the standard behaviour.

-- 
greg



Re: How to add locale support for each column?

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> 2) switching the locale at run time is too expensive when using the system
>> library.

> Fwiw I did some experiments with this and found it wasn't true.

Really?

I was just in process of doing experiments using the attached test
program.  Here are some results (all using en_US/fr_FR or local
equivalent as the test locales, and all built with "gcc -O2").
The numbers are microseconds per loop iteration:
                            -DUSE_LOC   no USE_LOC   no USE_LOC, locale1=C

  HPUX 10.20                   170        52.5          1.9
  RHL 8 (glibc 2.2.93-5)       8.89       1.04          0.038
  FC2 (glibc 2.3.3-27)         5.16       0.44          0.028
  Darwin (OS X 10.3.5)         1199       1.75          0.35

(The third column is thrown in just to remind you of how awful standard
strcoll implementations are in themselves.)

These are on machines of widely varying horsepower, so the absolute
numbers shouldn't be compared across rows, but the general story holds:
setlocale should be considered to be at least an order of magnitude
slower than strcoll, and on non-glibc machines it can be a whole lot
worse than that.

> If the OS locale handling is slow on some OS's then postgres should just warn
> its users that using locales on those OS's will be slow.

I don't think we can take that attitude when the cost penalty involved
can be a couple of orders of magnitude.
        regards, tom lane


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>

int
main(int argc, char** argv)
{
    int     i, n, j = 0, k = 0;
    char   *locale1, *locale2;

    locale1 = argv[1];
    locale2 = argv[2];
    n = atoi(argv[3]);

    /* not ifdef'd, so that without USE_LOC we only test locale1 */
    if (setlocale(LC_COLLATE, locale1) == NULL)
    {
        printf("setlocale(LC_COLLATE, %s) failed\n", locale1);
        exit(1);
    }
    if (setlocale(LC_CTYPE, locale1) == NULL)
    {
        printf("setlocale(LC_CTYPE, %s) failed\n", locale1);
        exit(1);
    }

    for (i = 0; i < n; i++)
    {
#ifdef USE_LOC
        if (setlocale(LC_COLLATE, locale1) == NULL)
        {
            printf("setlocale(LC_COLLATE, %s) failed\n", locale1);
            exit(1);
        }
        if (setlocale(LC_CTYPE, locale1) == NULL)
        {
            printf("setlocale(LC_CTYPE, %s) failed\n", locale1);
            exit(1);
        }
#endif
        j = strcoll("foobarbaz", "foo bar bath");
#ifdef USE_LOC
        if (setlocale(LC_COLLATE, locale2) == NULL)
        {
            printf("setlocale(LC_COLLATE, %s) failed\n", locale2);
            exit(1);
        }
        if (setlocale(LC_CTYPE, locale2) == NULL)
        {
            printf("setlocale(LC_CTYPE, %s) failed\n", locale2);
            exit(1);
        }
#endif
        k = strcoll("foobarbaz", "foo bar bath");
    }

    printf("in %s strcoll gives %d\n", locale1, j);
#ifdef USE_LOC
    printf("in %s strcoll gives %d\n", locale2, k);
#endif
    return 0;
}


Re: How to add locale support for each column?

From
Stephan Szabo
Date:
On Sun, 19 Sep 2004, Greg Stark wrote:

>
> Dennis Bjorklund <db@zigo.dhs.org> writes:
>
> > Still, we want the final solution to be what the SQL standard specifies. A
> > function like the one proposed is useful, however, until a standard SQL
> > solution is implemented. I've used a similar function for some projects
> > and it has worked well for me too.
>
> I haven't read the standard in this area (and the sections I have read don't
> lead me to think reading this section would be a trivial task), but from what
> I've seen people describe here I think the standard behaviour will be nigh
> useless anyway. From what I understand the standard has you declare a locale
> per column. That would mean the entire column can only store strings from a
> single locale.

AFAICT, it's one default collation per column, but you can specify a
collation to use on sorts and groups and comparisons.

I think example statement parts would be like:

  ORDER BY name COLLATE de_DE
  WHERE name COLLATE de_DE < 'Smith'
  WHERE name < 'Smith' COLLATE de_DE

There are limitations; for example, I believe the following is an error:

  WHERE name COLLATE de_DE < 'Smith' COLLATE en_US

because both sides have different explicitly given collations.


Re: How to add locale support for each column?

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> I don't see how this is relevant though. One way or another postgres is going
> to have to sort strings in varying locales chosen at run-time. Comparing
> against strcoll's execution time without changing locales is a straw
> man.

No, it's not, because if we were to do our own implementation we could
avoid the need to do setlocale per se at all (inside the loop that is).
The trick is to expose the different locales as objects (pointed-to
structs) that can be passed to the strcoll function.  The fundamental
difficulty is not so much with the libc innards as with the lousy API.

> I see no reason to think Postgres's implementation of looking up xfrm rules
> for the specified locale will be any faster than the OS's.

The point is that we can move that lookup out of the inner loop.

> An order of magnitude change in the cost for strcoll isn't really relevant
> unless the cost for strcoll is significant to begin with. I suspect strcoll
> costs are currently dwarfed by the palloc costs to evaluate the expression
> already.

I don't have numbers for that at my fingertips, but I have measured it
in the past, and the comparison function *is* a significant part of the
total CPU cost in sorts.
        regards, tom lane


Re: How to add locale support for each column?

From
Dennis Bjorklund
Date:
On 19 Sep 2004, Greg Stark wrote:

> I've seen people describe here I think the standard behaviour will be nigh
> useless anyway. From what I understand the standard has you declare a locale
> per column.

Yes, you can define a default collation for a column if you want to, but
more importantly you can set the charset and encoding for a column.

> That would mean the entire column can only store strings from a
> single locale.

You store strings in a charset/encoding. The same charset can use
different collations when you compare strings later on.

> So in my application I would need to declare a new set of columns in every
> table every time I localize it for a new language.

Not at all, you can just tell it to use another collation in your query. 
The collation for the column is just the default. In your query you can do 
something like

SELECT ... ORDER BY foo COLLATE "sv_SE"

I'm not 100% sure of the syntax of the language identifier, but something
like that.

> I actually would prefer an interface like he describes. And I think it's
> enlightening that everyone that tries their hand at this seems to come
> up with an interface much like this. You say you did for example, and I
> did also when I needed a strxfrm interface.

This is just because functions are the way you extend PostgreSQL as a user.
It's much easier than adding language constructs.

The standard is clear (as clear as an SQL standard ever is :-) and most of
the other databases implement it. I don't think we should make up our own
system and have that as the implementation goal. Until we have the
standard solution, a simple function like the one discussed is useful, and
putting it in contrib is what I prefer. I would use it.

-- 
/Dennis Björklund



Re: How to add locale support for each column?

From
Greg Stark
Date:
Hm, ok, setting up a new database with the Conway implementation I see the
time penalty is not "negligible". However it is quite tolerable:

test=> explain analyze select * from test order by a;
                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Sort  (cost=69.83..72.33 rows=1000 width=32) (actual time=772.175..900.155 rows=206300 loops=1)
   Sort Key: a
   ->  Seq Scan on test  (cost=0.00..20.00 rows=1000 width=32) (actual time=0.009..203.362 rows=206300 loops=1)
 Total runtime: 1055.137 ms
(4 rows)

test=> explain analyze select * from test order by lower(a);
                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Sort  (cost=72.33..74.83 rows=1000 width=32) (actual time=1190.248..1318.607 rows=206300 loops=1)
   Sort Key: lower(a)
   ->  Seq Scan on test  (cost=0.00..22.50 rows=1000 width=32) (actual time=0.021..517.679 rows=206300 loops=1)
 Total runtime: 1565.564 ms
(4 rows)

test=> explain analyze select * from test order by strxfrm(a,'en_US');
                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Sort  (cost=72.33..74.83 rows=1000 width=32) (actual time=2135.535..2263.429 rows=206300 loops=1)
   Sort Key: strxfrm(a, 'en_US'::text)
   ->  Seq Scan on test  (cost=0.00..22.50 rows=1000 width=32) (actual time=0.060..1384.428 rows=206300 loops=1)
 Total runtime: 2516.981 ms
(4 rows)

test=> explain analyze select * from test order by strxfrm(a,'fr_CA');
                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Sort  (cost=72.33..74.83 rows=1000 width=32) (actual time=2155.991..2284.054 rows=206300 loops=1)
   Sort Key: strxfrm(a, 'fr_CA'::text)
   ->  Seq Scan on test  (cost=0.00..22.50 rows=1000 width=32) (actual time=18.455..1403.390 rows=206300 loops=1)
 Total runtime: 2538.886 ms
(4 rows)

-- 
greg



Re: How to add locale support for each column?

From
Tatsuo Ishii
Date:
> Mahmoud Taghizadeh wrote:
> > I have already sent this mail to the PATCHES mailing list and nobody
> > replied! I hope the hackers and developers will send their ideas.
> >
> > Question: is this approach suitable for solving the need for a locale
> > per column in PostgreSQL?
> 
> I doubt that, because 1) it's nonstandard, and 2) switching the locale 
> at run time is too expensive when using the system library.  It's a 
> good workaround, though, if it satisfies the requirements of your 
> application.

What about internal sorting, for example GROUP BY, UNION, merge join,
etc.? These still use the per-database-cluster locale, which might be
inconsistent with the per-column locale and might lead to unwanted
results, no?
--
Tatsuo Ishii


Re: How to add locale support for each column?

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > Peter Eisentraut <peter_e@gmx.net> writes:
> >> 2) switching the locale at run time is too expensive when using the system
> >> library.
>
> > Fwiw I did some experiments with this and found it wasn't true.
>
> Really?

We're following two different methodologies so the results aren't comparable.
I exposed strxfrm to postgres and then did a sort on strxfrm(col). The
resulting query times were slower than sorting on lower(col) by negligible
amounts.

I have the original code I wrote and Joe Conway's reimplementation of it using
setjmp/longjmp to protect against errors. However the list archives appear to
have been down that month so I've attached Joe Conway's implementation below.

> These are on machines of widely varying horsepower, so the absolute
> numbers shouldn't be compared across rows, but the general story holds:
> setlocale should be considered to be at least an order of magnitude
> slower than strcoll, and on non-glibc machines it can be a whole lot
> worse than that.

I don't see how this is relevant though. One way or another postgres is going
to have to sort strings in varying locales chosen at run-time. Comparing
against strcoll's execution time without changing locales is a straw
man. It's like comparing your tcp/ip bandwidth with the loopback interface's
bandwidth.

I see no reason to think Postgres's implementation of looking up xfrm rules
for the specified locale will be any faster than the OS's. We know some OS's
suck but some certainly don't.

Perhaps glibc's locale handling functions ought to be available as a separate
library users of those OS's could install -- if it isn't already.

> I don't think we can take that attitude when the cost penalty involved
> can be a couple of orders of magnitude.

Aside from the above complaint there's another problem with your methodology.
An order of magnitude change in the cost for strcoll isn't really relevant
unless the cost for strcoll is significant to begin with. I suspect strcoll
costs are currently dwarfed by the palloc costs to evaluate the expression
already.



Here's the implementation in postgres from Joe Conway btw:


Joe Conway wrote:
> What about something like this?

Oops! Forgot to restore error handling. See below:

Joe

> 8<--------------------------------
>
> #include <setjmp.h>
> #include <string.h>
>
> #include "postgres.h"
> #include "fmgr.h"
> #include "tcop/tcopprot.h"
> #include "utils/builtins.h"
>
> #define GET_STR(textp) \
>   DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(textp)))
> #define GET_BYTEA(str_) \
>   DatumGetTextP(DirectFunctionCall1(byteain, CStringGetDatum(str_)))
> #define MAX_BYTEA_LEN    0x3fffffff
>
> /*
>  * pg_strxfrm - Function to convert string similar to the strxfrm C
>  * function using a specified locale.
>  */
> extern Datum pg_strxfrm(PG_FUNCTION_ARGS);
> PG_FUNCTION_INFO_V1(pg_strxfrm);
>
> Datum
> pg_strxfrm(PG_FUNCTION_ARGS)
> {
>   char       *str = GET_STR(PG_GETARG_TEXT_P(0));
>   size_t      str_len = strlen(str);
>   char       *localestr = GET_STR(PG_GETARG_TEXT_P(1));
>   size_t      approx_trans_len = 4 + (str_len * 3);
>   char       *trans = (char *) palloc(approx_trans_len + 1);
>   size_t      actual_trans_len;
>   char       *oldlocale;
>   char       *newlocale;
>   sigjmp_buf  save_restart;
>
>   if (approx_trans_len > MAX_BYTEA_LEN)
>     elog(ERROR, "source string too long to transform");
>
>   oldlocale = setlocale(LC_COLLATE, NULL);
>   if (!oldlocale)
>     elog(ERROR, "setlocale failed to return a locale");
>
>   /* catch elog while locale is set other than the default */
>   memcpy(&save_restart, &Warn_restart, sizeof(save_restart));
>   if (sigsetjmp(Warn_restart, 1) != 0)
>   {
>     memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
>     newlocale = setlocale(LC_COLLATE, oldlocale);
>     if (!newlocale)
>       elog(PANIC, "setlocale failed to reset locale: %s", localestr);
>     siglongjmp(Warn_restart, 1);
>   }
>
>   newlocale = setlocale(LC_COLLATE, localestr);
>   if (!newlocale)
>     elog(ERROR, "setlocale failed to set a locale: %s", localestr);
>
>   actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);
>
>   /* if the buffer was not large enough, resize it and try again */
>   if (actual_trans_len >= approx_trans_len)
>   {
>     approx_trans_len = actual_trans_len + 1;
>     if (approx_trans_len > MAX_BYTEA_LEN)
>       elog(ERROR, "source string too long to transform");
>
>     trans = (char *) repalloc(trans, approx_trans_len + 1);
>     actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);
>
>     /* if the buffer still not large enough, punt */
>     if (actual_trans_len >= approx_trans_len)
>       elog(ERROR, "strxfrm failed, buffer insufficient");
>   }
>
>   newlocale = setlocale(LC_COLLATE, oldlocale);
>   if (!newlocale)
>     elog(PANIC, "setlocale failed to reset locale: %s", localestr);

   /* restore normal error handling */
   memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));

>
>   PG_RETURN_BYTEA_P(GET_BYTEA(trans));
> }
>
> 8<--------------------------------
>






--
greg

Re: How to add locale support for each column?

From
Mahmoud Taghizadeh
Date:
And what's the final result? Is this method suitable?
Is it possible for the PostgreSQL leaders to add this function to the PostgreSQL source code?

glibc does not support the Jalali, Hebrew, and other calendars, so PostgreSQL does not support them either. We are planning to write some simple functions to convert between different calendars.

What is the group's opinion?

Greg Stark <gsstark@mit.edu> wrote:

> [Greg's message of Sep 19, quoted in full above, snipped]

> ATTACHMENT part 2 message/rfc822
Date: Sat, 23 Aug 2003 09:48:49 -0700
From: Joe Conway
CC: Greg Stark, Stephan Szabo, Tom Lane, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Collation rules and multi-lingual databases

> [Joe Conway's pg_strxfrm implementation, quoted in full above, snipped]


With Regards,
--taghi



Re: How to add locale support for each column?

From
Stephan Szabo
Date:
On Sun, 19 Sep 2004, Greg Stark wrote:

> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > Greg Stark <gsstark@mit.edu> writes:
> > > Peter Eisentraut <peter_e@gmx.net> writes:
> > >> 2) switching the locale at run time is too expensive when using the system
> > >> library.
> >
> > > Fwiw I did some experiments with this and found it wasn't true.
> >
> > Really?
>
> We're following two different methodologies so the results aren't comparable.
> I exposed strxfrm to postgres and then did a sort on strxfrm(col). The
> resulting query times were slower than sorting on lower(col) by negligible
> amounts.

But shouldn't the comparison be against sorting on col, not lower(col)?
strxfrm(col) sorts seem comparable to col; strxfrm(lower(col)) sorts seem
comparable to lower(col). Some collations do treat 'A' and 'a' as
adjacent in sort order, but that's not a guarantee, so it's not valid to
say, "everywhere you'd use lower(col) you can use strxfrm instead."

And in the past numbers you sent, it looked like the amounts were: 1s for a
sort on col, 1.5s for a sort on lower(col), 2.5s for a sort on strxfrm(col).
That doesn't seem negligible to me, unless it doesn't grow linearly with the
number of rows. It also seems that if that was the only difference between
the queries, then the time for the strxfrm was significant compared to the
rest of the query time.

> > These are on machines of widely varying horsepower, so the absolute
> > numbers shouldn't be compared across rows, but the general story holds:
> > setlocale should be considered to be at least an order of magnitude
> > slower than strcoll, and on non-glibc machines it can be a whole lot
> > worse than that.
>
> I don't see how this is relevant though. One way or another postgres is going
> to have to sort strings in varying locales chosen at run-time. Comparing
> against strcoll's execution time without changing locales is a straw
> man. It's like comparing your tcp/ip bandwidth with the loopback interface's
> bandwidth.
>
> I see no reason to think Postgres's implementation of looking up xfrm rules
> for the specified locale will be any faster than the OS's. We know some OS's
> suck but some certainly don't.

But do you have to change locales per row or per sort? Presumably, a built
in implementation may be able to do the latter rather than the former.



Re: How to add locale support for each column?

From
Greg Stark
Date:
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

> But shouldn't the comparison be against sorting on col not lower(col)?
> strxfrm(col) sorts seem comparable to col, strxfrm(lower(col)) sorts seem
> comparable to lower(col). Some collations do treat 'A' and 'a' as
> adjacent in sort order, but that's not a guarantee, so it's not valid to
> say, "everywhere you'd use lower(col) you can use strxfrm instead."

Well, in my implementation strxfrm is a postgresql function. So I wanted to
compare it with an expression that had at least as much overhead as a
postgresql expression with a single function call.

> And in past numbers you sent, it looked like the amounts were: 1s for sort
> on col, 1.5s for sort on lower(col), 2.5s for sort on strxfrm(col).  That
> doesn't seem negligible to me

Right, I amended my "negligible" claim. It's a significant but reasonable
cost. A 1.5s delay on sorting 100k rows is certainly not the kind of
delay that would make the idea of switching locales intolerable.

> unless that doesn't grow linearly with the number of rows.

Well I was comparing sorting 206,000 rows. Even if it scales linearly, a 10s
delay on sorting 2M records isn't really fatal. I certainly wouldn't want to
remove the ability to sort using strcmp if the data is ascii or binary. But if
you're going to use locale collation order it's going to be slower. strxfrm
has to do quite a bit of work. Even a postgres-internal mechanism is going to
have to do that same work.

The only time you could save is the time it takes to look up "en_US" in a list
(or hash) of cached locales and switch a pointer. I suspect that's going to be
a small (but not negligible) portion of the overhead. I guess this is subject
to analysis; I'll try to do a gprof run at some point to answer that.


> > I see no reason to think Postgres's implementation of looking up xfrm rules
> > for the specified locale will be any faster than the OS's. We know some OS's
> > suck but some certainly don't.
> 
> But do you have to change locales per row or per sort? Presumably, a built
> in implementation may be able to do the latter rather than the former.

We certainly need the ability to change the locales per-row, in fact possibly
multiple times per row.

Consider

select en, fr from translations order by en, fr

Which is actually something reasonable I could have to do in my current
project.

However, changing locales should be nigh-instantaneous; it really ought to be
just a matter of changing a pointer. And in the API Tom foresees, it shouldn't
even need to happen.
The only cost of sorting on many locales (aside from the initial load) would
be in the reduced cache hit rate from using more locale tables.

-- 
greg



Re: How to add locale support for each column?

From
Stephan Szabo
Date:
On Sat, 25 Sep 2004, Greg Stark wrote:

> Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
>
> > But shouldn't the comparison be against sorting on col not lower(col)?
> > strxfrm(col) sorts seem comparable to col, strxfrm(lower(col)) sorts seem
> comparable to lower(col). Some collations do treat 'A' and 'a' as being
> adjacent in sort order, but that's not a guarantee, so it's not valid to
> > say, "everywhere you'd use lower(col) you can use strxfrm instead."
>
> Well, in my implementation strxfrm is a postgresql function. So I wanted to
> compare it with an expression that had at least as much overhead as a
> postgresql expression with a single function call.

I'd thought there was still a question of where such a thing would live?
If it's an external project or a contrib thing, the above might be true,
but if it's meant to be a truly supported internal builtin then the
function call cost is part of the implementation and is significant data
that cannot be thrown out.

> > And in past numbers you sent, it looked like the amounts were: 1s for sort
> > on col, 1.5s for sort on lower(col), 2.5s for sort on strxfrm(col).  That
> > doesn't seem negligible to me
>
> Right, I amended my "negligible" claim. It's a significant but reasonable
> speed. A 1.5s delay on sorting 100k rows is certainly not the kind of
> intolerable delay that would make the idea of switching locales intolerable.

Apparently the message I responded to hung around for a while before
getting to me, because the replies came out of order.

> > unless that doesn't grow linearly with the number of rows.
>
> Well I was comparing sorting 206,000 rows. Even if it scales linearly, a 10s
> delay on sorting 2M records isn't really fatal. I certainly wouldn't want to

I agree in general, but if part of this involves forcing "C" locale (see
my question at the end) and so any locale sorting is forced to do this,
then if a query in en_US currently takes 7 seconds, but now will take 17,
I think that's significant.

> remove the ability to sort using strcmp if the data is ascii or binary. But if
> you're going to use locale collation order it's going to be slower. strxfrm
> has to do quite a bit of work. Even a postgres-internal mechanism is going to
> have to do that same work.

Was your strxfrm comparison against a column comparison in "C" locale then
rather than one using en_US or some other such locale?

> > > I see no reason to think Postgres's implementation of looking up xfrm rules
> > > for the specified locale will be any faster than the OS's. We know some OS's
> > > suck but some certainly don't.
> >
> > But do you have to change locales per row or per sort? Presumably, a built
> > in implementation may be able to do the latter rather than the former.
>
> We certainly need the ability to change the locales per-row, in fact possibly
> multiple times per row.

But presumably we don't have to look up the locale each time, as you note.
I don't see how it matters whether our implementation is any faster than
the OS's if we simply do the lookup less often.  Now, as you also pointed
out, it may turn out that the times for the OS lookups after the first
*are* reasonably insignificant.

--
More importantly, do we know whether or not this function really
works properly in non-C locales?  Is the strxfrm result guaranteed to sort
correctly (using strcoll) in others?


Re: How to add locale support for each column?

From
Greg Stark
Date:
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

> I'd thought there was still a question of where such a thing would live?
> If it's an external project or a contrib thing, the above might be true,
> but if it's meant to be a truly supported internal builtin then the
> function call cost is part of the implementation and is significant data
> that cannot be thrown out.

Well, there seems to be a consensus that it would be good to have complete
locale handling as envisioned by the spec. But I don't see that as relevant to this
discussion. I'm comparing a function handling strxfrm with a function handling
lower() and with sorting on a column directly. The point was to demonstrate
that it was practical (if not ideal) to switch locales repeatedly, especially
when you take into account that *any* function will have some overhead
anyways. If it were built into postgres the overhead might be lower, but I
doubt by much, and in any case it's just not an option for me now.

> Aparently the message I responded to hung around for a while before
> getting to me because they came out of order.

That seems to be happening a lot lately.

> I agree in general, but if part of this involves forcing "C" locale (see
> my question at the end) and so any locale sorting is forced to do this,
> then if a query in en_US currently takes 7 seconds, but now will take 17,
> I think that's significant.

I compared against sorting in C locale. It would be interesting to know how
much of the penalty came from simply having to do strxfrm's work vs the
overhead of switching locales. The former is inevitable. *Any* implementation
of locale collation orders is going to have to do it. 

The latter is maybe something we can work on reducing, though not without
considerable cost in terms of code complexity. It will mean either lobbying
for API changes in libc or growing the codebase of postgres by the size of an
entire i18n package. I strongly suspect maintaining i18n packages turns out to
be a *lot* of work.

> Was your strxfrm comparison against a column comparison in "C" locale then
> rather than one using en_US or some other such locale?

C.

I could compare it against sorting in a database created in a given locale,
but I suspect I'll find gprof output more directly helpful.

> But we don't presumably have to look up the locale each time as you note.

The question is whether looking up the locale is significant compared to
executing strxfrm. I suspect it'll be significant, but not the majority of the
time. 

The real question is whether speeding up sorting by removing that overhead is
worth the complexity of abandoning libc.

I would strongly urge people to consider writing postgres support to assume
standard libc functionality. If we can convince glibc and BSD libc people to
add a more reasonable interface we can optionally use it, just as we do other
more modern interfaces to old features. 

If some platforms are just terminally braindead we should look for ways to
support people installing gnu libintl (or whatever the glibc i18n chunk is
called) separately and using it like we do libreadline, libkrb, or libz.

> More importantly, do we know whether or not this function really works
> properly in non-C locales? Is the strxfrm result guaranteed to sort
> correctly (using strcoll) in others?

Well, you wouldn't want to use strcoll at all there, just strcmp. Actually
Conway's reimplementation returns a bytea, which is probably more correct than
my original plan to return text. Though I should check whether postgres has to
do extra work to sort bytea data instead of varchar data, especially since
strxfrm should never return strings containing nuls.


-- 
greg