Thread: How to add locale support for each column?
I have already sent this mail to the PATCHES mailing list and nobody replied!
I hope the hackers and developers will send their ideas.

Question: is this approach suitable for solving the need for a per-column
locale in PostgreSQL?

There is a function attached to this mail. It is similar to nls_sort and was
written by Karel Zak; I don't know whether he has tried to submit this code.
In this distribution you will find the file nls_string.c. It contains the
definition of the function nls_string(text, text), which takes a string
parameter and a locale name and returns a string describing the ordering.
So you can run
select * from table order by nls_string(name, 'en_US.UTF-8')
or
select * from table order by nls_string(name, 'cs_CZ.UTF-8')
or
select * from table order by nls_string(name, 'C')
The latest version is available at:
http://www.fi.muni.cz/~adelton/l10n/postgresql-nls-string/
M. Taghizadeh
Mahmoud Taghizadeh wrote:
> I have already this mail to PATCHES mailing list and nobody replied
> me !!!., Hope the hackers and the developers send their ideas.
>
> Qustion: is this approach is suitable for solving the need for locale
> per column in PostgreSQL ?

I doubt that, because 1) it's nonstandard, and 2) switching the locale at
run time is too expensive when using the system library.  It's a good
workaround, though, if it satisfies the requirements of your application.

--
Peter Eisentraut   http://developer.postgresql.org/~petere/
Peter Eisentraut <peter_e@gmx.net> writes:
> 2) switching the locale at run time is too expensive when using the system
> library.

Fwiw I did some experiments with this and found it wasn't true. At least
with glibc I could switch the locale to what I wanted and back for every
record in a million record table with basically no noticeable effect on the
query execution time.

Of course not every system will have an equally speedy implementation. But I
don't think that's an argument for postgres to reimplement portions of the
OS. If the OS locale handling is slow on some OS's then postgres should just
warn its users that using locales on those OS's will be slow.

In any case I suspect more than just the glibc implementation cache locales
in memory at this point. It seems like just too obvious a feature, and more
and more applications require locale switching to be fast anyways.

--
greg
On 19 Sep 2004, Greg Stark wrote:
> don't think that's an argument for postgres to reimplement portions of the OS.
> If the OS locale handling is slow on some OS's then postgres should just warn
> its users that using locales on those OS's will be slow.
>
> In any case I suspect more than just the glibc implementation cache locales in
> memory at this point. It seems like just too obvious a feature and more and
> more applications require locale switching to be fast anyways.

Still, we want the final solution to be what the sql standard specifies. A
function like the one proposed is however useful until a standard sql
solution is implemented. I've used a similar function for some projects and
it has worked well for me also.

I think such a function fits perfectly in contrib.

--
/Dennis Björklund
Dennis Bjorklund <db@zigo.dhs.org> writes:
> Still, we want the final solution to be what the sql standard specifies. A
> function like the one proposed is however useful until a standard sql
> solution is implemented. I've used a similar function for some projects and
> it has worked well for me also.

I haven't read the standard in this area (and the sections I have read don't
lead me to think reading this section would be a trivial task), but from
what I've seen people describe here I think the standard behaviour will be
nigh useless anyway. From what I understand, the standard has you declare a
locale per column. That would mean the entire column can only store strings
from a single locale.

So in my application I would need to declare a new set of columns in every
table every time I localize it for a new language. And worse, for
user-provided input, where each input could be in the user's locale
language, I wouldn't be able to store it in a properly normalized table
where one column stores the user's input in whatever locale he prefers.

I actually would prefer an interface like he describes. And I think it's
enlightening that everyone that tries their hand at this seems to come up
with an interface much like this. You say you did for example, and I did
also when I needed a strxfrm interface. (I discussed it on the list but
strangely the list archives search isn't finding it.)

> I think such a function fits perfectly in contrib.

I would like to see all the locale handling functions people have proposed
over time gathered up and merged into a single coherent source base. There
is probably a lot of duplicate code, and some could probably benefit from
copying code from others.

I don't think this is just a stop-gap solution though. I expect it would
live on in contrib and continue to be useful even if the standard locale
handling is ever implemented -- which will only happen if someone somewhere
finds it would actually be useful for their needs.

Possibly it would be useful to consider this as a low level interface. Then
define a localized data type that uses it to implement the standard
behaviour.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> 2) switching the locale at run time is too expensive when using the system
>> library.

> Fwiw I did some experiments with this and found it wasn't true.

Really?  I was just in process of doing experiments using the attached test
program.  Here are some results (all using en_US/fr_FR or local equivalent
as the test locales, and all built with "gcc -O2").  The numbers are
microseconds per loop iteration:

                          -DUSE_LOC    no USE_LOC    no USE_LOC, locale1=C

HPUX 10.20                  170           52.5          1.9
RHL 8 (glibc 2.2.93-5)        8.89         1.04         0.038
FC2 (glibc 2.3.3-27)          5.16         0.44         0.028
Darwin (OS X 10.3.5)       1199            1.75         0.35

(The third column is thrown in just to remind you of how awful standard
strcoll implementations are in themselves.)

These are on machines of widely varying horsepower, so the absolute numbers
shouldn't be compared across rows, but the general story holds: setlocale
should be considered to be at least an order of magnitude slower than
strcoll, and on non-glibc machines it can be a whole lot worse than that.

> If the OS locale handling is slow on some OS's then postgres should just warn
> its users that using locales on those OS's will be slow.

I don't think we can take that attitude when the cost penalty involved can
be a couple of orders of magnitude.

			regards, tom lane


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>

int
main(int argc, char **argv)
{
    int     i, n, j = 0, k = 0;
    char   *locale1, *locale2;

    locale1 = argv[1];
    locale2 = argv[2];
    n = atoi(argv[3]);

    /* not ifdef'd, so that without USE_LOC we only test locale1 */
    if (setlocale(LC_COLLATE, locale1) == NULL)
    {
        printf("setlocale(LC_COLLATE, %s) failed\n", locale1);
        exit(1);
    }
    if (setlocale(LC_CTYPE, locale1) == NULL)
    {
        printf("setlocale(LC_CTYPE, %s) failed\n", locale1);
        exit(1);
    }

    for (i = 0; i < n; i++)
    {
#ifdef USE_LOC
        if (setlocale(LC_COLLATE, locale1) == NULL)
        {
            printf("setlocale(LC_COLLATE, %s) failed\n", locale1);
            exit(1);
        }
        if (setlocale(LC_CTYPE, locale1) == NULL)
        {
            printf("setlocale(LC_CTYPE, %s) failed\n", locale1);
            exit(1);
        }
#endif
        j = strcoll("foobarbaz", "foo bar bath");
#ifdef USE_LOC
        if (setlocale(LC_COLLATE, locale2) == NULL)
        {
            printf("setlocale(LC_COLLATE, %s) failed\n", locale2);
            exit(1);
        }
        if (setlocale(LC_CTYPE, locale2) == NULL)
        {
            printf("setlocale(LC_CTYPE, %s) failed\n", locale2);
            exit(1);
        }
#endif
        k = strcoll("foobarbaz", "foo bar bath");
    }

    printf("in %s strcoll gives %d\n", locale1, j);
#ifdef USE_LOC
    printf("in %s strcoll gives %d\n", locale2, k);
#endif
    return 0;
}
On Sun, 19 Sep 2004, Greg Stark wrote:
> Dennis Bjorklund <db@zigo.dhs.org> writes:
>
> > Still, we want the final solution to be what the sql standard specify. A
> > function as the proposed is however useful until a standard sql solution
> > is implemented. I've used a similar function for some projects and it have
> > worked well for me also.
>
> I haven't read the standard in this area (and the sections I have read don't
> lead me to think reading this section would be a trivial task), but from what
> I've seen people describe here I think the standard behaviour will be nigh
> useless anyway. From what I understand the standard has you declare a locale
> per column. That would mean the entire column can only store strings from a
> single locale.

AFAICT, it's one default collation per column, but you can specify a
collation to use on sorts and groups and comparisons. I think example
statement parts would be like:

 ORDER BY name COLLATE de_DE
 WHERE name COLLATE de_DE < 'Smith'
 WHERE name < 'Smith' COLLATE de_DE

There are limitations; for example, I believe the following is an error

 WHERE name COLLATE de_DE < 'Smith' COLLATE en_US

because both have different explicitly given collations.
Greg Stark <gsstark@mit.edu> writes:
> I don't see how this is relevant though. One way or another postgres is going
> to have to sort strings in varying locales chosen at run-time. Comparing
> against strcoll's execution time without changing locales is a straw man.

No, it's not, because if we were to do our own implementation we could avoid
the need to do setlocale per se at all (inside the loop, that is).  The
trick is to expose the different locales as objects (pointed-to structs)
that can be passed to the strcoll function.  The fundamental difficulty is
not so much with the libc innards as with the lousy API.

> I see no reason to think Postgres's implementation of looking up xfrm rules
> for the specified locale will be any faster than the OS's.

The point is that we can move that lookup out of the inner loop.

> An order of magnitude change in the cost for strcoll isn't really relevant
> unless the cost for strcoll is significant to begin with. I suspect strcoll
> costs are currently dwarfed by the palloc costs to evaluate the expression
> already.

I don't have numbers for that at my fingertips, but I have measured it in
the past, and the comparison function *is* a significant part of the total
CPU cost in sorts.

			regards, tom lane
On 19 Sep 2004, Greg Stark wrote:
> I've seen people describe here I think the standard behaviour will be nigh
> useless anyway. From what I understand the standard has you declare a locale
> per column.

Yes, you can define a default collation for a column if you want to, but
more importantly you can set the charset and encoding for a column.

> That would mean the entire column can only store strings from a
> single locale.

You store strings in a charset/encoding. The same charset can use different
collations when you compare strings later on.

> So in my application I would need to declare a new set of columns in every
> table every time I localize it for a new language.

Not at all, you can just tell it to use another collation in your query. The
collation for the column is just the default. In your query you can do
something like

 SELECT ... ORDER BY foo COLLATE "sv_SE"

I'm not 100% sure of the syntax of the language identifier, but something
like that.

> I actually would prefer an interface like he describes. And I think it's
> enlightening that everyone that tries their hand at this seems to come
> up with an interface much like this. You say you did for example, and I
> did also when I needed a strxfrm interface.

This is just because functions are the way you extend postgresql as a user.
It's much easier than adding language constructs.

The standard is clear (as clear as an sql standard ever is :-) and most of
the other databases implement it. I don't think we should make up our own
system and have that as the implementation goal. Until we have the standard
solution, a simple function like the one discussed is useful, and putting it
in contrib is what I prefer. I would use it.

--
/Dennis Björklund
Hm, ok, setting up a new database with the Conway implementation I see the
time penalty is not "negligible". However it is quite tolerable:

test=> explain analyze select * from test order by a;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Sort  (cost=69.83..72.33 rows=1000 width=32) (actual time=772.175..900.155 rows=206300 loops=1)
   Sort Key: a
   ->  Seq Scan on test  (cost=0.00..20.00 rows=1000 width=32) (actual time=0.009..203.362 rows=206300 loops=1)
 Total runtime: 1055.137 ms
(4 rows)

test=> explain analyze select * from test order by lower(a);
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Sort  (cost=72.33..74.83 rows=1000 width=32) (actual time=1190.248..1318.607 rows=206300 loops=1)
   Sort Key: lower(a)
   ->  Seq Scan on test  (cost=0.00..22.50 rows=1000 width=32) (actual time=0.021..517.679 rows=206300 loops=1)
 Total runtime: 1565.564 ms
(4 rows)

test=> explain analyze select * from test order by strxfrm(a, 'en_US');
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Sort  (cost=72.33..74.83 rows=1000 width=32) (actual time=2135.535..2263.429 rows=206300 loops=1)
   Sort Key: strxfrm(a, 'en_US'::text)
   ->  Seq Scan on test  (cost=0.00..22.50 rows=1000 width=32) (actual time=0.060..1384.428 rows=206300 loops=1)
 Total runtime: 2516.981 ms
(4 rows)

test=> explain analyze select * from test order by strxfrm(a, 'fr_CA');
                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Sort  (cost=72.33..74.83 rows=1000 width=32) (actual time=2155.991..2284.054 rows=206300 loops=1)
   Sort Key: strxfrm(a, 'fr_CA'::text)
   ->  Seq Scan on test  (cost=0.00..22.50 rows=1000 width=32) (actual time=18.455..1403.390 rows=206300 loops=1)
 Total runtime: 2538.886 ms
(4 rows)

--
greg
> Mahmoud Taghizadeh wrote:
> > I have already this mail to PATCHES mailing list and nobody replied
> > me !!!., Hope the hackers and the developers send their ideas.
> >
> > Qustion: is this approach is suitable for solving the need for locale
> > per column in PostgreSQL ?
>
> I doubt that, because 1) it's nonstandard, and 2) switching the locale
> at run time is too expensive when using the system library. It's a
> good workaround, though, if it satisfies the requirements of your
> application.

What about internal sorting, for example GROUP BY, UNION, merge join etc.?
These still use the per-database-cluster locale, which might be inconsistent
with the per-column locale and might lead to unwanted results, no?
--
Tatsuo Ishii
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Greg Stark <gsstark@mit.edu> writes:
> > Peter Eisentraut <peter_e@gmx.net> writes:
> >> 2) switching the locale at run time is too expensive when using the system
> >> library.
>
> > Fwiw I did some experiments with this and found it wasn't true.
>
> Really?

We're following two different methodologies so the results aren't
comparable. I exposed strxfrm to postgres and then did a sort on
strxfrm(col). The resulting query times were slower than sorting on
lower(col) by negligible amounts.

I have the original code I wrote and Joe Conway's reimplementation of it
using setjmp/longjmp to protect against errors. However the list archives
appear to have been down that month so I've attached Joe Conway's
implementation below.

> These are on machines of widely varying horsepower, so the absolute
> numbers shouldn't be compared across rows, but the general story holds:
> setlocale should be considered to be at least an order of magnitude
> slower than strcoll, and on non-glibc machines it can be a whole lot
> worse than that.

I don't see how this is relevant though. One way or another postgres is
going to have to sort strings in varying locales chosen at run-time.
Comparing against strcoll's execution time without changing locales is a
straw man. It's like comparing your tcp/ip bandwidth with the loopback
interface's bandwidth.

I see no reason to think Postgres's implementation of looking up xfrm rules
for the specified locale will be any faster than the OS's. We know some OS's
suck but some certainly don't. Perhaps glibc's locale handling functions
ought to be available as a separate library users of those OS's could
install -- if it isn't already.

> I don't think we can take that attitude when the cost penalty involved
> can be a couple of orders of magnitude.

Aside from the above complaint there's another problem with your
methodology. An order of magnitude change in the cost for strcoll isn't
really relevant unless the cost for strcoll is significant to begin with. I
suspect strcoll costs are currently dwarfed by the palloc costs to evaluate
the expression already.

Here's the implementation in postgres from Joe Conway btw:

Joe Conway wrote:
> What about something like this?

Oops! Forgot to restore error handling. See below:

Joe

> 8<--------------------------------
>
> #include <setjmp.h>
> #include <string.h>
>
> #include "postgres.h"
> #include "fmgr.h"
> #include "tcop/tcopprot.h"
> #include "utils/builtins.h"
>
> #define GET_STR(textp) \
>     DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(textp)))
> #define GET_BYTEA(str_) \
>     DatumGetTextP(DirectFunctionCall1(byteain, CStringGetDatum(str_)))
> #define MAX_BYTEA_LEN 0x3fffffff
>
> /*
>  * pg_strxfrm - Function to convert string similar to the strxfrm C
>  * function using a specified locale.
>  */
> extern Datum pg_strxfrm(PG_FUNCTION_ARGS);
> PG_FUNCTION_INFO_V1(pg_strxfrm);
>
> Datum
> pg_strxfrm(PG_FUNCTION_ARGS)
> {
>     char       *str = GET_STR(PG_GETARG_TEXT_P(0));
>     size_t      str_len = strlen(str);
>     char       *localestr = GET_STR(PG_GETARG_TEXT_P(1));
>     size_t      approx_trans_len = 4 + (str_len * 3);
>     char       *trans = (char *) palloc(approx_trans_len + 1);
>     size_t      actual_trans_len;
>     char       *oldlocale;
>     char       *newlocale;
>     sigjmp_buf  save_restart;
>
>     if (approx_trans_len > MAX_BYTEA_LEN)
>         elog(ERROR, "source string too long to transform");
>
>     oldlocale = setlocale(LC_COLLATE, NULL);
>     if (!oldlocale)
>         elog(ERROR, "setlocale failed to return a locale");
>
>     /* catch elog while locale is set other than the default */
>     memcpy(&save_restart, &Warn_restart, sizeof(save_restart));
>     if (sigsetjmp(Warn_restart, 1) != 0)
>     {
>         memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
>         newlocale = setlocale(LC_COLLATE, oldlocale);
>         if (!newlocale)
>             elog(PANIC, "setlocale failed to reset locale: %s", localestr);
>         siglongjmp(Warn_restart, 1);
>     }
>
>     newlocale = setlocale(LC_COLLATE, localestr);
>     if (!newlocale)
>         elog(ERROR, "setlocale failed to set a locale: %s", localestr);
>
>     actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);
>
>     /* if the buffer was not large enough, resize it and try again */
>     if (actual_trans_len >= approx_trans_len)
>     {
>         approx_trans_len = actual_trans_len + 1;
>         if (approx_trans_len > MAX_BYTEA_LEN)
>             elog(ERROR, "source string too long to transform");
>
>         trans = (char *) repalloc(trans, approx_trans_len + 1);
>         actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);
>
>         /* if the buffer still not large enough, punt */
>         if (actual_trans_len >= approx_trans_len)
>             elog(ERROR, "strxfrm failed, buffer insufficient");
>     }
>
>     newlocale = setlocale(LC_COLLATE, oldlocale);
>     if (!newlocale)
>         elog(PANIC, "setlocale failed to reset locale: %s", localestr);
>
>     /* restore normal error handling */
>     memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
>
>     PG_RETURN_BYTEA_P(GET_BYTEA(trans));
> }
>
> 8<--------------------------------

--
greg
And what's the final result? Is this method suitable?

Is it possible for the postgresql leaders to add this function to the
postgresql source code?

glibc does not support the Jalali and Hebrew (and other) calendars, so
postgresql does not support them either. We are planning to write some
simple functions to convert between different calendars.

What's the group's idea?
With Regards,
--taghi
On Sun, 19 Sep 2004, Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > Greg Stark <gsstark@mit.edu> writes:
> > > Peter Eisentraut <peter_e@gmx.net> writes:
> > >> 2) switching the locale at run time is too expensive when using the system
> > >> library.
> >
> > > Fwiw I did some experiments with this and found it wasn't true.
> >
> > Really?
>
> We're following two different methodologies so the results aren't comparable.
> I exposed strxfrm to postgres and then did a sort on strxfrm(col). The
> resulting query times were slower than sorting on lower(col) by negligible
> amounts.

But shouldn't the comparison be against sorting on col, not lower(col)?
strxfrm(col) sorts seem comparable to col; strxfrm(lower(col)) sorts seem
comparable to lower(col). Some collations do treat 'A' and 'a' as being
adjacent in sort order, but that's not a guarantee, so it's not valid to
say, "everywhere you'd use lower(col) you can use strxfrm instead."

And in past numbers you sent, it looked like the amounts were: 1s for sort
on col, 1.5s for sort on lower(col), 2.5s for sort on strxfrm(col). That
doesn't seem negligible to me unless that doesn't grow linearly with the
number of rows. It also seems like if that was the only difference in the
query, then the time for the strxfrm was significant compared to the rest of
the query time on that query.

> > These are on machines of widely varying horsepower, so the absolute
> > numbers shouldn't be compared across rows, but the general story holds:
> > setlocale should be considered to be at least an order of magnitude
> > slower than strcoll, and on non-glibc machines it can be a whole lot
> > worse than that.
>
> I don't see how this is relevant though. One way or another postgres is going
> to have to sort strings in varying locales chosen at run-time. Comparing
> against strcoll's execution time without changing locales is a straw man.
> It's like comparing your tcp/ip bandwidth with the loopback interface's
> bandwidth.
>
> I see no reason to think Postgres's implementation of looking up xfrm rules
> for the specified locale will be any faster than the OS's. We know some OS's
> suck but some certainly don't.

But do you have to change locales per row or per sort? Presumably, a
built-in implementation may be able to do the latter rather than the former.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> But shouldn't the comparison be against sorting on col not lower(col)?
> strxfrm(col) sorts seem comparable to col, strxfrm(lower(col)) sorts seem
> comparable to lower(col). Some collations do treat 'A' and 'a' as be
> adjacent in sort order, but that's not a guarantee, so it's not valid to
> say, "everywhere you'd use lower(col) you can use strxfrm instead."

Well, in my implementation strxfrm is a postgresql function. So I wanted to
compare it with an expression that had at least as much overhead as a
postgresql expression with a single function call.

> And in past numbers you sent, it looked like the amounts were: 1s for sort
> on col, 1.5s for sort on lower(col), 2.5s for sort on strxfrm(col). That
> doesn't seem negligible to me

Right, I amended my "negligible" claim. It's a significant but reasonable
speed. A 1.5s delay on sorting 100k rows is certainly not the kind of
intolerable delay that would make the idea of switching locales intolerable.

> unless that doesn't grow linearly with the number of rows.

Well, I was comparing sorting 206,000 rows. Even if it scales linearly, a
10s delay on sorting 2M records isn't really fatal.

I certainly wouldn't want to remove the ability to sort using strcmp if the
data is ascii or binary. But if you're going to use locale collation order
it's going to be slower. strxfrm has to do quite a bit of work. Even a
postgres-internal mechanism is going to have to do that same work. The only
time you could save is the time it takes to look up "en_US" in a list (or
hash) of cached locales and switch a pointer. I suspect that's going to be a
small (but not negligible) portion of the overhead. I guess this is subject
to analysis; I'll try to do a gprof run at some point to answer that.

> > I see no reason to think Postgres's implementation of looking up xfrm rules
> > for the specified locale will be any faster than the OS's. We know some OS's
> > suck but some certainly don't.
>
> But do you have to change locales per row or per sort? Presumably, a built
> in implementation may be able to do the latter rather than the former.

We certainly need the ability to change the locales per-row, in fact
possibly multiple times per row. Consider

 select en, fr from translations order by en, fr

which is actually something reasonable I could have to do in my current
project. However, changing locales should be nigh-instantaneous; it really
ought to be just changing a pointer. And in the API Tom foresees it
shouldn't even happen. The only cost of sorting on many locales (aside from
the initial load) would be in the reduced cache hit rate from using more
locale tables.

--
greg
On Sat, 25 Sep 2004, Greg Stark wrote: > Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > > > But shouldn't the comparison be against sorting on col not lower(col)? > > strxfrm(col) sorts seem comparable to col, strxfrm(lower(col)) sorts seem > > comparable to lower(col). Some collations do treat 'A' and 'a' as be > > adjacent in sort order, but that's not a guarantee, so it's not valid to > > say, "everywhere you'd use lower(col) you can use strxfrm instead." > > Well, in my implementation strxfrm is a postgresql function. So I wanted to > compare it with an expression that had at least as much overhead as a > postgresql expression with a single function call. I'd thought there was still a question of where such a thing would live? If it's an external project or a contrib thing, the above might be true, but if it's meant to be a truly supported internal builtin then the function call cost is part of the implementation and is significant data that cannot be thrown out. > > And in past numbers you sent, it looked like the amounts were: 1s for sort > > on col, 1.5s for sort on lower(col), 2.5s for sort on strxfrm(col). That > > doesn't seem negligible to me > > Right, I amended my "negligible" claim. It's a significant but reasonable > speed. A 1.5s delay on sorting 100k rows is certainly not the kind of > intolerable delay that would make the idea of switching locales intolerable. Aparently the message I responded to hung around for a while before getting to me because they came out of order. > > unless that doesn't grow linearly with the number of rows. > > Well I was comparing sorting 206,000 rows. Even if it scales linearly, a 10s > delay on sorting 2M records isn't really fatal. I certainly wouldn't want to I agree in general, but if part of this involves forcing "C" locale (see my question at the end) and so any locale sorting is forced to do this, then if a query in en_US currently takes 7 seconds, but now will take 17, I think that's signficant. 
> remove the ability to sort using strcmp if the data is ascii or
> binary. But if you're going to use locale collation order it's going
> to be slower. strxfrm has to do quite a bit of work. Even a
> postgres-internal mechanism is going to have to do that same work.

Was your strxfrm comparison against a column comparison in the "C"
locale then, rather than one using en_US or some other such locale?

> > > I see no reason to think Postgres's implementation of looking up
> > > xfrm rules for the specified locale will be any faster than the
> > > OS's. We know some OS's suck but some certainly don't.
> >
> > But do you have to change locales per row or per sort? Presumably, a
> > built in implementation may be able to do the latter rather than the
> > former.
>
> We certainly need the ability to change the locales per-row, in fact
> possibly multiple times per row.

But presumably we don't have to look up the locale each time, as you
note. I don't see how it matters that our implementation would be no
faster than the OS's if we simply do the lookup less often. Now, as you
also pointed out, it may turn out that the time for the OS lookups after
the first *is* reasonably insignificant.

--
More importantly, do we know whether or not this function really works
properly in non-C locales? Is the strxfrm result guaranteed to sort
correctly (using strcoll) in others?
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

> I'd thought there was still a question of where such a thing would
> live? If it's an external project or a contrib thing, the above might
> be true, but if it's meant to be a truly supported internal builtin,
> then the function call cost is part of the implementation and is
> significant data that cannot be thrown out.

Well, it seems to be the consensus that it would be good to have
complete locale handling as envisioned by the spec. But I don't see that
as relevant to this discussion. I'm comparing a function calling strxfrm
with a function calling lower() and with sorting on a column directly.
The point was to demonstrate that it is practical (if not ideal) to
switch locales repeatedly, especially when you take into account that
*any* function will have some overhead anyway. If it were built into
postgres the overhead might be lower, but I doubt by much, and in any
case that's just not an option for me now.

> Apparently the message I responded to hung around for a while before
> getting to me, because they came out of order.

That seems to be happening a lot lately.

> I agree in general, but if part of this involves forcing the "C"
> locale (see my question at the end), so that any locale sorting is
> forced to do this, then if a query in en_US currently takes 7 seconds
> but now will take 17, I think that's significant.

I compared against sorting in the C locale. It would be interesting to
know how much of the penalty came from simply having to do the strxfrm
work versus the overhead of switching locales. The former is inevitable:
*any* implementation of locale collation orders is going to have to do
it. The latter is maybe something we can work on reducing, though not
without considerable cost in terms of code complexity. It would mean
either lobbying for API changes in libc or growing the codebase of
postgres by the size of an entire i18n package.
I strongly suspect maintaining i18n packages turns out to be a *lot* of
work.

> Was your strxfrm comparison against a column comparison in the "C"
> locale then, rather than one using en_US or some other such locale?

C. I could compare it against sorting in a database created in a given
locale, but I suspect I'll find gprof output more directly helpful.

> But presumably we don't have to look up the locale each time, as you
> note.

The question is whether looking up the locale is significant compared to
executing strxfrm. I suspect it'll be significant, but not the majority
of the time. The real question is whether speeding up sorting by
removing that overhead is worth the complexity of abandoning libc.

I would strongly urge people to consider writing postgres support to
assume standard libc functionality. If we can convince the glibc and BSD
libc people to add a more reasonable interface, we can optionally use
it, just as we do other more modern interfaces to old features. If some
platforms are just terminally braindead, we should look for ways to
support people installing GNU libintl (or whatever the glibc i18n chunk
is called) separately and using it as we do libreadline, libkrb, or
libz.

> More importantly, do we know whether or not this function really works
> properly in non-C locales? Is the strxfrm result guaranteed to sort
> correctly (using strcoll) in others?

Well, you wouldn't want to use strcoll at all, actually, just strcmp.
Actually, Conway's reimplementation returns a bytea, which is probably
more correct than my original plan to return text. Though I should check
whether postgres has to do extra work to sort bytea data instead of
varchar data, especially since strxfrm should never return strings
containing NULs.

--
greg