Thread: requesting features in PostgreSQL
I would like to request a new aggregate function for PostgreSQL. I need an aggregate function that calculates geometric mean, and can be used in SQL statements the same as any other aggregate function such as count(), sum() and avg(). Rather than have a custom function, I'd like to see it added as a standard PostgreSQL feature. Is there a specific PostgreSQL developer that I should contact? Or is an open request on this list sufficient for consideration? The geometric mean function, I'll call it gmean(), is similar to avg(), except that instead of adding the individual values, you would multiply them; and instead of dividing the sum by the sample size, you would raise the resulting product by a power of (1 / sample size). Therefore, the gmean() of the values 1, 2, 3 and 9 would equal: = (1 * 2 * 3 * 9) ^ (1 / 4) = 54 ^ 0.25 = 2.710806 # rounded to 6 decimal places This function differs from arithmetic mean (average) in that it lessens the affect of outliers without discounting them altogether. Thanks, Andrew Gould __________________________________________________ Do You Yahoo!? Yahoo! Games - play chess, backgammon, pool and more http://games.yahoo.com/
Andrew Gould <andrewgould@yahoo.com> writes: > I need an aggregate function that calculates geometric > mean, and can be used in SQL statements the same as > any other aggregate function such as count(), sum() > and avg(). Rather than have a custom function, I'd > like to see it added as a standard PostgreSQL feature. One request does not strike me as sufficient reason to make it a standard feature. Extensibility is what Postgres is all about --- so go ahead and write your own. AFAICS this should take about ten minutes to prototype (two simple plpgsql or pltcl functions and a user-defined aggregate). If you intend to process very large volumes of data, it might be worth rewriting the transition function in C for speed. (You could steal the existing transition function for avg() as a model.) regards, tom lane
One request is certainly not sufficient for a new standard feature. As I see more industries becoming data savvy, however, I think it's time will come. Until then.... Simple python scripts are the extent of my coding experience. As I will be processing 2 to 6 million records within one query, I guess I should learn some C. Once I untar PostgreSQL's source, where can I find the code for avg()? Will it have it's own file? Thanks, Andrew --- Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Gould <andrewgould@yahoo.com> writes: > > I need an aggregate function that calculates > geometric > > mean, and can be used in SQL statements the same > as > > any other aggregate function such as count(), > sum() > > and avg(). Rather than have a custom function, > I'd > > like to see it added as a standard PostgreSQL > feature. > > One request does not strike me as sufficient reason > to make it a > standard feature. Extensibility is what Postgres is > all about > --- so go ahead and write your own. > > AFAICS this should take about ten minutes to > prototype (two simple > plpgsql or pltcl functions and a user-defined > aggregate). If you intend > to process very large volumes of data, it might be > worth rewriting the > transition function in C for speed. (You could > steal the existing > transition function for avg() as a model.) > > regards, tom lane > > ---------------------------(end of > broadcast)--------------------------- > TIP 2: you can get off all lists at once with the > unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) __________________________________________________ Do You Yahoo!? Yahoo! Games - play chess, backgammon, pool and more http://games.yahoo.com/
Andrew Gould <andrewgould@yahoo.com> writes: > Once I untar PostgreSQL's source, where can I find the > code for avg()? Will it have it's own file? The C code is in src/backend/utils/adt/float.c --- look at float8_accum() and float8_avg(). regards, tom lane
I decided to use your problem as a little exercize for myself, so I came up with a working aggregate for gmean. I posted the code below. I wrote it in plpgsql. It's possible the "a1" function (the main part of the aggregate) won't be the bottleneck for performance. You could probably translate to C, and you can probably solve the problem more gracefully than I, but I have working code. I couldn't think of what else to use as a state type, so I just used a two-element array of floats. The first element holds the running product (i.e. 1*2*3*9) and the second holds the number of records visited (4). I would appreciate it if you (or anyone else) would post (or direct email) what changes you make for efficiency, including if you write a c function for a1 (a2 doesn't really need a C func unless you really want, since it should only be called once per aggregation). I could probably help you translate to C if you'd like, but first I'd like to make sure I have the most efficient algorithm. Regards, Jeff ______________________________________ create function a1f(float[2]) returns float as ' BEGIN return ($1[1]^(1/($1[2]))); END; ' language 'plpgsql'; create function a1(float[2],float) returns float[2] as ' DECLARE ret float[2]; BEGIN ret := ''{'' || (($1[1]) * ($2)) || '','' || (($1[2]) + 1) || ''}''; RETURN ret; END; ' language 'plpgsql'; create aggregate a2 (basetype=float,sfunc=a1,stype=float[],finalfunc=a1f,initcond='{1.0,0.0}');
Jeff Davis <list-pgsql-general@empires.org> writes: > create function a1(float[2],float) returns float[2] as ' > DECLARE > ret float[2]; > BEGIN > ret := ''{'' || (($1[1]) * ($2)) || '','' || (($1[2]) + 1) || ''}''; > RETURN ret; > END; > ' language 'plpgsql'; We really need better support for arrays in plpgsql :-(. The above will work, but it invokes conversion of floats to text and back again on every call; that's slow and will probably cause accumulation of roundoff errors in the aggregate result. I tried to do this: create function a1(float[2],float) returns float[2] as ' declare ret float[2]; begin ret := $1; ret[1] := ret[1] * $2; ret[2] := ret[2] + 1; return ret; end' language plpgsql; but it failed with syntax errors --- plpgsql doesn't understand the notation "var[subscript] := something". Someone oughta dig into it and fix that. In the meantime I think the most practical way to do this task in plpgsql is to abuse the built-in "point" type, which can be treated as an array of 2 floats: regression=# create function a1(point,float) returns point as ' regression'# begin regression'# return point($1[0] * $2, $1[1] + 1); regression'# end' language plpgsql; CREATE regression=# select a1('(2,3)'::point, 44); a1 -------- (88,4) (1 row) Note that the subscripts are [0],[1] not [1],[2] ... a bit of legacy incompatibility ... regards, tom lane
Yup, I tried the same exact thing :) I wouldn't mind seeing some better support in plpgsql either. However, anyone can create their own type so I guess it isn't much of a problem. I updated my code to use points instead, and more meaningful identifiers. ----------------------------------------------------------- create function float_gmean(point) returns float as ' BEGIN return ($1[0]^(1/($1[1]))); END; ' language 'plpgsql'; create function float_gmean_accum(point,float) returns point as ' BEGIN RETURN point(($1[0]*$2),($1[1]+1)); END; ' language 'plpgsql'; create aggregate gmean (basetype=float,sfunc=float_gmean_accum,stype=point,finalfunc=float_gmean,initcond='(1.0,0.0)'); ----------------------------------------------------------- > I tried to do this: > > create function a1(float[2],float) returns float[2] as ' > declare > ret float[2]; > begin > ret := $1; > ret[1] := ret[1] * $2; > ret[2] := ret[2] + 1; > return ret; > end' language plpgsql; > > but it failed with syntax errors --- plpgsql doesn't understand the > notation "var[subscript] := something". Someone oughta dig into it > and fix that. > > In the meantime I think the most practical way to do this task in > plpgsql is to abuse the built-in "point" type, which can be treated > as an array of 2 floats: > > regression=# create function a1(point,float) returns point as ' > regression'# begin > regression'# return point($1[0] * $2, $1[1] + 1); > regression'# end' language plpgsql; > CREATE > regression=# select a1('(2,3)'::point, 44); > a1 > -------- > (88,4) > (1 row) > > Note that the subscripts are [0],[1] not [1],[2] ... a bit of legacy > incompatibility ... > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster
I was looking at the code. Why do you calculate "sumX2" in float8_accum? Later you have a comment "ignore sumX2" in float8_avg(). Regards, Jeff On Friday 26 April 2002 11:30 am, Tom Lane wrote: > Andrew Gould <andrewgould@yahoo.com> writes: > > Once I untar PostgreSQL's source, where can I find the > > code for avg()? Will it have it's own file? > > The C code is in src/backend/utils/adt/float.c --- look at > float8_accum() and float8_avg(). > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster
Jeff Davis <list-pgsql-general@empires.org> writes: > I was looking at the code. Why do you calculate "sumX2" in float8_accum? Because the same accumulator function is also used for stddev. regards, tom lane
I want to thank you both for your help on this. I read the C source and the plpgsql below. I think I'm understanding what I'm seeing. I will be out of town for several days; but will try out the plpsql solution next week. (If my luck in airports holds true, I'll have plenty of time to read the documentation.) Thanks again, Andrew --- Jeff Davis <list-pgsql-general@empires.org> wrote: > Yup, I tried the same exact thing :) I wouldn't mind > seeing some better > support in plpgsql either. However, anyone can > create their own type so I > guess it isn't much of a problem. > > I updated my code to use points instead, and more > meaningful identifiers. > > ----------------------------------------------------------- > create function float_gmean(point) returns float as > ' > BEGIN > return ($1[0]^(1/($1[1]))); > END; > ' language 'plpgsql'; > > create function float_gmean_accum(point,float) > returns point as ' > BEGIN > RETURN point(($1[0]*$2),($1[1]+1)); > END; > ' language 'plpgsql'; > > create aggregate gmean > (basetype=float,sfunc=float_gmean_accum,stype=point,finalfunc=float_gmean,initcond='(1.0,0.0)'); > ----------------------------------------------------------- > > > > I tried to do this: > > > > create function a1(float[2],float) returns > float[2] as ' > > declare > > ret float[2]; > > begin > > ret := $1; > > ret[1] := ret[1] * $2; > > ret[2] := ret[2] + 1; > > return ret; > > end' language plpgsql; > > > > but it failed with syntax errors --- plpgsql > doesn't understand the > > notation "var[subscript] := something". Someone > oughta dig into it > > and fix that. > > > > In the meantime I think the most practical way to > do this task in > > plpgsql is to abuse the built-in "point" type, > which can be treated > > as an array of 2 floats: > > > > regression=# create function a1(point,float) > returns point as ' > > regression'# begin > > regression'# return point($1[0] * $2, $1[1] + 1); > > regression'# end' language plpgsql; > > CREATE > > regression=# select a1('(2,3)'::point, 44); > > a1 > > -------- > > (88,4) > > (1 row) > > > > Note that the subscripts are [0],[1] not [1],[2] > ... a bit of legacy > > incompatibility ... > > > > regards, tom lane > > > > ---------------------------(end of > broadcast)--------------------------- > > TIP 4: Don't 'kill -9' the postmaster __________________________________________________ Do You Yahoo!? Yahoo! Health - your guide to health and wellness http://health.yahoo.com