Thread: Median/Quantile Aggregate
Hi,

I know people have asked about this before, but I can't find a working solution on the web - I've found code for specific instances of calculating medians, but I need a general aggregate function for calculating medians, or more generally any given quantile.

The kind of thing I need to do is to extract the median value from a table of 4 million rows, aggregating across more than 50,000 grouping values - the sort of thing that is really easy to do for averaging:

SELECT grid_id, avg(rs) FROM behr_grid GROUP BY grid_id;

From what I've seen on the web, I know this isn't trivial or necessarily fast, but at the moment I'm reading the data out into something that calculates medians and then matching it back in, and this doesn't feel particularly elegant!

Any suggestions gratefully received...

Thanks,
David

---------------------------------------
Dr. David Orme

Department of Biological Sciences
Imperial College London
Silwood Park Campus
Ascot, Berkshire SL5 7PY UK.

Tel: +44 (0)20 759 42358
Fax: +44 (0)20 759 42339
e-mail: d.orme@imperial.ac.uk
Have you looked into PL/R?

Sean

On May 17, 2005, at 11:46 AM, David Orme wrote:
> [original question snipped - see above]
David Orme <d.orme@imperial.ac.uk> writes:
> I've found code for specific instances of calculating medians but I
> need a general aggregate function for calculating medians, or more
> generally any given quantile.

You could build a custom aggregate for this. array_append() will do
fine as the transition function, so all you really need to write is a
final function that sorts the given array and then picks out the
middle (or appropriate-quantile) element.

In fact, you could cheat a bit and let the system do the sorting for
you:

SELECT grid_id, myagg(rs)
FROM (SELECT grid_id, rs FROM behr_grid ORDER BY grid_id, rs) ss
GROUP BY grid_id;

If the aggregate is only used in a context like this, it will always
see presorted input and so it can just pull out the middle element.
(Note: I think this trick only works in PG 7.4 and later.)

So, lightly tested:

create function get_middle(anyarray) returns anyelement as '
declare
  n integer;
begin
  n := (array_lower($1, 1) + array_upper($1, 1)) / 2;
  return $1[n];
end' language plpgsql;

create aggregate sortedmedian (
  sfunc = array_append,
  finalfunc = get_middle,
  basetype = anyelement,
  stype = anyarray,
  initcond = '{}'
);

select hundred, min(thousand), max(thousand), sortedmedian(thousand)
from (select hundred, thousand from tenk1 order by 1, 2) ss
group by hundred;

 hundred | min | max | sortedmedian
---------+-----+-----+--------------
       0 |   0 | 900 |          400
       1 |   1 | 901 |          401
       2 |   2 | 902 |          402
       3 |   3 | 903 |          403
       4 |   4 | 904 |          404
       5 |   5 | 905 |          405
       6 |   6 | 906 |          406
       7 |   7 | 907 |          407
       8 |   8 | 908 |          408
       9 |   9 | 909 |          409
      10 |  10 | 910 |          410
...

			regards, tom lane
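[Tom's get_middle() returns only the midpoint, but the same pattern extends to any fixed quantile. A minimal sketch in the same quoted-body plpgsql style, again assuming presorted input; the name get_lower_quartile and the hard-wired 0.25 are illustrative, not from the thread:

create function get_lower_quartile(anyarray) returns anyelement as '
declare
  n integer;
begin
  -- the array is assumed to arrive presorted, as in the subselect
  -- trick above; 0.25 selects the lower quartile and is hard-wired
  -- purely for illustration
  n := array_lower($1, 1)
       + floor(0.25 * (array_upper($1, 1) - array_lower($1, 1)));
  return $1[n];
end' language plpgsql;

create aggregate sortedlowerquartile (
  sfunc = array_append,
  finalfunc = get_lower_quartile,
  basetype = anyelement,
  stype = anyarray,
  initcond = '{}'
);

Swapping the constant gives any other quantile; linear interpolation between $1[n] and $1[n+1] would give the usual weighted definitions, but is left out to keep the sketch short.]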
Using PL/R, two examples: one for the median and one for the inter-quartile range.

CREATE OR REPLACE FUNCTION r_median(_float8) RETURNS float AS '
  median(arg1)
' LANGUAGE 'plr';

CREATE AGGREGATE median (
  sfunc = plr_array_accum,
  basetype = float8,
  stype = _float8,
  finalfunc = r_median
);

CREATE OR REPLACE FUNCTION r_iqr(_float8) RETURNS float AS '
  IQR(arg1)
' LANGUAGE 'plr';

CREATE AGGREGATE iqr (
  sfunc = plr_array_accum,
  basetype = float8,
  stype = _float8,
  finalfunc = r_iqr
);

> [original question snipped - see above]
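[Once PL/R is installed and the two aggregates above are defined, the grouping query looks just like the avg() version. A usage sketch against David's table; median() and iqr() here are the user-defined aggregates above, not built-ins, and rs is assumed to be float8 or castable to it:

SELECT grid_id, median(rs), iqr(rs)
FROM behr_grid
GROUP BY grid_id;]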
On 17 May 2005, at 17:15, Gordon Haverland wrote:
> On Tuesday 17 May 2005 09:46, you wrote:
>> [original question snipped - see above]
>
> This is all from a numerical methods point of view, not a
> database/SQL point of view.
>
> Most of the quantile stuff I remember, at some point, involves
> sorting and then binning. A person can sort all of the data
> points, and be sure to find whatever quantiles you are interested
> in (possibly using interpolation between 2 values which are close
> to the desired quantile).
>
> In your case, 4 million rows really isn't too many values, and so
> you probably could do this "in core". But, it may be you are
> headed towards larger problems, where the "out of core" methods
> might be more useful.

Actually, the size of the table should be fairly static - there is the possibility that it could increase to 16 million, though!

> With the out of core methods, you are going to pass over the data
> more than once. There may be a special "first pass" in order to
> obtain some kind of estimate for the value you are looking for
> and a range to look over. If you are looking for a median and
> know the average (and hence presumably the standard deviation as
> well), you can look over a window centered on the mean with a
> width that is some fraction of the standard deviation (or
> standard error). The next pass over the data just counts how
> many values are below the lower limit, how many values are above
> the upper limit, and copies into core (or a temporary table) all
> the values within the window of interest. Then the windowed
> values are sorted (hopefully in core), and the desired quantile
> calculated. The smaller the window, the fewer values copied out,
> the faster the sort goes. But, smaller windows also mean a greater
> chance that the quantile you are interested in isn't within the
> window range you are using. Which means at least one more pass
> through the data with new lower and upper limits.

Neat idea, and I can certainly see the advantage. My worry is that the median (and, more generally, any quantile) is usually applied to variables that depart from the expectations of any particular canonical distribution. This is certainly the case with my data, so my feeling is that a potentially slower sort/bin approach is safer.

> If you happen to know the underlying statistical distribution of
> the data, it may be that the functional form of that distribution
> can be calculated from a few statistics. For example: the
> Gaussian distribution can be calculated from the mean and
> variance. With a functional form, you may be able to
> analytically calculate whatever quantile is of interest. Of
> course, this presupposes that there are no outliers in your data.
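[For concreteness, a minimal SQL sketch of the windowed, multi-pass idea Gordon describes, applied to a single group of David's table. The grid_id = 1 filter, the one-standard-deviation window, and the :lo/:hi placeholders (psql variables carrying values forward from the first pass) are all illustrative assumptions, not from the thread:

-- Pass 1: cheap aggregates give a first guess at where the median lies.
-- A window of mean +/- one standard deviation is an arbitrary starting
-- width; strongly skewed data (as in David's case) may need a wider or
-- off-centre window.
SELECT count(*) AS n,
       avg(rs) - stddev(rs) AS lo,
       avg(rs) + stddev(rs) AS hi
FROM behr_grid
WHERE grid_id = 1;

-- Pass 2: count the values below the window, then pull only the
-- in-window values out to be sorted "in core".
SELECT count(*) AS n_below
FROM behr_grid
WHERE grid_id = 1 AND rs < :lo;

SELECT rs
FROM behr_grid
WHERE grid_id = 1 AND rs >= :lo AND rs <= :hi
ORDER BY rs;

-- The (lower) median sits at overall rank ceil(n / 2.0). If
-- n_below < ceil(n / 2.0) <= n_below + (rows returned), it is the
-- (ceil(n / 2.0) - n_below)-th of the sorted in-window values;
-- otherwise the window missed it - widen lo/hi and repeat pass 2.]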
That is at the root of the problem - I'm dealing with non-normal variables (strongly right-skewed in this instance), and so any approach using a mean and standard deviation to approximate a quantile is going to go badly astray!

Thanks for the response,
David
Hi Sean and Adam,

I certainly have considered PL/R, but I was initially hoping for something that meant I didn't have to recompile R with the shared library enabled on my Mac! From the looks of Adam's messages on the PL/R help list, that doesn't look like too much of a struggle. Since R is already my first choice for statistical analysis (and indeed, the external software I was using for medians in the first place!), this seems like a good way forward.

Tom - thanks for the plpgsql suggestion. I'll try and get that working too, for the practice!

Thanks to all,
David

On 17 May 2005, at 17:34, Adam Witney wrote:
> [PL/R aggregate definitions snipped - see above]