Thread: Median/Quantile Aggregate

Median/Quantile Aggregate

From: David Orme
Date:
Hi,

I know people have asked about this before but I can't find a working
solution on the web - I've found code for specific instances of
calculating medians but I need a general aggregate function for
calculating medians, or more generally any given quantile.

The kind of thing I need to do is to be able to extract the median
value from a table of 4 million rows, aggregating across more than
50,000 grouping values - the sort of thing that is really easy to do
for averaging:

SELECT grid_id, avg(rs) FROM behr_grid GROUP BY grid_id;

From what I've seen on the web, I know this isn't trivial or
necessarily fast but at the moment I'm reading the data out into
something that calculates medians and then matching it back in and
this doesn't feel particularly elegant!

Any suggestions gratefully received...

Thanks,
David



---------------------------------------
Dr. David Orme

Department of Biological Sciences
Imperial College London
Silwood Park Campus
Ascot, Berkshire SL5 7PY UK.

Tel: +44 (0)20 759 42358
Fax: +44 (0)20 759 42339
e-mail: d.orme@imperial.ac.uk



Re: Median/Quantile Aggregate

From: Sean Davis
Date:
Have you looked into pl/R?

Sean

On May 17, 2005, at 11:46 AM, David Orme wrote:

> [snip]


Re: Median/Quantile Aggregate

From: Tom Lane
Date:
David Orme <d.orme@imperial.ac.uk> writes:
> I know people have asked about this before but I can't find a working
> solution on the web - I've found code for specific instances of
> calculating medians but I need a general aggregate function for
> calculating medians, or more generally any given quantile.

> The kind of thing I need to do is to be able to extract the median
> value from a table of 4 million rows, aggregating across more than
> 50,000 grouping values - the sort of thing that is really easy to do
> for averaging:

> SELECT grid_id, avg(rs) FROM behr_grid GROUP BY grid_id;

You could build a custom aggregate for this.  array_append()
will do fine as the transition function, so all you really need to
write is a final function that sorts the given array and then picks
out the middle (or appropriate-quantile) element.

In fact, you could cheat a bit and let the system do the sorting for
you:

SELECT grid_id, myagg(rs)
FROM (SELECT grid_id, rs FROM behr_grid ORDER BY grid_id, rs) ss
GROUP BY grid_id;

If the aggregate is only used in a context like this, it will always
see presorted input and so it can just pull out the middle element.
(Note: I think this trick only works in PG 7.4 and later.)

So, lightly tested:

regression=# create function get_middle(anyarray) returns anyelement as
regression-# 'declare n integer;
regression'# begin
regression'#   n := (array_lower($1, 1) + array_upper($1, 1)) / 2;
regression'#   return $1[n];
regression'# end' language plpgsql;
CREATE FUNCTION
regression=# create aggregate sortedmedian(
regression(#   sfunc = array_append,
regression(#   finalfunc = get_middle,
regression(#   basetype = anyelement,
regression(#   stype = anyarray,
regression(#   initcond = '{}'
regression(# );
CREATE AGGREGATE
regression=# select hundred, min(thousand), max(thousand), sortedmedian(thousand) from
regression-# (select hundred, thousand from tenk1 order by 1,2) ss
regression-# group by hundred;
 hundred | min | max | sortedmedian
---------+-----+-----+--------------
       0 |   0 | 900 |          400
       1 |   1 | 901 |          401
       2 |   2 | 902 |          402
       3 |   3 | 903 |          403
       4 |   4 | 904 |          404
       5 |   5 | 905 |          405
       6 |   6 | 906 |          406
       7 |   7 | 907 |          407
       8 |   8 | 908 |          408
       9 |   9 | 909 |          409
      10 |  10 | 910 |          410
      ...


            regards, tom lane
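Tom's get_middle pulls out only the middle element; the same final-function arithmetic extends to the arbitrary quantiles David asked about. A minimal sketch of that rank-interpolation step, in Python purely for illustration (the function name and the linear-interpolation rule are an editor's assumption, not something from the thread):

```python
def quantile_of_sorted(values, q):
    """Return the q-quantile (0 <= q <= 1) of an already-sorted sequence,
    linearly interpolating between the two bracketing elements."""
    if not values:
        raise ValueError("empty input")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    pos = q * (len(values) - 1)   # fractional rank into the sorted data
    lo = int(pos)                 # index of the lower bracketing element
    frac = pos - lo
    if frac == 0.0:
        return values[lo]
    return values[lo] + frac * (values[lo + 1] - values[lo])

print(quantile_of_sorted([1, 2, 3, 4, 5], 0.5))   # median -> 3
print(quantile_of_sorted([1, 2, 3, 4], 0.5))      # even count -> 2.5
```

A plpgsql final function in the style of get_middle would do the same arithmetic over the array_lower/array_upper bounds instead of Python list indices.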

Re: Median/Quantile Aggregate

From: Adam Witney
Date:
Using PL/R, two examples... One for median and inter-quartile range

CREATE OR REPLACE FUNCTION r_median(_float8) RETURNS float AS '
  median(arg1)
' LANGUAGE 'plr';

CREATE AGGREGATE median (
  sfunc = plr_array_accum,
  basetype = float8,
  stype = _float8,
  finalfunc = r_median
);


CREATE OR REPLACE FUNCTION r_iqr(_float8) RETURNS float AS '
  IQR(arg1)
' LANGUAGE 'plr';

CREATE AGGREGATE iqr (
  sfunc = plr_array_accum,
  basetype = float8,
  stype = _float8,
  finalfunc = r_iqr
);





Re: Median/Quantile Aggregate

From: David Orme
Date:
On 17 May 2005, at 17:15, Gordon Haverland wrote:

> On Tuesday 17 May 2005 09:46, you wrote:
>
>> Hi,
>>
>> I know people have asked about this before but I can't find a
>> working solution on the web - I've found code for specific
>> instances of calculating medians but I need a general aggregate
>> function for calculating medians, or more generally any given
>> quantile.
>>
>> The kind of thing I need to do is to be able to extract the
>> median value from a table of 4 million rows, aggregating across
>> more than 50,000 grouping values - the sort of thing that is
>> really easy to do for averaging:
>>
>> SELECT grid_id, avg(rs) FROM behr_grid GROUP BY grid_id;
>>
>> From what I've seen on the web, I know this isn't trivial or
>> necessarily fast but at the moment I'm reading the data out
>> into something that calculates medians and then matching it
>> back in and this doesn't feel particularly elegant!
>>
>
> This is all from a numerical methods point of view, not a
> dbase/SQL point of view.
>
> Most of the quantile stuff I remember, at some point involves
> sorting and then binning.  A person can sort all of the data
> points, and be sure to find whatever quantiles you are interested
> in (possibly using interpolation between 2 values which are close
> to the desired quantile).
>
> In your case, 4 million rows really isn't too many values, and so
> you probably could do this "in core".  But, it may be you are
> headed towards larger problems, where the "out of core" methods
> might be more useful.

Actually the size of the table should be fairly static - there is the
possibility that it could increase to 16 million though!

> With the out of core methods, you are going to pass over the data
> more than once.  There may be a special "first pass" in order to
> obtain some kind of estimate for the value you are looking for
> and a range to look over.  If you are looking for a median and
> know the average (and hence presumably the standard deviation as
> well), you can look over a window centered on the mean with a
> width that is some fraction of the standard deviation (or
> standard error).  The next pass over the data just counts how
> many values are below the lower limit, how many values are above
> the upper limit, and copies into core (or a temporary table) all
> the values within the window of interest.  Then the windowed
> values are sorted (hopefully in core), and the desired quantile
> calculated.  The smaller the window, the fewer values copied out,
> the faster the sort goes.  But, smaller windows also mean greater
> chance that the quantile you are interested in isn't within the
> window range you are using.  Which means at least one more pass
> through the data with new lower and upper limits.

Neat idea and I can certainly see the advantage. My worry is that the
median (and more generally any quantile) is usually applied to
variables that depart from the expectations of any particular
canonical distribution. This is certainly the case with my data, so
my feeling is that a potentially slower sort/bin approach is safer.

>
> If you happen to know the underlying statistical distribution of
> the data, it may be that the functional form of that distribution
> can be calculated from a few statistics.  For example: the
> Gaussian distribution can be calculated from the mean and
> variance.  With a functional form, you may be able to
> analytically calculate whatever quantile is of interest.  Of
> course, this presupposes that there are no outliers in your data.

That is at the root of the problem - I'm dealing with non-normal
variables (strongly right skewed in this instance) and so any
approach using a mean and standard deviation to approximate a
quantile is going to go badly astray!
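The analytic shortcut Gordon describes (and David rightly rules out for skewed data) really is a one-liner when the distribution is Gaussian. A sketch using Python's standard-library NormalDist, for illustration only (the mean and standard deviation here are made-up numbers, not David's data):

```python
from statistics import NormalDist

# If the data were truly Gaussian with known mean and standard deviation,
# any quantile follows analytically from the inverse CDF. For strongly
# skewed data like David's, this estimate would be badly biased -- which
# is exactly his objection.
dist = NormalDist(mu=100.0, sigma=15.0)
print(dist.inv_cdf(0.5))    # the median of a Gaussian is just the mean
print(dist.inv_cdf(0.975))  # approximately mean + 1.96 * sigma
```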

Thanks for the response,
David
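Gordon's multi-pass windowed approach can be sketched as follows, in Python for illustration (the function names, retry policy, and window-widening factor are an editor's assumptions, not from the thread; a real out-of-core version would scan the table rather than a Python iterable):

```python
import math

def windowed_median(scan, width=0.5, max_tries=10):
    """Multi-pass median estimate in the spirit of Gordon's suggestion:
    pass 1 computes a streaming mean/stddev to centre the window; each
    later pass counts values below the window, keeps values inside it,
    and discards values above it. `scan` is a zero-argument callable
    returning a fresh iterator over the data (a fresh table scan)."""
    # Pass 1: streaming mean and standard deviation.
    n = s = ss = 0
    for x in scan():
        n += 1
        s += x
        ss += x * x
    mean = s / n
    std = math.sqrt(max(ss / n - mean * mean, 0.0))

    half = width * std if std > 0 else 1.0
    for _ in range(max_tries):
        lo, hi = mean - half, mean + half
        below = 0
        window = []
        for x in scan():
            if x < lo:
                below += 1
            elif x <= hi:
                window.append(x)
        window.sort()
        # The median has 0-based rank n // 2 over the whole data set
        # (the upper-middle element for even n); it is inside the window
        # iff that rank falls among the kept values.
        k = n // 2 - below
        if 0 <= k < len(window):
            return window[k]
        half *= 2  # median fell outside the window: widen and rescan

    raise RuntimeError("window never captured the median")

data = [1, 1, 1, 2, 3, 50]  # strongly right-skewed, like David's data
print(windowed_median(lambda: iter(data)))
```

Note Gordon's trade-off shows up directly in the loop: a narrower `width` keeps less in memory and sorts faster, but raises the chance of an extra full pass.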




Re: Median/Quantile Aggregate

From
David Orme
Date:
Hi Sean and Adam,

I certainly have considered PL/R but I was initially hoping for
something that meant I didn't have to recompile R with the shared
library enabled on my Mac!  From the looks of Adam's messages on the
PL/R helplist, that doesn't look like too much of a struggle.

Since R is already my first choice for statistical analysis (and
indeed, the external software I was using for medians in the first
place!) this seems like a good way forward.

Tom - thanks for the plpgsql suggestion. I'll try and get that
working too, for the practice!

Thanks to all,
David

On 17 May 2005, at 17:34, Adam Witney wrote:

> [snip]