Re: Median/Quantile Aggregate - Mailing list pgsql-novice

From Tom Lane
Subject Re: Median/Quantile Aggregate
Date
Msg-id 29177.1116347496@sss.pgh.pa.us
Whole thread Raw
In response to Median/Quantile Aggregate  (David Orme <d.orme@imperial.ac.uk>)
List pgsql-novice
David Orme <d.orme@imperial.ac.uk> writes:
> I know people have asked about this before but I can't find a working
> solution on the web - I've found code for specific instances of
> calculating medians but I need a general aggregate function for
> calculating medians, or more generally any given quantile.

> The kind of thing I need to do is to be able to extract the median
> value from a table of 4 million rows, aggregating across more than
> 50,000 grouping values - the sort of thing that is really easy to do
> for averaging:

> SELECT grid_id, avg(rs) FROM behr_grid GROUP BY grid_id;

You could build a custom aggregate for this.  array_append()
will do fine as the transition function, so all you really need to
write is a final function that sorts the given array and then picks
out the middle (or appropriate-quantile) element.

In fact, you could cheat a bit and let the system do the sorting for
you:

SELECT grid_id, myagg(rs)
FROM (SELECT grid_id, rs FROM behr_grid ORDER BY grid_id, rs) ss
GROUP BY grid_id;

If the aggregate is only used in a context like this, it will always
see presorted input and so it can just pull out the middle element.
(Note: I think this trick only works in PG 7.4 and later.)

So, lightly tested:

regression=# create function get_middle(anyarray) returns anyelement as
regression-# 'declare n integer;
regression'# begin
regression'#   n := (array_lower($1, 1) + array_upper($1, 1)) / 2;
regression'#   return $1[n];
regression'# end' language plpgsql;
CREATE FUNCTION
regression=# create aggregate sortedmedian(
regression(#   sfunc = array_append,
regression(#   finalfunc = get_middle,
regression(#   basetype = anyelement,
regression(#   stype = anyarray,
regression(#   initcond = '{}'
regression(# );
CREATE AGGREGATE
regression=# select hundred, min(thousand), max(thousand), sortedmedian(thousand) from
regression-# (select hundred, thousand from tenk1 order by 1,2) ss
regression-# group by hundred;
 hundred | min | max | sortedmedian
---------+-----+-----+--------------
       0 |   0 | 900 |          400
       1 |   1 | 901 |          401
       2 |   2 | 902 |          402
       3 |   3 | 903 |          403
       4 |   4 | 904 |          404
       5 |   5 | 905 |          405
       6 |   6 | 906 |          406
       7 |   7 | 907 |          407
       8 |   8 | 908 |          408
       9 |   9 | 909 |          409
      10 |  10 | 910 |          410
      ...


            regards, tom lane

pgsql-novice by date:

Previous
From: Sean Davis
Date:
Subject: Re: Median/Quantile Aggregate
Next
From: Adam Witney
Date:
Subject: Re: Median/Quantile Aggregate