finding medians - Mailing list pgsql-hackers

From Josh Burdick
Subject finding medians
Date
Msg-id 3CF688AB.4010005@gradient.cis.upenn.edu
Whole thread Raw
Responses Re: finding medians  (Josh Berkus <josh@agliodbs.com>)
Re: finding medians  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: finding medians  (Hannu Krosing <hannu@tm.ee>)
List pgsql-hackers
At the end of this message is some code I used to find medians.   It's kind of a hack, but approximately works, and
isintended as a 
 
somewhat awkward stopgap for people who need to use medians.  It 
illustrates the limitations of the current aggregate function setup, 
which works so nicely for avg() and stddev().   I don't have any good solutions.  I tried using a float4[] to store 
each element as it's added, but I couldn't get array updates working in 
PL/PgSQL, so that didn't help.   Perhaps aggregate functions could be passed an array?  Or a cursor, 
pointing at the first line?  I'm not sure.
   Anyways, perhaps it'll be helpful.   Josh

-- 
Josh Burdick
jburdick@gradient.cis.upenn.edu
http://www.cis.upenn.edu/~jburdick



/* Implementing median-finding in "pure Postgres."  Does this by
copying data to a temporary table.
  A weakness of this code is that it uses sorting, instead of
Hoare's linear-time median algorithm.  Presumably sorting is
implemented so efficiently that it'll be faster than anything
written in PL/PgSQL.  (Although Hoare's algorithm implemented
in C would be faster than either.)
  BUG: this isn't properly set up to deal with multiple users.
For example, if A computes a median, then B could read the data
from the median_tmp table.  Possibly you could fiddle with
transaction isolation levels, or add a user field to median_tmp,
or something else complicated, to prevent this, but for now I'm
not worrying about this.
  Written by Josh Burdick (jburdick@gradient.cis.upenn.edu).
Anyone can use this under the same license as Postgres.
  20020524, jtb: started. */

drop aggregate median(float4);
drop table median_tmp;
drop sequence median_id;
drop index median_tmp_median_id;
drop function median_sfunc_float4(bigint, float4);
drop function median_finalfunc_float4(bigint);

create sequence median_id;
create table median_tmp ( median_id int, x float4
);
create index median_tmp_median_id on median_tmp(median_id);

create function median_sfunc_float4
(bigint, float4) returns bigint as '

insert into median_tmp
values (case when $1 = 0 then nextval(''median_id'') else $1 end, $2);

select currval(''median_id'');

' language 'SQL';

create function median_finalfunc_float4
(bigint) returns float4 as '
declare

i bigint;
n bigint;
c refcursor;
m float4;
m1 float4;

begin

n := (select count(*) from median_tmp where median_id = $1);

open c for select x from median_tmp where median_id = $1 order by x;

for i in 1..((n+1)/2) loop fetch c into m;
end loop;

/* if n is even, fetch the next value, and average the two */
if (n % int8(2) = int8(0)) then fetch c into m1; m := (m + m1) / 2;
end if;

delete from median_tmp where median_id = $1;

return m;

end
' language 'plpgsql';

create aggregate median ( basetype = float4, stype = bigint, initcond = 0, sfunc = median_sfunc_float4, finalfunc =
median_finalfunc_float4
);








pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: self-tuning histograms
Next
From: Josh Berkus
Date:
Subject: Re: finding medians