Thread: finding medians
At the end of this message is some code I used to find medians.  It's
kind of a hack, but approximately works, and is intended as a somewhat
awkward stopgap for people who need to use medians.  It illustrates the
limitations of the current aggregate function setup, which works so
nicely for avg() and stddev().

I don't have any good solutions.  I tried using a float4[] to store
each element as it's added, but I couldn't get array updates working in
PL/PgSQL, so that didn't help.

Perhaps aggregate functions could be passed an array?  Or a cursor,
pointing at the first line?  I'm not sure.

Anyways, perhaps it'll be helpful.

Josh

--
Josh Burdick
jburdick@gradient.cis.upenn.edu
http://www.cis.upenn.edu/~jburdick

/* Implementing median-finding in "pure Postgres."  Does this by
   copying data to a temporary table.

   A weakness of this code is that it uses sorting, instead of Hoare's
   linear-time median algorithm.  Presumably sorting is implemented so
   efficiently that it'll be faster than anything written in PL/PgSQL.
   (Although Hoare's algorithm implemented in C would be faster than
   either.)

   BUG: this isn't properly set up to deal with multiple users.  For
   example, if A computes a median, then B could read the data from
   the median_tmp table.  Possibly you could fiddle with transaction
   isolation levels, or add a user field to median_tmp, or something
   else complicated, to prevent this, but for now I'm not worrying
   about this.

   Written by Josh Burdick (jburdick@gradient.cis.upenn.edu).
   Anyone can use this under the same license as Postgres.

   20020524, jtb: started. */

-- clean up any previous version
drop aggregate median(float4);
drop index median_tmp_median_id;
drop table median_tmp;
drop sequence median_id;
drop function median_sfunc_float4(bigint, float4);
drop function median_finalfunc_float4(bigint);

create sequence median_id;

create table median_tmp (
    median_id int,
    x float4
);

create index median_tmp_median_id on median_tmp(median_id);

-- transition function: stash each incoming value in median_tmp, keyed
-- by an id drawn from the median_id sequence on the first call
create function median_sfunc_float4 (bigint, float4) returns bigint
as '
    insert into median_tmp values
        (case when $1 = 0 then nextval(''median_id'') else $1 end, $2);
    select currval(''median_id'');
' language 'SQL';

-- final function: sort the stashed values, walk a cursor to the
-- middle, average the two center values if the count is even, then
-- clean up
create function median_finalfunc_float4 (bigint) returns float4
as '
declare
    i bigint;
    n bigint;
    c refcursor;
    m float4;
    m1 float4;
begin
    n := (select count(*) from median_tmp where median_id = $1);

    open c for select x from median_tmp where median_id = $1 order by x;

    for i in 1..((n+1)/2) loop
        fetch c into m;
    end loop;

    /* if n is even, fetch the next value, and average the two */
    if (n % int8(2) = int8(0)) then
        fetch c into m1;
        m := (m + m1) / 2;
    end if;

    delete from median_tmp where median_id = $1;

    return m;
end
' language 'plpgsql';

create aggregate median (
    basetype  = float4,
    stype     = bigint,
    initcond  = 0,
    sfunc     = median_sfunc_float4,
    finalfunc = median_finalfunc_float4
);
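For reference, a rough usage sketch of the aggregate defined above.
The table samples and its values are made up purely for illustration
(and the multi-row VALUES list needs a reasonably recent PostgreSQL);
with the five values below the median should come out as 3.

-- hypothetical example table, just to show how the aggregate is called
create table samples (x float4);
insert into samples values (1), (3), (2), (10), (4);

-- middle of the sorted values {1, 2, 3, 4, 10}
select median(x) from samples;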
Josh,

> At the end of this message is some code I used to find medians.
> It's kind of a hack, but approximately works, and is intended as a
> somewhat awkward stopgap for people who need to use medians.  It
> illustrates the limitations of the current aggregate function setup,
> which works so nicely for avg() and stddev().

Actually, finding the median is one of the classic SQL problems.  You
can't do it without 2 passes through the data set, disallowing the use
of traditional aggregate functions.  Joe Celko has half a chapter
devoted to various methods of finding the median.

Can I talk you into submitting your code to Techdocs?  I'd love to have
it somewhere where it won't get buried in the mailing list archives.

--
-Josh Berkus
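To make the two-pass idea concrete, here is a minimal sketch in plain
SQL of one classic formulation: count the rows in one pass, then sort
and pull out the one (odd count) or two (even count) center values and
average them.  The table samples and column x are assumed names, and
the subqueries in LIMIT/OFFSET need a PostgreSQL release new enough to
accept expressions there.

select avg(x) as median
from (
    select x
    from samples
    order by x
    limit  2 - (select count(*) % 2 from samples)    -- 1 row if n is odd, 2 if even
    offset (select (count(*) - 1) / 2 from samples)  -- skip to the lower middle value
) as middle_rows;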
ACK!  Sorting to find a median is criminal.

"Introduction to Algorithms" by Thomas H. Cormen, Charles E. Leiserson,
Ronald L. Rivest, ISBN 0262031418, explains the better algorithm very
well.

Here is a freely available C++ template (written by me) for a bunch of
statistics (everything *but* the selection problem):
ftp://cap.connx.com/pub/tournament_software/STATS.HPP

It uses this template for improved summation accuracy:
ftp://cap.connx.com/pub/tournament_software/Kahan.Hpp

Here is an outline for selection.  I wrote it in C++, but a rewrite to
C is trivial:

// Quickselect: find Kth smallest of first N items in array A.
// The recursive routine finds the Kth smallest in A[Low..High].
// Etype: must have copy constructor, operator=, and operator<.
// Assumes a small Cutoff constant plus Swap() and InsertionSort()
// helpers; the nonrecursive driver is omitted.
template <class Etype>
void QuickSelect(Etype A[], int Low, int High, int k)
{
    if (Low + Cutoff > High)
        InsertionSort(&A[Low], High - Low + 1);
    else
    {
        // Sort Low, Middle, High (median-of-three pivot selection)
        int Middle = (Low + High) / 2;
        if (A[Middle] < A[Low])
            Swap(A[Low], A[Middle]);
        if (A[High] < A[Low])
            Swap(A[Low], A[High]);
        if (A[High] < A[Middle])
            Swap(A[Middle], A[High]);

        // Place pivot at position High-1
        Etype Pivot = A[Middle];
        Swap(A[Middle], A[High - 1]);

        // Begin partitioning
        int i, j;
        for (i = Low, j = High - 1;;)
        {
            while (A[++i] < Pivot)
                ;
            while (Pivot < A[--j])
                ;
            if (i < j)
                Swap(A[i], A[j]);
            else
                break;
        }

        // Restore pivot
        Swap(A[i], A[High - 1]);

        // Recurse: only this part changes from quicksort
        if (k < i)
            QuickSelect(A, Low, i - 1, k);
        else if (k > i)
            QuickSelect(A, i + 1, High, k);
    }
}

template <class Etype>
void QuickSelect(Etype A[], int N, int k)
{
    QuickSelect(A, 0, N - 1, k - 1);
}

If you want to use this stuff to improve statistics with vacuum, be my
guest.
Here is a program written in C that demonstrates two median/selection
computation techniques:

  * ACM Algorithm 727 (implementation by Sherif Hashem)
  * QuickSelect (implemented by me)

Since it is written in C, it would be useful to the PostgreSQL project
without any fanfare.

ftp://cap.connx.com/pub/chess-engines/new-approach/727.c

The ACM Algorithm 727 result is an approximation.  The QuickSelect
result is exact.
Josh Burdick <jburdick@gradient.cis.upenn.edu> writes:
> illustrates the limitations of the current aggregate function setup,
> which works so nicely for avg() and stddev().

> I don't have any good solutions.  I tried using a float4[] to store
> each element as it's added, but I couldn't get array updates working
> in PL/PgSQL, so that didn't help.

> Perhaps aggregate functions could be passed an array?  Or a cursor,
> pointing at the first line?  I'm not sure.

I don't think that would help.  The real problem here is the amount of
internal storage needed.  AFAIK there are no exact algorithms for
finding the median that require less than O(N) workspace for N input
items.  Your "hack" with a temporary table is not a bad approach if you
want it to work for large N.

There are algorithms out there for finding approximate medians using
limited workspace; it might be reasonable to transform one of these
into a Postgres aggregate function.

			regards, tom lane
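As a footnote to the storage point: on later PostgreSQL releases the
float4[] approach Josh tried does work, precisely because the
transition state is allowed to grow to O(N).  A minimal sketch is
below, assuming array_append, unnest, array_agg(... ORDER BY ...) and
dollar quoting are available (none of which existed in 2002); the whole
input ends up in the array state, which is exactly the workspace cost
described above.

-- final function: sort the accumulated values and take the middle one,
-- or the average of the two middle ones
create function median_final(float4[]) returns float4 as $$
declare
    sorted float4[];
    n int;
begin
    n := array_length($1, 1);
    if n is null then
        return null;                 -- no input rows
    end if;
    select array_agg(v order by v) into sorted from unnest($1) as v;
    if n % 2 = 1 then
        return sorted[(n + 1) / 2];
    else
        return (sorted[n / 2] + sorted[n / 2 + 1]) / 2;
    end if;
end;
$$ language plpgsql;

-- the transition function is just array_append: the state array holds
-- every input value, i.e. O(N) workspace
create aggregate median(float4) (
    sfunc     = array_append,
    stype     = float4[],
    initcond  = '{}',
    finalfunc = median_final
);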
On Fri, 2002-05-31 at 01:16, Josh Burdick wrote:
> BUG: this isn't properly set up to deal with multiple users.
> For example, if A computes a median, then B could read the data
> from the median_tmp table.  Possibly you could fiddle with
> transaction isolation levels, or add a user field to median_tmp,
> or something else complicated, to prevent this, but for now I'm
> not worrying about this.

You could just use temp tables and indexes - they are local to the
connection.

create TEMP sequence median_id;

create TEMP table median_tmp (
    median_id int,
    x float4
);

---------------
Hannu
> -----Original Message-----
> From: Hannu Krosing [mailto:hannu@tm.ee]
> Sent: Thursday, May 30, 2002 1:17 PM
> To: Josh Burdick
> Cc: pgsql-hackers@postgresql.org; josh@agliodbs.com
> Subject: Re: [HACKERS] finding medians
>
> On Fri, 2002-05-31 at 01:16, Josh Burdick wrote:
> > BUG: this isn't properly set up to deal with multiple users.
> > For example, if A computes a median, then B could read the data
> > from the median_tmp table.  Possibly you could fiddle with
> > transaction isolation levels, or add a user field to median_tmp,
> > or something else complicated, to prevent this, but for now I'm
> > not worrying about this.
>
> You could just use temp tables and indexes - they are local to
> connection.
>
> create TEMP sequence median_id;
> create TEMP table median_tmp (
>     median_id int,
>     x float4
> );

Another pure SQL solution would be to perform an ORDER BY on the column
of interest.  Use a cursor to seek to the middle element.  If the data
set is odd, then the median is the center element.  If the data set is
even, the median is the average of the two center elements.  A SQL
function would be pretty easy.  Of course, it is not the most efficient
way to do it.

A nice thing about having the data ordered is that you can then extract
the statistic for any kth partition.  Hence, you could generate a
function quantile() and call quantile(0.5) to get the median.  If you
have a query:

    select quantile(.1, col), quantile(.2, col), quantile(.3, col),
           ... quantile(.9, col)
    from sometable

you only have to do the sort once and then operate on the sorted data.
For queries like that, sorting is probably just fine, since the
selection algorithm is only approximately linear for each single
instance.
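For concreteness, here is a rough PL/pgSQL sketch of that cursor idea,
written as a standalone quantile(p) function rather than an aggregate.
The table sometable and column col are just the hypothetical names from
the query above, it needs a PL/pgSQL version whose FETCH supports an
ABSOLUTE count, and for simplicity it returns the lower of the two
center elements instead of averaging when the row count is even.

create function quantile(p float8) returns float4 as '
declare
    c refcursor;
    n bigint;
    k bigint;
    v float4;
begin
    select count(*) into n from sometable;
    if n = 0 then
        return null;
    end if;

    -- 1-based position of the p-th quantile in the sorted column
    k := ceil(p * n);
    if k < 1 then
        k := 1;
    end if;

    open c for select col from sometable order by col;
    fetch absolute k from c into v;   -- seek straight to the k-th row
    close c;
    return v;
end;
' language 'plpgsql';

-- quantile(0.5) is the median; quantile(0.1) .. quantile(0.9) give deciles.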