Thread: MEDIAN as custom aggregate?
Folks, Hey, anybody have a custom aggregate for median calucation? I'm doing this through a PL/pgSQL function, and a custom aggregate would probably be faster. For those whose stats terminology is rusty, the "median" is the "middle" value in a distribution. For example, if we had the following data: Table ages person age Jim 21 Rusty 24 Carol 37 Bob 62 Leah 78 Our Median would be Carol's age, 37. This is a different figure from the Mean, or Average, which is 44.4. Using the combination of the Mean and the Median you can do all kinds of interesting statistical analysis. -Josh Berkus ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete informationtechnology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Can't you do something like select age from ages order by age limit 1 offset (select count(*) from ages)/2; except you can't nest the select so you'll have to use a variable to hold it... Make sure it does the right thing when there is an odd number of rows. I don't understand why you want the median and not some parameters of your assumed distribution (mean and variance, for example)but each to his own... Allan. Josh Berkus wrote: > Folks, > > Hey, anybody have a custom aggregate for median calucation? I'm doing > this through a PL/pgSQL function, and a custom aggregate would probably > be faster. > > For those whose stats terminology is rusty, the "median" is the "middle" > value in a distribution. For example, if we had the following data: > > Table ages > person age > Jim 21 > Rusty 24 > Carol 37 > Bob 62 > Leah 78 > > Our Median would be Carol's age, 37. This is a different figure from > the Mean, or Average, which is 44.4. Using the combination of the Mean > and the Median you can do all kinds of interesting statistical analysis. > > -Josh Berkus > > ______AGLIO DATABASE SOLUTIONS___________________________ > Josh Berkus > Complete information technology josh@agliodbs.com > and data management solutions (415) 565-7293 > for law firms, small businesses fax 621-2533 > and non-profit organizations. San Francisco > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
Allan, > Can't you do something like > > select age from ages order by age limit 1 offset (select count(*) > from ages)/2; > > except you can't nest the select so you'll have to use a variable to > hold it... > > Make sure it does the right thing when there is an odd number of > rows. Duuuuuuuh! I feel like a real idiot now. The query I'll use is this: SELECT site, COUNT(pageviews) as count_views, AVG(pageviews) AS mean_views, median_views FROM sites, (SELECT pageviews as median_view FROM pageviews LIMIT 1 OFFSET middlerec('pageviews')) med GROUP BY site, median_views; Where middlerec is a custom function that counts the records and returns the middle one. -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete informationtechnology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
"Josh Berkus" <josh@agliodbs.com> writes: > The query I'll use is this: > SELECT site, COUNT(pageviews) as count_views, AVG(pageviews) AS > mean_views, median_views > FROM sites, > (SELECT pageviews as median_view > FROM pageviews > LIMIT 1 OFFSET middlerec('pageviews')) med > GROUP BY site, median_views; > Where middlerec is a custom function that counts the records and returns > the middle one. Um ... does that work? I thought LIMIT was fairly restrictive about what it would take as a parameter --- like, constants or $n parameters only. I do not know of any median-finding algorithm that doesn't require a depressingly large amount of storage... regards, tom lane
On Fri, 12 Oct 2001 12:38:12 -0700 "Josh Berkus" wrote: > For those whose stats terminology is rusty, the "median" is the "middle" > value in a distribution. For example, if we had the following data: > > Table ages > person age > Jim 21 > Rusty 24 > Carol 37 > Bob 62 > Leah 78 > > Our Median would be Carol's age, 37. This is a different figure from > the Mean, or Average, which is 44.4. Using the combination of the Mean > and the Median you can do all kinds of interesting statistical analysis. > In such case of this, there needs "nextval" in a query to deal with a large number of rows. I think the following query, is not smart, will return the necessary rows (or an average of the rows). But even so it may need considerable time... -- (on 7.1.3) create sequence seq_ages start 1; select a1.age, a1.rank -- or select avg(a1.age) from (select a0.person, a0.age, (nextval('seq_ages') - 1) as rank from (select *, setval('seq_ages',1) -- to reset a sequence from ages order by age -- this insignificant "order by" is -- needed in order to work "setval" )as a0 order by a0.age ) as a1 where exists (select * from ages where a1.rank >= (select(count(*)+1)/2 from ages) and a1.rank <= (select count(*)/2+1 from ages) ) ; Regards, Masaru Sugawara
Tom, > Um ... does that work? I thought LIMIT was fairly restrictive about > what it would take as a parameter --- like, constants or $n > parameters > only. > > I do not know of any median-finding algorithm that doesn't require a > depressingly large amount of storage... Me neither. You're right; the query didn't work. Here's a link to the median-finding function I posted to the CookBook: http://www.brasileiro.net/postgres/cookbook/view-one-recipe.adp?recipe_id=1654 Anyone who can improve it is welcome! -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete informationtechnology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco