Thread: MEDIAN as custom aggregate?

MEDIAN as custom aggregate?

From
"Josh Berkus"
Date:
Folks,

Hey, anybody have a custom aggregate for median calucation?  I'm doing
this through a PL/pgSQL function, and a custom aggregate would probably
be faster.

For those whose stats terminology is rusty, the "median" is the "middle"
value in a distribution.  For example, if we had the following data:

Table ages
person        age
Jim        21
Rusty        24
Carol        37
Bob        62
Leah        78

Our Median would be Carol's age, 37.  This is a different figure from
the Mean, or Average, which is 44.4.  Using the combination of the Mean
and the Median you can do all kinds of interesting statistical analysis.

-Josh Berkus


______AGLIO DATABASE SOLUTIONS___________________________                                      Josh Berkus Complete
informationtechnology      josh@agliodbs.com  and data management solutions       (415) 565-7293 for law firms, small
businesses       fax 621-2533   and non-profit organizations.      San Francisco
 


Re: MEDIAN as custom aggregate?

From
Allan Engelhardt
Date:
Can't you do something like

select age from ages order by age limit 1 offset (select count(*) from ages)/2;

except you can't nest the select so you'll have to use a variable to hold it...

Make sure it does the right thing when there is an odd number of rows.


I don't understand why you want the median and not some parameters of your assumed distribution (mean and variance, for
example)but each to his own...
 


Allan.

Josh Berkus wrote:

> Folks,
>
> Hey, anybody have a custom aggregate for median calucation?  I'm doing
> this through a PL/pgSQL function, and a custom aggregate would probably
> be faster.
>
> For those whose stats terminology is rusty, the "median" is the "middle"
> value in a distribution.  For example, if we had the following data:
>
> Table ages
> person          age
> Jim             21
> Rusty           24
> Carol           37
> Bob             62
> Leah            78
>
> Our Median would be Carol's age, 37.  This is a different figure from
> the Mean, or Average, which is 44.4.  Using the combination of the Mean
> and the Median you can do all kinds of interesting statistical analysis.
>
> -Josh Berkus
>
> ______AGLIO DATABASE SOLUTIONS___________________________
>                                        Josh Berkus
>   Complete information technology      josh@agliodbs.com
>    and data management solutions       (415) 565-7293
>   for law firms, small businesses        fax 621-2533
>     and non-profit organizations.      San Francisco
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly


Re: MEDIAN as custom aggregate?

From
"Josh Berkus"
Date:
Allan,

> Can't you do something like
> 
> select age from ages order by age limit 1 offset (select count(*)
> from ages)/2;
> 
> except you can't nest the select so you'll have to use a variable to
> hold it...
> 
> Make sure it does the right thing when there is an odd number of
> rows.

Duuuuuuuh!  I feel like a real idiot now.

The query I'll use is this:

SELECT site, COUNT(pageviews) as count_views, AVG(pageviews) AS
mean_views, median_views
FROM sites, (SELECT pageviews as median_view FROM pageviews LIMIT 1 OFFSET middlerec('pageviews')) med
GROUP BY site, median_views;

Where middlerec is a custom function that counts the records and returns
the middle one.

-Josh

______AGLIO DATABASE SOLUTIONS___________________________                                      Josh Berkus Complete
informationtechnology      josh@agliodbs.com  and data management solutions       (415) 565-7293 for law firms, small
businesses       fax 621-2533   and non-profit organizations.      San Francisco
 


Re: MEDIAN as custom aggregate?

From
Tom Lane
Date:
"Josh Berkus" <josh@agliodbs.com> writes:
> The query I'll use is this:

> SELECT site, COUNT(pageviews) as count_views, AVG(pageviews) AS
> mean_views, median_views
> FROM sites, 
>     (SELECT pageviews as median_view
>      FROM pageviews
>      LIMIT 1 OFFSET middlerec('pageviews')) med
> GROUP BY site, median_views;

> Where middlerec is a custom function that counts the records and returns
> the middle one.

Um ... does that work?  I thought LIMIT was fairly restrictive about
what it would take as a parameter --- like, constants or $n parameters
only.

I do not know of any median-finding algorithm that doesn't require a
depressingly large amount of storage...
        regards, tom lane


Re: MEDIAN as custom aggregate?

From
Masaru Sugawara
Date:
On Fri, 12 Oct 2001 12:38:12 -0700
"Josh Berkus" wrote:

> For those whose stats terminology is rusty, the "median" is the "middle"
> value in a distribution.  For example, if we had the following data:
> 
> Table ages
> person        age
> Jim        21
> Rusty        24
> Carol        37
> Bob        62
> Leah        78
> 
> Our Median would be Carol's age, 37.  This is a different figure from
> the Mean, or Average, which is 44.4.  Using the combination of the Mean
> and the Median you can do all kinds of interesting statistical analysis.
> 

In such case of this, there needs "nextval" in a query to deal with
a large number of rows. I think the following query, is not smart, 
will return the necessary rows (or an average of the rows). But even
so it may need considerable time...


-- (on 7.1.3)
create sequence seq_ages start 1;

select a1.age, a1.rank   -- or select avg(a1.age) from (select a0.person, a0.age, (nextval('seq_ages') - 1) as rank
   from (select *, setval('seq_ages',1)   -- to reset a sequence                 from ages                 order by age
 -- this insignificant "order by" is                                --  needed in order to work "setval"
)as a0          order by a0.age       ) as a1 where exists (select * from ages                  where a1.rank >=
(select(count(*)+1)/2 from ages)                    and a1.rank <= (select  count(*)/2+1  from ages)               )
 
;



Regards,
Masaru Sugawara



Re: MEDIAN as custom aggregate?

From
"Josh Berkus"
Date:
Tom,

> Um ... does that work?  I thought LIMIT was fairly restrictive about
> what it would take as a parameter --- like, constants or $n
> parameters
> only.
> 
> I do not know of any median-finding algorithm that doesn't require a
> depressingly large amount of storage...

Me neither.  You're right; the query didn't work.  

Here's a link to the median-finding function I posted to the CookBook:

http://www.brasileiro.net/postgres/cookbook/view-one-recipe.adp?recipe_id=1654

Anyone who can improve it is welcome!

-Josh

______AGLIO DATABASE SOLUTIONS___________________________                                      Josh Berkus Complete
informationtechnology      josh@agliodbs.com  and data management solutions       (415) 565-7293 for law firms, small
businesses       fax 621-2533   and non-profit organizations.      San Francisco