Thread: Statistics with PostgreSQL

Statistics with PostgreSQL

From
Hrishikesh Deshmukh
Date:
Hi All,

Is there a way to simple statistics like mean/median/mode in PostgreSQL.
I have tables like PsetID | IntensityValue. I want to find out mean
(intensityValue) of some PsetID(s)?!
Any urls/pointers/books would be a big help.

Thanks,
Hrishi

Re: Statistics with PostgreSQL

From
David Fetter
Date:
On Fri, Mar 18, 2005 at 01:37:10PM -0500, Hrishikesh Deshmukh wrote:
> Hi All,
>
> Is there a way to simple statistics like mean/median/mode in
> PostgreSQL.  I have tables like PsetID | IntensityValue. I want to
> find out mean (intensityValue) of some PsetID(s)?!

> Any urls/pointers/books would be a big help.

Hrishi,

For statistics beyond avg() and stddev(), check out PL/R
http://www.joeconway.com/plr/

Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 510 893 6100   mobile: +1 415 235 3778

Remember to vote!

Re: Statistics with PostgreSQL

From
"Dann Corbit"
Date:
Mean is just sum(col)/count(col)
Mode can be calculated with having, max, count
Median can be computed by sorting, using a cursor, and going to the
middle.

There are more efficient and better (more accurate) ways to do it, but
those have to be implemented at a low level.  Of course, since you have
libpq, anything is possible.

If you want to implement these things at a low level to get better
answers, Kahan (or compenstated) summation is a good idea, and do the
summation into a larger type to prevent loss of precision.

Here are some statistical templates I wrote that are free for any
purpose you like:
ftp://cap.connx.com/pub/tournament_software/Kahan.Hpp
ftp://cap.connx.com/pub/tournament_software/STATS.HPP

The Cephes collection by Moshier has good extended precision types, if
you need to carefully avoid any PLOSS.

For median, the QuickSelect algorithm is very good.  Here is an
implementation I wrote in C++:

#include <cstdio>
#include <cstdlib>
#include <iostream>

using namespace std;

/*
**
** In the following code, every reference to CLR means:
**
**    "Introduction to Algorithms"
**    By Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest
**    ISBN 0-07-013143-0
*/


/*
** CLR, page 187
*/
template < class Etype >
Etype
RandomSelect(Etype A[], size_t p, size_t r, size_t i)
{
    size_t          q,
                    k;
    if (p == r)
        return A[p];
    q = RandomPartition(A, p, r);
    k = q - p + 1;

    if (i <= k)
        return RandomSelect(A, p, q, i);
    else
        return RandomSelect(A, q + 1, r, i - k);
}

size_t          RandRange(size_t a, size_t b)
{
    size_t          c = (size_t) ((double) rand() / ((double) RAND_MAX +
1) * (b - a));
    return c + a;
}

/*
** CLR, page 162
*/
template < class Etype >
size_t
RandomPartition(Etype A[], size_t p, size_t r)
{
    size_t          i = RandRange(p, r);
    Etype           Temp;
    Temp = A[p];
    A[p] = A[i];
    A[i] = Temp;
    return Partition(A, p, r);
}

/*
** CLR, page 154
*/
template < class Etype >
size_t
Partition(Etype A[], size_t p, size_t r)
{
    Etype           x,
                    temp;
    size_t          i,
                    j;

    x = A[p];
    i = p - 1;
    j = r + 1;

    for (;;) {
        do {
            j--;
        } while (!(A[j] <= x));
        do {
            i++;
        } while (!(A[i] >= x));
        if (i < j) {
            temp = A[i];
            A[i] = A[j];
            A[j] = temp;
        } else
            return j;
    }
}

double           data[30];
int             main(void)
{
    size_t          i;
    size_t          size = sizeof(data) / sizeof(data[0]);
    for (i = 0; i < size; i++) {
        data[i] = rand();
    }


    for (i = 0; i < size; i++) {
        cout << data[i] << endl;
    }

    cout << "1st item is " << RandomSelect(data, 0, size - 1, 0) <<
endl;
    cout << "2nd item is " << RandomSelect(data, 0, size - 1, 1) <<
endl;
    cout << "3rd item is " << RandomSelect(data, 0, size - 1, 2) <<
endl;
    for (i = 4; i < size; i++)
        cout  << i << "th item is " << RandomSelect(data, 0, size - 1,
i) << endl;
    return 0;
}

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Hrishikesh
Deshmukh
Sent: Friday, March 18, 2005 10:37 AM
To: Postgresql-General
Subject: [GENERAL] Statistics with PostgreSQL

Hi All,

Is there a way to simple statistics like mean/median/mode in PostgreSQL.
I have tables like PsetID | IntensityValue. I want to find out mean
(intensityValue) of some PsetID(s)?!
Any urls/pointers/books would be a big help.

Thanks,
Hrishi

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to majordomo@postgresql.org so that your
      message can get through to the mailing list cleanly

Re: Statistics with PostgreSQL

From
"Dann Corbit"
Date:
/* On the test stub: */
[snip]
double           data[30];
int             main(void)
{
    size_t          i;
    size_t          size = sizeof(data) / sizeof(data[0]);
    for (i = 0; i < size; i++) {
        data[i] = rand();
    }


    for (i = 0; i < size; i++) {
        cout << data[i] << endl;
    }

    cout << "1st item is " << RandomSelect(data, 0, size - 1, 0) <<
endl;
    cout << "2nd item is " << RandomSelect(data, 0, size - 1, 1) <<
endl;
    cout << "3rd item is " << RandomSelect(data, 0, size - 1, 2) <<
endl;
    for (i = 4; i < size; i++)
        cout  << i << "th item is " << RandomSelect(data, 0, size - 1,
i) << endl;
    return 0;
}

/*
The positions are cardinal, so the first three queries above should be:
    cout << "1st item is " << RandomSelect(data, 0, size - 1, 1) <<
endl;
    cout << "2nd item is " << RandomSelect(data, 0, size - 1, 2) <<
endl;
    cout << "3rd item is " << RandomSelect(data, 0, size - 1, 3) <<
endl;
*/


Re: Statistics with PostgreSQL

From
Brent Wood
Date:

> Mean is just sum(col)/count(col)

You can also just use avg(col).

Either way, be careful because nulls may not be treated as you want for
such calculations.

The stats package R can access Postgres databases, and can be used for
robust statistical analyses of the data.

See:
http://sourceforge.net/projects/rdbi/


Cheers,

  Brent Wood