Re: Statistics with PostgreSQL - Mailing list pgsql-general
From | Dann Corbit |
---|---|
Subject | Re: Statistics with PostgreSQL |
Date | |
Msg-id | D425483C2C5C9F49B5B7A41F89441547055A09@postal.corporate.connx.com Whole thread Raw |
In response to | Statistics with PostgreSQL (Hrishikesh Deshmukh <hdeshmuk@gmail.com>) |
Responses |
Re: Statistics with PostgreSQL
|
List | pgsql-general |
Mean is just sum(col)/count(col) Mode can be calculated with having, max, count Median can be computed by sorting, using a cursor, and going to the middle. There are more efficient and better (more accurate) ways to do it, but those have to be implemented at a low level. Of course, since you have libpq, anything is possible. If you want to implement these things at a low level to get better answers, Kahan (or compenstated) summation is a good idea, and do the summation into a larger type to prevent loss of precision. Here are some statistical templates I wrote that are free for any purpose you like: ftp://cap.connx.com/pub/tournament_software/Kahan.Hpp ftp://cap.connx.com/pub/tournament_software/STATS.HPP The Cephes collection by Moshier has good extended precision types, if you need to carefully avoid any PLOSS. For median, the QuickSelect algorithm is very good. Here is an implementation I wrote in C++: #include <cstdio> #include <cstdlib> #include <iostream> using namespace std; /* ** ** In the following code, every reference to CLR means: ** ** "Introduction to Algorithms" ** By Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest ** ISBN 0-07-013143-0 */ /* ** CLR, page 187 */ template < class Etype > Etype RandomSelect(Etype A[], size_t p, size_t r, size_t i) { size_t q, k; if (p == r) return A[p]; q = RandomPartition(A, p, r); k = q - p + 1; if (i <= k) return RandomSelect(A, p, q, i); else return RandomSelect(A, q + 1, r, i - k); } size_t RandRange(size_t a, size_t b) { size_t c = (size_t) ((double) rand() / ((double) RAND_MAX + 1) * (b - a)); return c + a; } /* ** CLR, page 162 */ template < class Etype > size_t RandomPartition(Etype A[], size_t p, size_t r) { size_t i = RandRange(p, r); Etype Temp; Temp = A[p]; A[p] = A[i]; A[i] = Temp; return Partition(A, p, r); } /* ** CLR, page 154 */ template < class Etype > size_t Partition(Etype A[], size_t p, size_t r) { Etype x, temp; size_t i, j; x = A[p]; i = p - 1; j = r + 1; for (;;) { do { j--; } while (!(A[j] <= x)); do { i++; } while (!(A[i] >= x)); if (i < j) { temp = A[i]; A[i] = A[j]; A[j] = temp; } else return j; } } double data[30]; int main(void) { size_t i; size_t size = sizeof(data) / sizeof(data[0]); for (i = 0; i < size; i++) { data[i] = rand(); } for (i = 0; i < size; i++) { cout << data[i] << endl; } cout << "1st item is " << RandomSelect(data, 0, size - 1, 0) << endl; cout << "2nd item is " << RandomSelect(data, 0, size - 1, 1) << endl; cout << "3rd item is " << RandomSelect(data, 0, size - 1, 2) << endl; for (i = 4; i < size; i++) cout << i << "th item is " << RandomSelect(data, 0, size - 1, i) << endl; return 0; } -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Hrishikesh Deshmukh Sent: Friday, March 18, 2005 10:37 AM To: Postgresql-General Subject: [GENERAL] Statistics with PostgreSQL Hi All, Is there a way to simple statistics like mean/median/mode in PostgreSQL. I have tables like PsetID | IntensityValue. I want to find out mean (intensityValue) of some PsetID(s)?! Any urls/pointers/books would be a big help. Thanks, Hrishi ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly
pgsql-general by date: