Thread: Median
Hello,
I am trying to do a Median or Trimmed-mean operation in postgreSQL. I was wondering if anybody knew how to do it. I appreciate it if you reply to my email address.
Thanks,
Bahram Kermani
Hi, I'll be glad if you describe more details about the problem. Is it a kind of statistical analysis or what? Omid Omoomi >From: "Kermani, Bahram" <BKermani@illumina.com> >To: "'pgsql-sql@postgresql.org'" <pgsql-sql@postgresql.org> >Subject: [SQL] Median >Date: Fri, 30 Jun 2000 17:37:06 -0700 > >Hello, > >I am trying to do a Median or Trimmed-mean operation in postgreSQL. I was >wondering if anybody knew how to do it. I appreciate it if you reply to my >email address. > >Thanks, >Bahram Kermani >bkermani@illumina.com <mailto:bkermani@illumina.com> ________________________________________________________________________ Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com
omid omoomi wrote: > Hi, > I'll be glad if you describe more details about the problem. Is it a kind of > statistical analysis or what? > Omid Omoomi Sorry to respond that slow. It's something, Ulf Mehlig described to me a couple of months ago. The median is the value, below and above of which exactly half of the sorted items would be. Having an odd number of entries, it's exactly the one in themiddle like 1 2 3 4 5 6 7 ------|------ M = 4 Having an even number of entries, it's the average between the two in the middle like 1 2 3 4 5 6 7 8 -------|------- M = (4 + 5) / 2 As he said, other ranges like Quartile (the position where 25% of the entries are below and 75% are above) wouldalso be of interest. So the most useful thing would be an aggregate like qantil(n), where n is value between 0.0 and 100.0, so that quantil(50.0) is the Median, quantil(25.0) is the first Quartile and so on. I don't see any quick solution how to solve this problem with an aggregate. Aggregates get all selected values in unsorted order, and don't know ahead how many items there will be. Even if, all this wouldn't be of any use, becauseyou need to look at the entire sorted list of selected items. Maybe someone else has an idea. > > >From: "Kermani, Bahram" <BKermani@illumina.com> > >To: "'pgsql-sql@postgresql.org'" <pgsql-sql@postgresql.org> > >Subject: [SQL] Median > >Date: Fri, 30 Jun 2000 17:37:06 -0700 > > > >Hello, > > > >I am trying to do a Median or Trimmed-mean operation in postgreSQL. I was > >wondering if anybody knew how to do it. I appreciate it if you reply to my > >email address. > > > >Thanks, > >Bahram Kermani > >bkermani@illumina.com <mailto:bkermani@illumina.com> > > ________________________________________________________________________ > Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com > Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
JanWieck@t-online.de (Jan Wieck) writes: > I don't see any quick solution how to solve this problem with > an aggregate. Aggregates get all selected values in unsorted > order, and don't know ahead how many items there will be. > Even if, all this wouldn't be of any use, because you need to > look at the entire sorted list of selected items. Hmm. It would be a pretty straightforward extension of the existing support for DISTINCT aggregates to allow the agg function to receive all the inputs in sorted order along with a count of how many there are, whereupon a percentile aggregate would be trivial. Slow, but trivial. Probably a better way would be to skip evaluating the agg's transition function as such, and instead call the agg's final function just once with a pointer to the tuplesort object that contains the sorted input data. Then you reach in and pull out just the items you want, instead of having to read 'em all. Or you can scan 'em if you want. We might need to add a few features to the tuplesort API to allow access to the N'th item in the sorted data, but it's surely doable. Anyone care to work up a detailed proposal for something along this line? It seems like more work than it's worth to me, but if someone else wants to do the legwork... regards, tom lane
> Maybe someone else has an idea. I implemented a different algorithm several years ago. It is an O(log(N)) process (unlike most other techniques), and was borrowed from the "Algorithms" book (it's at work; but it is the classic "yellow jacket" book with Fortran code and the other volume with *really bad* C code which looks like Fortran ;) Anyway, once the bugs were fixed (and there were several :( and a better endgame algorithm was coded, it was suitable for hypercube distributed processing, where you just exchange a small amount of info between iterations. And it involved simply *counting* how many values were above and below an initial or updated guess, then iterating on a new guess. afaik that wouldn't be suitable for the existing aggregate capabilities, but multi-pass algorithms would be nice to be able to do. - Thomas