Thread: how to identify outliers
Hey all, I have the following table: data(pnum text, distance float8, route text). I would like to remove the outliers in distance, i.e. lets say i get the avg dist of pnum for each route and the std deviation of the distance what is the best way to identify the outliers? Rhys.
Im asking how to get the ones that dont fall near the avg.... so for example lets say i have the following distances: 10,11,12,11,10,9,9,10,11,12,10,11,99 then 99 would be an outlier. the avg would be like 16 or 17 i reckon with the 99. so i want a way to find aan outlier, remove it and then recalcuate the avg...and then i'd get a 'better' avg..... i did some seraching about outliers and most of hits are about R or spss or some other statistical package.....so looking for a way to do it wholly in pgsql. Rhys On Tue, Oct 27, 2009 at 4:53 PM, Ben Chobot <bench@silentmedia.com> wrote: > Are you asking how to find the average and standard deviation? Or how to > compare the your data against some set values? Perhaps an example would be > appropriate; it's not very clear to me what you're asking. > > Rhys A.D. Stewart wrote: >> >> Hey all, >> I have the following table: data(pnum text, distance float8, route text). >> I would like to remove the outliers in distance, i.e. lets say i get >> the avg dist of pnum for each route and the std deviation of the >> distance what is the best way to identify the outliers? >> >> >> Rhys. >> >> >
Rhys A.D. Stewart escribió: > i did some seraching about outliers and most of hits are about R or > spss or some other statistical package.....so looking for a way to do > it wholly in pgsql. Well, then, maybe PL/R? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Rhys A.D. Stewart wrote: > Im asking how to get the ones that dont fall near the avg.... so for > example lets say i have the following distances: > 10,11,12,11,10,9,9,10,11,12,10,11,99 > > then 99 would be an outlier. the avg would be like 16 or 17 i reckon > with the 99. so i want a way to find aan outlier, remove it and then > recalcuate the avg...and then i'd get a 'better' avg..... > > i did some seraching about outliers and most of hits are about R or > spss or some other statistical package.....so looking for a way to do > it wholly in pgsql. > > > Rhys > > On Tue, Oct 27, 2009 at 4:53 PM, Ben Chobot <bench@silentmedia.com> wrote: >> Are you asking how to find the average and standard deviation? Or how to >> compare the your data against some set values? Perhaps an example would be >> appropriate; it's not very clear to me what you're asking. >> >> Rhys A.D. Stewart wrote: >>> Hey all, >>> I have the following table: data(pnum text, distance float8, route text). >>> I would like to remove the outliers in distance, i.e. lets say i get >>> the avg dist of pnum for each route and the std deviation of the >>> distance what is the best way to identify the outliers? >>> >>> >>> Rhys. Oh, so you want to "cook" your data? I don't agree with that conceptually, but: WITH base AS ( SELECT random(1, 100)::int AS i FROM generate_series(1, 100) i ), stats AS ( SELECT avg(i) AS dist_avg, stddev(i) AS dist_dev FROM base ) SELECT count(i), avg(i) AS new_avg, MIN(stats.dist_avg) AS old_avg, stddev(i) AS new_dev, MIN(stats.dist_dev) AS old_dev FROM base, stats WHERE base.i BETWEEN stats.dist_avg - dist_dev AND stats.dist_avg + dist_dev Scott Bailey
Rhys A.D. Stewart wrote: > Hey all, > I have the following table: data(pnum text, distance float8, route text). > I would like to remove the outliers in distance, i.e. lets say i get > the avg dist of pnum for each route and the std deviation of the > distance what is the best way to identify the outliers? > i dunno. brute force, two passes, one to identify the MIN and MAX of the values, then another SELECT avg(value) .... WHERE (....) AND val != minval AND val != maxval. you could probably do something with a standard deviation that is more accurate for large sets than just tossing the 2 outliers.
> Rhys A.D. Stewart wrote: > >I would like to remove the outliers in distance As others have said; an "outlier" is normally a human call and not something that's generally valid to do automatically. The operator would probably want to go in and look to see why it's that far out and either fix the typo or do whatever else it takes to "fix" the problem. On Tue, Oct 27, 2009 at 04:58:23PM -0700, John R Pierce wrote: > you could probably do something with a standard deviation that is more > accurate for large sets than just tossing the 2 outliers. I'd agree, stddev is probably best and the following should do something reasonable for what the OP was asking: SELECT d.* FROM data d, ( SELECT avg(distance), stddev(distance) FROM data) x WHERE abs(d.distance - x.avg) < x.stddev * 2; -- Sam http://samason.me.uk/
> > I'd agree, stddev is probably best and the following should do > something > reasonable for what the OP was asking: > > SELECT d.* > FROM data d, ( > SELECT avg(distance), stddev(distance) FROM data) x > WHERE abs(d.distance - x.avg) < x.stddev * 2; > [Spotts, Christopher] Statistically speaking if you dataset is of a fairly normal distribution the following works "well" and is a *fairly* standard outlier definition. First get a median function (there's a million of them on the net, doogle a google). You'll need one pass to get the median. Divide your data set in half based on that median. Get the median of the first half (this is Q1). Get the median of the second half (this is Q3). Then your range for your good data should be from (Q1 - (Q3-Q1)*1.5) TO (Q3 + (Q3-Q1)*1.5). Anything outside that range is an outlier. Adjust the 1.5 up or down to be more or less aggressive. Using the "avg" formula for outliers is bad news. I HIGHLY suggest installing PL/R for this, it makes it trivial. Chris