Thread: how to identify outliers

how to identify outliers

From

"Rhys A.D. Stewart"

Date:

27 October 2009, 18:36:22

Hey all,
I have the following table:  data(pnum text, distance float8, route text).
I would like to remove the outliers in distance, i.e. lets say i get
the avg dist of pnum for each route and the std deviation of the
distance what is the best way to identify the outliers?


Rhys.

Re: how to identify outliers

From

"Rhys A.D. Stewart"

Date:

27 October 2009, 19:37:20

Im asking how to get the ones that dont fall near the avg.... so for
example lets say i have the following distances:
10,11,12,11,10,9,9,10,11,12,10,11,99

then 99 would be an outlier. the avg would be like 16 or 17 i reckon
with the 99. so i want a way to find aan outlier, remove it and then
recalcuate the avg...and then i'd get a 'better' avg.....

i did some seraching about outliers and most of hits are about R or
spss or some other statistical package.....so looking for a way to do
it wholly in pgsql.

Rhys

On Tue, Oct 27, 2009 at 4:53 PM, Ben Chobot <bench@silentmedia.com> wrote:
> Are you asking how to find the average and standard deviation? Or how to
> compare the your data against some set values? Perhaps an example would be
> appropriate; it's not very clear to me what you're asking.
>
> Rhys A.D. Stewart wrote:
>>
>> Hey all,
>> I have the following table:  data(pnum text, distance float8, route text).
>> I would like to remove the outliers in distance, i.e. lets say i get
>> the avg dist of pnum for each route and the std deviation of the
>> distance what is the best way to identify the outliers?
>>
>>
>> Rhys.
>>
>>
>

Re: how to identify outliers

From

Alvaro Herrera

Date:

27 October 2009, 19:56:22

Rhys A.D. Stewart escribió:

> i did some seraching about outliers and most of hits are about R or
> spss or some other statistical package.....so looking for a way to do
> it wholly in pgsql.

Well, then, maybe PL/R?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: how to identify outliers

From

Scott Bailey

Date:

27 October 2009, 20:05:05

Rhys A.D. Stewart wrote:
> Im asking how to get the ones that dont fall near the avg.... so for
> example lets say i have the following distances:
> 10,11,12,11,10,9,9,10,11,12,10,11,99
>
> then 99 would be an outlier. the avg would be like 16 or 17 i reckon
> with the 99. so i want a way to find aan outlier, remove it and then
> recalcuate the avg...and then i'd get a 'better' avg.....
>
> i did some seraching about outliers and most of hits are about R or
> spss or some other statistical package.....so looking for a way to do
> it wholly in pgsql.
>
>
> Rhys
>
> On Tue, Oct 27, 2009 at 4:53 PM, Ben Chobot <bench@silentmedia.com> wrote:
>> Are you asking how to find the average and standard deviation? Or how to
>> compare the your data against some set values? Perhaps an example would be
>> appropriate; it's not very clear to me what you're asking.
>>
>> Rhys A.D. Stewart wrote:
>>> Hey all,
>>> I have the following table:  data(pnum text, distance float8, route text).
>>> I would like to remove the outliers in distance, i.e. lets say i get
>>> the avg dist of pnum for each route and the std deviation of the
>>> distance what is the best way to identify the outliers?
>>>
>>>
>>> Rhys.

Oh, so you want to "cook" your data? I don't agree with that
conceptually, but:


WITH base AS (
   SELECT random(1, 100)::int AS i
   FROM generate_series(1, 100) i
),
stats AS (
   SELECT avg(i) AS dist_avg, stddev(i) AS dist_dev FROM base
)
SELECT count(i), avg(i) AS new_avg, MIN(stats.dist_avg) AS old_avg,
   stddev(i) AS new_dev, MIN(stats.dist_dev) AS old_dev
FROM base, stats
WHERE base.i BETWEEN stats.dist_avg - dist_dev
   AND stats.dist_avg + dist_dev


Scott Bailey

Re: how to identify outliers

From

John R Pierce

Date:

27 October 2009, 20:58:47

Rhys A.D. Stewart wrote:
> Hey all,
> I have the following table:  data(pnum text, distance float8, route text).
> I would like to remove the outliers in distance, i.e. lets say i get
> the avg dist of pnum for each route and the std deviation of the
> distance what is the best way to identify the outliers?
>

i dunno.  brute force, two passes, one to identify the MIN and MAX of
the values, then another SELECT avg(value) ....   WHERE (....) AND val
!= minval AND val != maxval.

you could probably do something with a standard deviation that is more
accurate for large sets than just tossing the 2 outliers.

Re: how to identify outliers

From

Sam Mason

Date:

28 October 2009, 08:00:50

> Rhys A.D. Stewart wrote:
> >I would like to remove the outliers in distance

As others have said; an "outlier" is normally a human call and not
something that's generally valid to do automatically.  The operator
would probably want to go in and look to see why it's that far out and
either fix the typo or do whatever else it takes to "fix" the problem.

On Tue, Oct 27, 2009 at 04:58:23PM -0700, John R Pierce wrote:
> you could probably do something with a standard deviation that is more
> accurate for large sets than just tossing the 2 outliers.

I'd agree, stddev is probably best and the following should do something
reasonable for what the OP was asking:

  SELECT d.*
  FROM data d, (
    SELECT avg(distance), stddev(distance) FROM data) x
  WHERE abs(d.distance - x.avg) < x.stddev * 2;

--
  Sam  http://samason.me.uk/

Re: how to identify outliers

From

"Chris Spotts"

Date:

28 October 2009, 09:34:15

>
> I'd agree, stddev is probably best and the following should do
> something
> reasonable for what the OP was asking:
>
>   SELECT d.*
>   FROM data d, (
>     SELECT avg(distance), stddev(distance) FROM data) x
>   WHERE abs(d.distance - x.avg) < x.stddev * 2;
>
[Spotts, Christopher]
Statistically speaking if you dataset is of a fairly normal distribution the
following works "well" and is a *fairly* standard outlier definition.

First get a median function (there's a million of them on the net, doogle a
google).
You'll need one pass to get the median.
Divide your data set in half based on that median.
Get the median of the first half (this is Q1).
Get the median of the second half (this is Q3).
Then your range for your good data should be from (Q1 - (Q3-Q1)*1.5) TO (Q3
+ (Q3-Q1)*1.5).
Anything outside that range is an outlier.  Adjust the 1.5 up or down to be
more or less aggressive.

Using the "avg" formula for outliers is bad news.

I HIGHLY suggest installing PL/R for this, it makes it trivial.

Chris