Re: how to identify outliers - Mailing list pgsql-general

From Scott Bailey
Subject Re: how to identify outliers
Date
Msg-id 4AE77C8F.1000900@comcast.net
Whole thread Raw
In response to Re: how to identify outliers  ("Rhys A.D. Stewart" <rhys.stewart@gmail.com>)
List pgsql-general
Rhys A.D. Stewart wrote:
> Im asking how to get the ones that dont fall near the avg.... so for
> example lets say i have the following distances:
> 10,11,12,11,10,9,9,10,11,12,10,11,99
>
> then 99 would be an outlier. the avg would be like 16 or 17 i reckon
> with the 99. so i want a way to find aan outlier, remove it and then
> recalcuate the avg...and then i'd get a 'better' avg.....
>
> i did some seraching about outliers and most of hits are about R or
> spss or some other statistical package.....so looking for a way to do
> it wholly in pgsql.
>
>
> Rhys
>
> On Tue, Oct 27, 2009 at 4:53 PM, Ben Chobot <bench@silentmedia.com> wrote:
>> Are you asking how to find the average and standard deviation? Or how to
>> compare the your data against some set values? Perhaps an example would be
>> appropriate; it's not very clear to me what you're asking.
>>
>> Rhys A.D. Stewart wrote:
>>> Hey all,
>>> I have the following table:  data(pnum text, distance float8, route text).
>>> I would like to remove the outliers in distance, i.e. lets say i get
>>> the avg dist of pnum for each route and the std deviation of the
>>> distance what is the best way to identify the outliers?
>>>
>>>
>>> Rhys.

Oh, so you want to "cook" your data? I don't agree with that
conceptually, but:


WITH base AS (
   SELECT random(1, 100)::int AS i
   FROM generate_series(1, 100) i
),
stats AS (
   SELECT avg(i) AS dist_avg, stddev(i) AS dist_dev FROM base
)
SELECT count(i), avg(i) AS new_avg, MIN(stats.dist_avg) AS old_avg,
   stddev(i) AS new_dev, MIN(stats.dist_dev) AS old_dev
FROM base, stats
WHERE base.i BETWEEN stats.dist_avg - dist_dev
   AND stats.dist_avg + dist_dev


Scott Bailey

pgsql-general by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: how to identify outliers
Next
From: Merlin Moncure
Date:
Subject: Re: Slow running query with views...how to increase efficiency? with index?