Re: Query optimizer 8.0.1 (and 8.0) - Mailing list pgsql-hackers

From: Greg Stark
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Msg-id: 87pszbhkq9.fsf@stark.xeocode.com
In response to: Re: Query optimizer 8.0.1 (and 8.0) (pgsql@mohawksoft.com)
List: pgsql-hackers
pgsql@mohawksoft.com writes:

> The basic problem with a fixed sample is that it assumes a normal
> distribution. 

That's sort of true, but not in the way you think it is.

What's under discussion here, w.r.t. the histograms and the validity of
sampling, is really basic statistics. You don't have to be a math major; in
fact math majors tend to avoid the classes that teach this because they're
full of social sciences students :)

It doesn't matter what distribution the variable you're sampling has. In fact,
if you think of most applications of sampling, whether a pre-election opinion
poll or a survey about how often people brush their teeth, you very rarely
find a real normal distribution. Usually the whole point of the survey is to
determine what the distribution is. That's exactly what Postgres is recording
with its histograms.
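That point can be checked with a quick simulation. This is an illustrative sketch only (the function names are made up, not Postgres internals): an equi-depth histogram built from a modest sample tracks one built from the whole column, even for a badly skewed distribution.

```python
import random

# Illustrative sketch, not Postgres code: build a skewed "column" of values,
# then compare equi-depth histogram boundaries computed from a small sample
# against boundaries computed from the whole table.
random.seed(42)
population = [int(random.expovariate(0.01)) for _ in range(100_000)]

def equi_depth_bounds(values, buckets=10):
    """Bucket boundaries such that each bucket holds ~equal row counts."""
    s = sorted(values)
    step = len(s) / buckets
    return [s[int(i * step)] for i in range(1, buckets)]

full_bounds = equi_depth_bounds(population)
sample_bounds = equi_depth_bounds(random.sample(population, 3000))

# The sampled boundaries track the full-table boundaries closely even
# though the underlying distribution is nowhere near normal.
for f, smp in zip(full_bounds, sample_bounds):
    print(f, smp)
```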

Where the normal curve comes into play is in measuring the likelihood of the
actual population distribution varying from the sample distribution. The most
likely population distribution is exactly equal to the sample distribution.
Distributions slightly different from the sample distribution are slightly
less likely, and the further removed from the sample distribution, the less
and less likely they become. This drops off in precisely the shape of a
normal curve.

For large populations this is *always* true, provably; it doesn't matter what
distribution the raw data or your sample has. The likelihood of the sample
diverging from the population will look like a normal curve.
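This is the central limit theorem, and it's easy to verify empirically. A minimal sketch in plain Python (nothing Postgres-specific assumed): draw repeated samples from a heavily skewed population and watch the sample means come out normally distributed.

```python
import random
import statistics

# Sketch of the claim: whatever shape the raw data has, the error of a
# sample estimate is approximately normally distributed.
random.seed(1)
# An exponential population: heavily skewed, nothing like a normal curve.
population = [random.expovariate(1.0) for _ in range(200_000)]

# Draw many independent samples and record each sample's mean.
means = [statistics.fmean(random.sample(population, 400)) for _ in range(2000)]

# For a normal curve, about 68% of values fall within one stddev of center.
mu = statistics.fmean(means)
sd = statistics.stdev(means)
within_1sd = sum(abs(m - mu) <= sd for m in means) / len(means)
print(f"mean of means={mu:.3f}  sd={sd:.4f}  fraction within 1 sd={within_1sd:.2f}")
```

The fraction within one standard deviation comes out near the normal curve's 68%, even though the raw data is exponential.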

The stddev of this normal curve represents how precisely you can describe the
distribution of the population based on your sample. The smaller the sample
the larger the stddev of this normal curve, and the larger the probability
that the population diverges from the sample you drew.

To calculate the stddev there's a complex formula that most scientific
calculators have preprogrammed. This function does *not* depend on the
distribution of the raw data (which you don't know in any case).
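For the common case of estimating a proportion (say, the fraction of rows matching a predicate), the preprogrammed formula in question is the standard error of a sample proportion. A sketch with a hypothetical function name, not code from any calculator or from Postgres:

```python
import math

def stderr_proportion(p, n):
    """Standard error of an observed proportion p from a sample of size n
    (large-population form, no finite-population correction)."""
    return math.sqrt(p * (1 - p) / n)

# A 95% confidence interval is roughly +/- 1.96 standard errors.
# Worst case is p = 0.5, which is what poll fine print assumes.
p, n = 0.5, 600
moe = 1.96 * stderr_proportion(p, n)
print(f"n={n}: 95% margin of error is about +/-{moe:.1%}")
```

As a side note, this formula gives about ±4% at n = 600; the classic "±3%, 19 times out of 20" fine print actually corresponds to roughly n = 1,067.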

There are two versions of this function though: one for "small" populations,
where the sample is a significant percentage of the population, and one for
"large" populations. As you point out, if you're sampling 50% of the entire
population then yes, you can have more confidence in your results than if
you're sampling .002% of it.

However, once you're talking about populations like an entire state or country,
it's simpler to just calculate the confidence as if the population were
"large". In that case it no longer matters whether the sample is .002% or
.0002% of the population. If it's 600 people then you have the "+/- 3%, 19
times out of 20" confidence interval that all those opinion polls you see put
in their fine print. To achieve that they only have to poll 600 people,
whether they're polling a city, a state, or the entire country.
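The "small" versus "large" distinction is the textbook finite-population correction, and it's easy to watch it vanish numerically. A sketch with hypothetical names (standard formula, not Postgres code):

```python
import math

def stderr_with_fpc(p, n, N):
    """Standard error of a sample proportion with the finite-population
    correction applied: population size N, sample size n."""
    se = math.sqrt(p * (1 - p) / n)        # large-population standard error
    fpc = math.sqrt((N - n) / (N - 1))     # finite-population correction
    return se * fpc

n, p = 600, 0.5
for N in (1_200, 30_000, 1_000_000, 100_000_000):  # town -> country scale
    print(f"N={N:>11,}: SE = {stderr_with_fpc(p, n, N):.5f}")
# Sampling 50% of a tiny population tightens the estimate noticeably, but
# between N = 1e6 and N = 1e8 the standard error is essentially unchanged.
```

Sampling half of a 1,200-person town shrinks the error by about 30%; past a million people, the population size stops mattering at all, which is exactly the point about .002% versus .0002%.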

> When you look at a sine wave on an oscilloscope, you can see it clear as
> day. When you look at music on the scope, you know there are many waves
> there, but it is difficult to make heads or tails of it. (use xmms or
> winamp to see for yourself) The waves change in frequency, amplitude, and
> duration over a very large scale. That's why you use a spectrum analyzer
> to go from time domain to frequency domain. In frequency domain, you can
> see the trends better.

That's not a bad analogy to many problems where you're measuring data that has
non-randomness in it that isn't visible in the domain in which the statistics
are being gathered. This seems to happen a lot with geographic data, for
instance.

If you find that increasing the stats target improves things, then this isn't
what's happening in your case. If you find that it doesn't, then what's really
needed is a cleverer set of statistics to look for.

-- 
greg


