Re: Query optimizer 8.0.1 (and 8.0) - Mailing list pgsql-hackers
From | Greg Stark |
---|---|
Subject | Re: Query optimizer 8.0.1 (and 8.0) |
Date | |
Msg-id | 87pszbhkq9.fsf@stark.xeocode.com Whole thread Raw |
In response to | Re: Query optimizer 8.0.1 (and 8.0) (pgsql@mohawksoft.com) |
Responses |
Re: Query optimizer 8.0.1 (and 8.0)
|
List | pgsql-hackers |
pgsql@mohawksoft.com writes: > The basic problem with a fixed sample is that is assumes a normal > distribution. That's sort of true, but not in the way you think it is. What's under discussion here, w.r.t. the histograms and the validity of sampling is really basic statistics. You don't have to be a math major, in fact math majors tend to avoid the classes that teach this because it's full of social sciences students :) It doesn't matter what distribution the variable you're sampling has. In fact if you think of most applications of sampling, whether a pre-election opinion polls, or a survey about how often people brush their teeth, you very rarely find a real normal distribution. Usually the whole point of the survey is to determine what the distribution is. That's exactly what Postgres is recording with its histograms. Where the normal curve comes into play is in measuring the likelihood of the actual population distribution varying from the sample distribution. The most likely population distribution is exactly equal to the sample distribution. distributions slightly different from the sample distribution are slightly less equal, and the further removed from the sample distribution the less and less likely. This drops off in precisely the shape of a normal curve. For large populations this is *always* true, provably, it doesn't matter what the distribution the raw data or your sample has. The likelihood of the sample diverging from the population will look like a normal curve. The stddev of this normal curve represents how precisely you can describe the distribution of the population based on your sample. The smaller the sample the larger the stddev of this normal curve, and the larger the probability that the population diverges from the sample you drew. To calculate the stddev there's a complex formula that most scientific calculators have preprogrammed. This function does *not* depend on the distribution of the raw data (which you don't know in any case) There are two versions of this function though: one for "small" populations where the sample is a significant percentage of the population. As you point out if you're sampling 50% of the entire population then yes you can have more confidence in your results than .002% of the entire population. However, once you're talking about populations like an entire state or country then it's simpler to just calculate the confidence as if the population is "large". In that case it no longer matters whether the sample is .002% or .0002% of the population. If it's 600 people then you have the "+/- 3% 19 times out of 20" confidence interval that all those opinion polls you see put in their fine print. To achieve that they only have to poll 600 people that whether they're polling a city, a state, or the entire country. > When you look at a sine wave on an oscilloscope, you can see it clear as > day. When you look at music on the scope, you know there are many waves > there, but it is difficult to make heads or tails of it. (use xmms or > winamp to see for yourself) The waves change in frequency, amplitude, and > duration over a very large scale. That's why you use a spectrum analyzer > to go from time domain to frequency domain. In frequency domain, you can > see the trends better. That's not a bad analogy to many problems where you're measuring data that has non-randomness in it but that are not visible in the domain that the statistics that are being analyzed. This seems to happen a lot with geographic data, for instance. If you find that increasing the stats targets improves things then this isn't true. If you find that it doesn't then what's really needed is a cleverer set of statistics to look for. -- greg
pgsql-hackers by date: