Re: Query optimizer 8.0.1 (and 8.0) - Mailing list pgsql-hackers
From | Greg Stark |
---|---|
Subject | Re: Query optimizer 8.0.1 (and 8.0) |
Date | |
Msg-id | 87vf94gfb0.fsf@stark.xeocode.com |
In response to | Re: Query optimizer 8.0.1 (and 8.0) (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Query optimizer 8.0.1 (and 8.0) |
List | pgsql-hackers |
Tom Lane <tgl@sss.pgh.pa.us> writes:

> The papers that I looked at say that this rule has a good solid
> statistical foundation, at least for the case of estimating histograms.

Earlier I was discussing the issue of how to measure cross-column "correlation" with someone who works on precisely these types of problems. He immediately homed in on this issue: which kinds of deductions can be extrapolated from a constant-size sample like this, which kinds require a sample that grows linearly with the population, and which, worse, can't reasonably be extrapolated at all.

After that discussion I'm left with a distinctly uneasy feeling about some of Postgres's stats. I was planning to investigate further before writing about this, though, since I'm not sure exactly where they all stand. The histograms are indeed on solid footing, but I'm not so certain that's true of some of the other statistics.

The statistical foundation for histograms is traditional sampling from a population, the same thing that supports every opinion poll you see in the news. Those opinion polls need only a fixed-size sample to obtain a given confidence interval for almost any population size. Postgres's histograms describe the distribution in just that way.

However, for discrete statistics like the "top ten most common values" and the "total number of distinct values", it's not clear that you can extrapolate from a sample at all. And it's certainly not clear that a fixed-size sample gives you anything like the same confidence for a large population as it does for a small one.

If you sampled across the country and found that your sample of 600 people had 4 different top choices for president, how would you extrapolate the total number of top choices the 300 million+ people across the country will have? You could multiply by 500k (the ratio of the population size to the sample size, 300 million / 600), but that's not going to be right. Hell, you're probably better off just going with "4", but that's clearly not right either. The reality is that your sample hasn't given you *any* information about how frequently you'll find outliers in the distribution. There are probably a couple hundred or a couple thousand values that occur only once or twice, but the sample size you would need to get a good estimate of how many there are is proportional to the entire population, *unlike the histograms*.

The essential point is that to count discrete values you only care whether each value occurs at all, not how many times it occurs. So you lose the benefit of being able to model the population as "large", the values as continuous, and the distribution as a normal curve. You can't really talk about a 95% confidence interval any more, and every outlier becomes as important as the more common values.

-- 
greg
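[Editor's note: the two sketches below are illustrations added for this archive, not part of the original post.]

The first half of the argument, that a fixed-size sample is enough for histogram-style estimates, can be checked with a minimal simulation sketch (assuming Python with only the standard library; all names here are illustrative). It estimates a population median from a 600-row sample and shows the error staying roughly flat as the population grows:

```python
# Minimal sketch: sample-quantile error depends on the sample size,
# not the population size -- the property histograms rely on.
import random

random.seed(42)

def sample_median_error(population_size, sample_size=600):
    # A skewed synthetic population, standing in for a real column.
    population = [random.expovariate(1.0) for _ in range(population_size)]
    true_median = sorted(population)[population_size // 2]
    sample = random.sample(population, sample_size)
    est_median = sorted(sample)[sample_size // 2]
    return abs(est_median - true_median)

for n in (10_000, 100_000, 1_000_000):
    print(f"population={n:>9,}  |median error|={sample_median_error(n):.4f}")

# The error stays roughly constant as the population grows 100x,
# because quantile estimates converge at a rate set by the sample alone.
```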
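The second half, that distinct-value counts do *not* extrapolate from a fixed-size sample, can be illustrated the same way. This sketch (again a hypothetical construction, not Postgres's actual estimator) builds a population with a few very common values plus many values that occur exactly once, the outliers the post describes, and counts how many distinct values a 600-row sample actually sees:

```python
# Minimal sketch: a fixed-size sample sees roughly the same number of
# distinct values whether the true count is ~1,000 or ~100,000.
import random

random.seed(42)

for n in (10_000, 100_000, 1_000_000):
    # 90% of rows use 10 common values; 10% are values occurring once.
    common = [v for v in range(10) for _ in range(9 * n // 100)]
    rare = list(range(10, 10 + n // 10))
    population = common + rare
    true_distinct = 10 + n // 10
    seen = len(set(random.sample(population, 600)))
    print(f"population={n:>9,}  true distinct={true_distinct:>7,}  "
          f"seen in sample={seen}")

# The sample catches the 10 common values plus only the ~60 rare rows
# it happened to draw, so no constant scale-up factor recovers the
# true count across population sizes.
```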