Re: Query optimizer 8.0.1 (and 8.0) - Mailing list pgsql-hackers

From Greg Stark
Subject Re: Query optimizer 8.0.1 (and 8.0)
Date
Msg-id 87vf94gfb0.fsf@stark.xeocode.com
Whole thread Raw
In response to Re: Query optimizer 8.0.1 (and 8.0)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Query optimizer 8.0.1 (and 8.0)
List pgsql-hackers
Tom Lane <tgl@sss.pgh.pa.us> writes:

> The papers that I looked at say that this rule has a good solid
> statistical foundation, at least for the case of estimating histograms.

Earlier I was discussing the issue of how to measure cross-column
"correlation" with someone who works on precisely these types of problems. He
immediately honed in on precisely this issue of what kinds of deductions can
be extrapolated from a constant sample like this and what kind of deductions
would require a linear sample, or worse: can't be reasonably extrapolated at
all.

After the discussion with him I'm left with a distinctly uneasy feeling about
some of Postgres's stats. I was planning to investigate further before I wrote
about this though, since I'm not sure exactly where they all stand. The
histograms are indeed on solid footing, but I'm not so certain that's true for
some of the other bits of statistics.

The statistical foundation for histograms is the traditional sampling from a
population that supports every opinion poll you see in the news. Those opinion
polls only need a fixed size sample to obtain a given confidence interval for
almost any population size. Postgres's histograms are just describing the
distribution in a similar way.

However for discrete values like the "top ten most common values" and the
"total number of distinct values" it's not so clear at all that you can
extrapolate from a sample at all. And it's certainly not clear that a fixed
size sample gives you at all the same confidence for a large population as it
does for a small population.

If you sampled across the country and found your sample of 600 people had 4
different top choices for president, how do you extrapolate that to calculate
the total number of top choices for president the 300 million+ people will
have across the country? You could multiply by 500k but that's not going to be
right. Hell you're probably better off just going with "4" but that's clearly
not right either.

The reality is that your sample hasn't really given you *any* information on
how frequently you'll find outliers in the distribution. There are probably a
couple hundred or a couple thousand values that occur only once or twice, but
the sample size you would have to use to get a good estimate of how many there
were will be proportional to the entire population, *unlike the histograms*.

The essential point here is that to count "discrete values" you only care
whether they occur or not at all, not how many times they occur. So you lose
the benefit of being able to model the population as "large", the values as
continuous, and the distribution as a normal curve. So you can't really talk
about a 95% confidence interval any more and every outlier becomes as
important as the more common values.

-- 
greg



pgsql-hackers by date:

Previous
From: pgsql@mohawksoft.com
Date:
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Next
From: Pailloncy Jean-Gerard
Date:
Subject: Concurrent wait-lock