On Wed, Oct 12, 2011 at 2:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think it's overkill, and possibly unpleasantly unstable as well.
> For the initial attack on this we should just have VACUUM and ANALYZE
> count the number of all-visible blocks and store that in pg_class along
> with the tuple-count statistics. There's no reason to think that this
> will be any worse than the way we deal with dead tuple counts, for
> instance.
So I have a theory.
Assuming you're in a steady-state situation, the number of all-visible
blocks will fluctuate from a high just after vacuum to a low just
before the next vacuum. There are other ways a block can be marked
all-visible, but for the most part I would expect the fraction to fall
steadily until vacuum comes along and cleans things up.
So if vacuum tracked the fraction of blocks marked all-visible
*before* it processed them and the fraction marked all-visible after
processing, we would have a lower and an upper bound. If we knew how
long it had been since the last vacuum we could interpolate between
those, or we could just take the mean, or we could take the lower
bound as a conservative estimate.
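A sketch of the interpolation I have in mind (purely hypothetical; the
parameter names and the notion of an "expected vacuum interval" are my
invention, not anything that exists today):

#include <math.h>

/*
 * Hypothetical sketch: interpolate the expected all-visible fraction
 * between the value vacuum observed just before it ran (the low-water
 * mark) and the value just after (the high-water mark), based on how far
 * we think we are into the current vacuum cycle.
 */
static double
estimate_allvisible_fraction(double frac_before_vacuum,
                             double frac_after_vacuum,
                             double secs_since_vacuum,
                             double expected_vacuum_interval_secs)
{
    double      progress;

    /* With no idea of the cycle length, fall back to the conservative bound. */
    if (expected_vacuum_interval_secs <= 0)
        return fmin(frac_before_vacuum, frac_after_vacuum);

    /* Fraction of the vacuum cycle already elapsed, clamped to [0, 1]. */
    progress = secs_since_vacuum / expected_vacuum_interval_secs;
    progress = fmax(0.0, fmin(1.0, progress));

    /* Decay linearly from the post-vacuum high toward the pre-vacuum low. */
    return frac_after_vacuum +
           progress * (frac_before_vacuum - frac_after_vacuum);
}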
> What I suggest as a first cut for that is: simply derate the visibility
> fraction as the fraction of the table expected to be scanned gets smaller.
I think there's a statistically more rigorous way of accomplishing the
same thing. If you treat the pages we estimate we're going to read as
a random sample of the population of pages, then the expected value of
the all-visible fraction in that sample is the fraction of the overall
population that is all-visible, and the 95% confidence interval around
it is, uh, a simple formula we can compute but I don't recall off-hand.
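If I'm remembering it right it's just the usual normal approximation
for a sample proportion, with a finite-population correction since
we're sampling pages without replacement; something along these lines
(take the exact form with a grain of salt):

#include <math.h>

/*
 * Half-width of the ~95% confidence interval for the all-visible fraction
 * among the n pages we expect to read, given a table-wide fraction p and
 * N total pages.  Assumes the normal approximation to the binomial plus a
 * finite-population correction.
 */
static double
allvisible_ci_halfwidth(double p, double n, double N)
{
    double      variance = p * (1.0 - p) / n;

    if (N > 1)
        variance *= (N - n) / (N - 1);  /* finite-population correction */

    return 1.96 * sqrt(variance);       /* z value for 95% confidence */
}

A conservative planner could then use p minus that half-width (clamped
at zero), which gives exactly the behaviour you describe: the smaller
the fraction of the table we expect to scan, the wider the interval
and the more the visibility fraction gets derated.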
This gets back to a long-ago discussion about which estimates the
planner should be using. It currently uses expected values throughout,
but in many cases it would be valuable if the planner also knew the
standard deviation of those estimates. It might sometimes be better to
pick a plan that's slightly worse on average but is less likely to be
much worse.
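To make that concrete, the kind of comparison I'm picturing looks
something like this (entirely hypothetical; the planner carries no
such per-path variance today):

#include <stdbool.h>

/* Hypothetical cost estimate carrying a variance term alongside the mean. */
typedef struct CostEstimate
{
    double      expected;       /* expected total cost */
    double      stddev;         /* standard deviation of that estimate */
} CostEstimate;

/* Penalize high-variance plans instead of comparing expected costs alone. */
static double
risk_adjusted_cost(const CostEstimate *c, double risk_aversion)
{
    return c->expected + risk_aversion * c->stddev;
}

static bool
plan_a_beats_plan_b(const CostEstimate *a, const CostEstimate *b)
{
    /* risk_aversion = 0 reproduces the current expected-value behaviour */
    const double risk_aversion = 1.0;

    return risk_adjusted_cost(a, risk_aversion) <
           risk_adjusted_cost(b, risk_aversion);
}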
--
greg