Re: query slows down with more accurate stats - Mailing list pgsql-performance

From Tom Lane
Subject Re: query slows down with more accurate stats
Date
Msg-id 29060.1082126089@sss.pgh.pa.us
Whole thread Raw
In response to Re: query slows down with more accurate stats  (Manfred Koizar <mkoi-pg@aon.at>)
Responses Re: query slows down with more accurate stats
List pgsql-performance
Manfred Koizar <mkoi-pg@aon.at> writes:
> If the number of pages is B and the sample size is n, a perfect sampling
> method collects a sample where all tuples come from different pages with
> probability (in OpenOffice.org syntax):
>     p = prod from{i = 0} to{n - 1} {{c(B - i)}  over {cB - i}}

So?  You haven't proven that either sampling method fails to do the
same.

The desired property can also be phrased as "every tuple should be
equally likely to be included in the final sample".  What we actually
have in the case of your revised algorithm is "every page is equally
likely to be sampled, and of the pages included in the sample, every
tuple is equally likely to be chosen".  Given that there are B total
pages of which we sample b pages that happen to contain T tuples (in any
distribution), the probability that a particular tuple gets chosen is
    (b/B) * (n/T)
assuming that the two selection steps are independent and unbiased.

Now b, B, and n are not dependent on which tuple we are talking about.
You could argue that a tuple on a heavily populated page is
statistically likely to see a higher T when it's part of the page sample
pool than a tuple on a near-empty page is likely to see, and therefore
there is some bias against selection of the former tuple.  But given a
sample over a reasonably large number of pages, the contribution of any
one page to T should be fairly small and so this effect ought to be
small.  In fact, because T directly determines our estimate of the total
number of tuples in the relation, your experiments showing that the new
method gives a reliable tuple count estimate directly prove that T is
pretty stable regardless of exactly which pages get included in the
sample.  So I think this method is effectively unbiased at the tuple
level.  The variation in probability of selection of individual tuples
can be no worse than the variation in the overall tuple count estimate.

            regards, tom lane

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: RESOLVED: Re: Wierd context-switching issue on Xeon
Next
From: "Jim C. Nasby"
Date:
Subject: Poor performance of group by query