On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
> Well, there may be cases where the quality of the sample isn't
> terribly important, it just needs to be "reasonable."
>
> I browsed an article on the SYSTEM/BERNOULLI representations; they
> both amount to simple picks of tuples.
>
> - BERNOULLI implies picking tuples with a specified probability.
>
> - SYSTEM implies picking pages with a specified probability. (I think
> we mess with this in ways that'll be fairly biased in view that tuples
> mayn't be of uniform size, particularly if Slightly Smaller strings
> stay in the main pages, whilst Slightly Larger strings get TOASTed...)
Dealing with non uniform sizes isn't too hard. analyze.c already does
that. Given a table with B blocks it takes a uniform sample of b
blocks, and runs Vitter's reservoir sampling algorithm over the
resulting blocks to get s random tuples in a single pass. It's quite
easy to prove that this results in each tuple having an equal
probability to appear in the final table. However, it isn't fully
independent sampling - depending on the value of b compared to s and
B, there is a slightly higher probability to see multiple tuples
picked from one page. I'm too lazy to do the math, but for the analyze
case of b = s it probably isn't significant for most practical
purposes, even if B is really large. And it seems to me that when b >=
s the reservoir sampling thresholds could be tweaked at block
boundaries to even out the dependencies.
The ratio of b to s could be tweaked to get lower quality sampling
(samples are more spatially clumped) in exchange for less random I/O.
> Possibly the forms of sampling that people *actually* need, most of
> the time, are more like Dollar Unit Sampling, which are pretty
> deterministic, in ways that mandate that they be rather expensive
> (e.g. - guaranteeing Seq Scan).
I have a gut feeling that Dollar Unit Sampling and other weighted
samples can be done with a similar approach of uniformly sampling
blocks and then running weighted reservoir sampling [1] over the
result.
[1]: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
Cheers,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de