On 09/12/13 08:03, Josh Berkus wrote:
>
> So there's a set of math designed to calculate for the skew introduced
> by reading *all* of the rows in each block. That's what I meant by
> "block-based sampling"; you read, say, 400 pages, you compile statistics
> on *all* of the rows on those pages, you apply some algorithms to adjust
> for groupings of rows based on how grouped they are. And you have a
> pretty good estimate of how grouped they are, because you just looked at
> complete sets of rows on a bunch of nonadjacent pages.
>
> Obviously, you need to look at more rows than you would with a
> pure-random sample. Like I said, the 80%+ accurate point in the papers
> seemed to be at a 5% sample. However, since those rows come from the
> same pages, the cost of looking at more rows is quite small, compared to
> the cost of looking at 64 times as many disk pages.
>
> My ACM subscription has lapsed, though; someone with a current ACM
> subscription could search for this; there are several published papers,
> with math and pseudocode.
>
This makes more sense! Unfortunately you had confused everyone (well, me
at least) by decreeing that we needed block-based sampling - something we
have been doing since 2004 - there is more than one way to sample blocks,
it seems :-)
What you are actually suggesting is a way to do block-based sampling
that requires reading fewer blocks, and that is a great idea! Of course,
as Greg has just suggested, the details are going to be the clincher. Can
you supply authors or titles of the papers?
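In the meantime, here is a rough sketch (in Python, purely for
illustration) of how I read the proposal: pick a random set of whole
blocks, keep every row on them, and then correct the statistics for how
physically grouped the values turn out to be. The read_block helper and
the particular clustering correction are placeholders of my own, not
anything taken from the papers.

import random

def block_sample(read_block, n_blocks_total, sample_blocks=400):
    # read_block(blkno) is assumed to return the list of values (for the
    # column being analysed) stored on that page; it stands in for the
    # real page-reading code and is not an actual API.
    chosen = random.sample(range(n_blocks_total),
                           min(sample_blocks, n_blocks_total))
    per_block = [read_block(blkno) for blkno in chosen]
    values = [v for block in per_block for v in block]
    return values, per_block

def ndistinct_estimate(values, per_block, n_blocks_total):
    # Crude n_distinct estimate with a clustering correction.  The
    # correction (comparing within-block duplication to duplication
    # across the whole sample) is only a placeholder for the block-aware
    # estimators in the published work.
    sample_distinct = len(set(values))
    total_dup = len(values) - sample_distinct
    within_dup = sum(len(b) - len(set(b)) for b in per_block)
    # clustering near 1.0: duplicates sit inside single blocks, i.e. the
    # values are physically grouped, so unsampled blocks likely hold
    # values we have not seen and the distinct count should be scaled up.
    # clustering near 0.0: the same values recur across blocks, so the
    # sample has probably already seen most of the distinct values.
    clustering = (within_dup / total_dup) if total_dup else 0.0
    scale = n_blocks_total / max(len(per_block), 1)
    return sample_distinct * (1 + (scale - 1) * clustering)

The point about page reads then falls out naturally: the 400 read_block()
calls stay fixed no matter how many rows happen to sit on those pages.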
Also with reference to Greenplum - their use of Postgres is a fairly
special case, and an approach that works well for them is not always
suitable for more general-purpose use.
Regards
Mark