Re: ANALYZE sampling is too good - Mailing list pgsql-hackers

From Mark Kirkwood
Subject Re: ANALYZE sampling is too good
Date
Msg-id 52A4F084.3070200@catalyst.net.nz
Whole thread Raw
In response to Re: ANALYZE sampling is too good  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers
On 09/12/13 08:03, Josh Berkus wrote:
>
> So there's a set of math designed to calculate for the skew introduced
> by reading *all* of the rows in each block.  That's what I meant by
> "block-based sampling"; you read, say, 400 pages, you compile statistics
> on *all* of the rows on those pages, you apply some algorithms to adjust
> for groupings of rows based on how grouped they are.  And you have a
> pretty good estimate of how grouped they are, because you just looked a
> complete sets of rows on a bunch of nonadjacent pages.
>
> Obviously, you need to look at more rows than you would with a
> pure-random sample.  Like I said, the 80%+ accurate point in the papers
> seemed to be at a 5% sample.  However, since those rows come from the
> same pages, the cost of looking at more rows is quite small, compared to
> the cost of looking at 64 times as many disk pages.
>
> My ACM subscription has lapsed, though; someone with a current ACM
> subscription could search for this; there are several published papers,
> with math and pseudocode.
>

This makes more sense! Unfortunately you had confused everyone (well me 
at least) by decreeing that we needed block based sampling - when we 
started using that in 2004 - there is more than one way to sample blocks 
it seems :-)

What you are actually suggesting is a way to do block based sampling 
that requires reading fewer blocks, and that is a great idea! Of course 
as Greg has just suggested - the details are gonna be the clincher. Can 
you supply authors or titles of papers?

Also with reference to Greenplum - their use of postgres is fairly 
special case, and an approach that works well for them is not always 
suitable for more general purpose use.

Regards

Mark



pgsql-hackers by date:

Previous
From: Greg Stark
Date:
Subject: Re: ANALYZE sampling is too good
Next
From: Jim Nasby
Date:
Subject: Re: Why we are going to have to go DirectIO