Re: ANALYZE sampling is too good - Mailing list pgsql-hackers

From Greg Stark
Subject Re: ANALYZE sampling is too good
Msg-id CAM-w4HPWSnGFgk+o7sR3Y76sywp3kX9EPj6-jn28qc4p5bJ8tQ@mail.gmail.com
In response to Re: ANALYZE sampling is too good  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: ANALYZE sampling is too good
List pgsql-hackers
On Wed, Dec 11, 2013 at 12:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> Yes, it is not a perfect statistical sample. All sampling is subject
> to an error that is data dependent.

Well, there's random variation due to the limitations of dealing with a
sample. And then there are systematic biases due to incorrect algorithms.
You wouldn't be happy if the samples discarded every row with NULLs or
every row older than some date, etc. These things would not be
corrected by larger samples. That's the kind of "error" we're talking
about here.

But the more I think about things, the less convinced I am that there
is a systematic bias introduced by reading the entire block. I had
assumed larger rows would be selected against, but that's not really
true. They're just selected against relative to the number of bytes
they occupy, which is the correct frequency to sample.

Even blocks that are mostly empty don't really bias things. Picture a
table that consists of 100 blocks with 100 rows each (value A) and
another 100 blocks with only 1 row each (value B). A sampled block has
a 50% chance of being a B block, which is grossly inflated compared to
B's roughly 1% share of the rows. However, each A block selected will
produce 100 rows. So if you sample 10 blocks of each kind you'll get
100x10 rows of A and 1x10 rows of B, which is the correct proportion.
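
The thought experiment above can be checked with a small simulation. This is only a sketch under stated assumptions: the table layout is the one described above, blocks are picked uniformly without replacement, and the names (TABLE, sample_block_rows, pooled_b_fraction) are made up for illustration and are not PostgreSQL code. It pools the row counts over many trials, so it checks only that the expected counts of A and B rows are in the right proportion.

```python
import random

# Table from the thought experiment: 100 blocks of 100 "A" rows,
# then 100 blocks of a single "B" row. Each block is (value, row_count).
TABLE = [("A", 100)] * 100 + [("B", 1)] * 100

def sample_block_rows(n_blocks, rng):
    """Pick n_blocks distinct blocks uniformly; return rows seen per value."""
    counts = {"A": 0, "B": 0}
    for value, row_count in rng.sample(TABLE, n_blocks):
        counts[value] += row_count  # reading a block keeps every row in it
    return counts

def pooled_b_fraction(n_blocks=10, trials=5000, seed=0):
    """Fraction of B rows among all rows collected over many trials."""
    rng = random.Random(seed)
    total = {"A": 0, "B": 0}
    for _ in range(trials):
        counts = sample_block_rows(n_blocks, rng)
        total["A"] += counts["A"]
        total["B"] += counts["B"]
    return total["B"] / (total["A"] + total["B"])

# 100 B rows out of 10,100 rows in the whole table.
TRUE_B_FRACTION = 100 / (100 * 100 + 100 * 1)

if __name__ == "__main__":
    print(f"true B fraction:   {TRUE_B_FRACTION:.4f}")
    print(f"pooled B fraction: {pooled_b_fraction():.4f}")
```

Note that the B fraction of any single 10-block sample can be far from this value, since one extra A block adds 100 rows at once. The sketch says nothing about that spread, only about the long-run proportion.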

I'm not actually sure there is any systematic bias here. A larger
number of rows per block generates less precise results, but from my
thought experiments they still seem to be accurate?
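
One way to see the difference between "less precise" and "still accurate" is to compare whole-block sampling with row-level sampling of the same number of rows. The sketch below assumes a hypothetical worst case that is not from this thread: every block has 50 rows, each block holds a single value, and 20% of the blocks are B. All names are illustrative. Under those assumptions both estimates should average out near the true 20%, while the block-based one should scatter much more widely.

```python
import random
import statistics

ROWS_PER_BLOCK = 50
N_BLOCKS = 200
N_B_BLOCKS = 40  # hypothetical: 20% of blocks hold only "B" rows

# One value per block (fully clustered), and the same table row by row.
BLOCK_VALUES = ["B"] * N_B_BLOCKS + ["A"] * (N_BLOCKS - N_B_BLOCKS)
ROWS = [value for value in BLOCK_VALUES for _ in range(ROWS_PER_BLOCK)]
TRUE_B_FRACTION = N_B_BLOCKS / N_BLOCKS

def block_sample_estimate(n_blocks, rng):
    """Read n_blocks whole blocks; all blocks are the same size, so the
    B fraction of the rows equals the B fraction of the blocks."""
    picked = rng.sample(BLOCK_VALUES, n_blocks)
    return picked.count("B") / n_blocks

def row_sample_estimate(n_rows, rng):
    """Pick n_rows individual rows uniformly without replacement."""
    picked = rng.sample(ROWS, n_rows)
    return picked.count("B") / n_rows

def compare(trials=2000, n_blocks=10, seed=0):
    """Return (mean, stdev) of the B-fraction estimate for each method."""
    rng = random.Random(seed)
    n_rows = n_blocks * ROWS_PER_BLOCK  # same number of rows for both
    block = [block_sample_estimate(n_blocks, rng) for _ in range(trials)]
    row = [row_sample_estimate(n_rows, rng) for _ in range(trials)]
    return {
        "block": (statistics.mean(block), statistics.stdev(block)),
        "row": (statistics.mean(row), statistics.stdev(row)),
    }

if __name__ == "__main__":
    for method, (mean, spread) in compare().items():
        print(f"{method:5s} mean={mean:.3f} stdev={spread:.3f}")
```

Real tables are rarely this clustered, so the gap in spread here is closer to an upper bound than to a typical case.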


