On 12/11/2013 02:08 PM, Greg Stark wrote:
> On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <stark@mit.edu> wrote:
>> I'm not actually sure there is any systemic bias here. The larger
>> number of rows per block generates less precise results, but from my
>> thought experiments they still seem to be accurate?
>
> So I've done some empirical tests for a table generated by:
> create table sizeskew as (select i,j,repeat('i',i) from
> generate_series(1,1000) as i, generate_series(1,1000) as j);
>
> I find that using the whole block doesn't cause any problem with the
> avg_width field for the "repeat" column. That does reinforce my belief
> that we might not need any particular black magic here.
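
For reference, the avg_width Greg is looking at comes from the pg_stats
view, so after an ANALYZE it can be checked with something like:

  ANALYZE sizeskew;
  SELECT attname, avg_width
    FROM pg_stats
   WHERE tablename = 'sizeskew';
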
How large a sample did you use? Remember that the point of doing
block-level sampling instead of the current approach would be to allow
using a significantly smaller sample (in # of blocks), and still achieve
the same sampling error. If the sample is "large enough", it will mask
any systemic bias caused by block-sampling, but the point is to reduce
the number of sampled blocks.
The practical question here is this: what happens to the quality of the
statistics if you read only 1/2 the number of blocks you normally would,
but include all the rows from the blocks you do read in the sample? How
about 1/10?
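
Purely as an illustration of the difference, and not anything ANALYZE
actually does: with the SQL-standard TABLESAMPLE clause, SYSTEM picks
whole blocks and keeps every row in them, while BERNOULLI picks
individual rows scattered over nearly all blocks, so

  SELECT count(DISTINCT j) FROM sizeskew TABLESAMPLE SYSTEM (10);
  SELECT count(DISTINCT j) FROM sizeskew TABLESAMPLE BERNOULLI (10);

both look at roughly 10% of the rows, but the SYSTEM variant reads only
about 10% of the blocks.
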
Or to put it another way: could we achieve more accurate statistics by
including all rows from the sampled blocks, while reading the same number
of blocks as we do now? In particular, I wonder if it would help with
estimating ndistinct. A larger sample generally helps with ndistinct
estimation, so it might well be beneficial.
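
A crude way to check that on Greg's test table would be to compare the
estimate with the true value after an ANALYZE, something like:

  SELECT n_distinct FROM pg_stats
   WHERE tablename = 'sizeskew' AND attname = 'i';  -- the estimate
  SELECT count(DISTINCT i) FROM sizeskew;           -- the true value, 1000

and see how the estimate behaves when all rows from each sampled block
are included.
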
- Heikki