Re: ANALYZE sampling is too good - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: ANALYZE sampling is too good
Date
Msg-id 52A86266.9020700@vmware.com
In response to Re: ANALYZE sampling is too good  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
On 12/11/2013 02:08 PM, Greg Stark wrote:
> On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <stark@mit.edu> wrote:
>> I'm not actually sure there is any systemic bias here. The larger
>> number of rows per block generate less precise results but from my
>> thought experiments they seem to still be accurate?
>
> So I've done some empirical tests for a table generated by:
> create table sizeskew as (select i,j,repeat('i',i) from
> generate_series(1,1000) as i, generate_series(1,1000) as j);
>
> I find that using the whole block doesn't cause any problem with the
> avg_width field for the "repeat" column. That does reinforce my belief
> that we might not need any particularly black magic here.

How large a sample did you use? Remember that the point of doing 
block-level sampling instead of the current approach would be to allow 
using a significantly smaller sample (in # of blocks), and still achieve 
the same sampling error. If the sample is "large enough", it will mask 
any systemic bias caused by block-sampling, but the point is to reduce 
the number of sampled blocks.

The practical question here is this: What happens to the quality of the 
statistics if you read only 1/2 as many blocks as you normally would, 
but include all the rows from the blocks you read in the sample? 
How about 1/10?
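
One rough way to put numbers on that is a toy simulation outside the
server. The sketch below is Python, not anything from analyze.c, and
every name in it is made up for illustration. It assumes 8192-byte
blocks with no tuple or page overhead, a scaled-down version of the
sizeskew table (widths 1..400, 200 rows each), and a plain ratio
estimator (total width / total rows) over the sampled blocks. The
row-level baseline draws rows with replacement and relies on every
width being equally common in this table.

```python
import random

BLOCK_BYTES = 8192  # assumed usable bytes per block; ignores tuple headers


def build_blocks(max_width=400, copies=200):
    """Pack rows in insertion order; return (row_count, total_width) per block."""
    blocks = []
    count = total = 0
    for width in range(1, max_width + 1):
        for _ in range(copies):
            if count and total + width > BLOCK_BYTES:
                blocks.append((count, total))
                count = total = 0
            count += 1
            total += width
    if count:
        blocks.append((count, total))
    return blocks


def block_sample_avg_width(blocks, n_blocks, rng):
    """Average width over every row in n_blocks randomly chosen blocks."""
    chosen = rng.sample(blocks, n_blocks)
    rows = sum(c for c, _ in chosen)
    width = sum(t for _, t in chosen)
    return width / rows


def row_sample_avg_width(max_width, n_rows, rng):
    """Baseline: n_rows rows drawn uniformly, roughly one block read per row."""
    return sum(rng.randint(1, max_width) for _ in range(n_rows)) / n_rows


def rmse(estimator, true_avg, trials=300, seed=0):
    """Root-mean-square error of estimator(rng) over repeated trials."""
    rng = random.Random(seed)
    squared = [(estimator(rng) - true_avg) ** 2 for _ in range(trials)]
    return (sum(squared) / trials) ** 0.5


def main():
    blocks = build_blocks()
    true_avg = sum(t for _, t in blocks) / sum(c for c, _ in blocks)
    base = rmse(lambda r: row_sample_avg_width(400, 300, r), true_avg)
    print("row-level, 300 rows:", round(base, 2))
    for n in (300, 150, 30):  # same, 1/2 and 1/10 the number of blocks
        err = rmse(lambda r: block_sample_avg_width(blocks, n, r), true_avg)
        print("block-level,", n, "blocks:", round(err, 2))


if __name__ == "__main__":
    main()
```

Comparing the block-level lines for 150 and 30 blocks against the
row-level line shows how much sampling error, if any, is bought back by
keeping every row. It says nothing about real heap pages or other data
distributions.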

Or to put it another way: could we achieve more accurate statistics by 
including all rows from the sampled blocks, while reading the same number 
of blocks? In particular, I wonder if it would help with estimating 
ndistinct. It generally helps to have a larger sample for ndistinct 
estimation, so it might be beneficial.
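
To illustrate the sample-size effect on ndistinct, here is a minimal
Python sketch of the Haas-Stokes estimator, n*d / (n - f1 + f1*n/N),
where n is the sample size, N the total row count, d the number of
distinct values in the sample and f1 the number of values seen exactly
once. The function name and the synthetic column are assumptions made
for the example, and the samples are uniform row samples, so this does
not show what happens when the rows come from whole blocks.

```python
import random
from collections import Counter


def haas_stokes_ndistinct(sample, total_rows):
    """Estimate the number of distinct values in the table from a row sample."""
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)                                   # distinct values seen
    f1 = sum(1 for c in counts.values() if c == 1)    # values seen exactly once
    return n * d / (n - f1 + f1 * n / total_rows)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical column: 100,000 rows holding 1,000 distinct values.
    column = [i % 1000 for i in range(100_000)]
    for n in (300, 3000, 30000):
        sample = rng.sample(column, n)
        print(n, round(haas_stokes_ndistinct(sample, len(column))))
```

Running it with increasing n shows how the estimate moves as the sample
grows for this particular column.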

- Heikki


