Re: ANALYZE sampling is too good

From: Josh Berkus
Subject: Re: ANALYZE sampling is too good
Msg-id: 52A11162.8090104@agliodbs.com
In response to: ANALYZE sampling is too good (Greg Stark <stark@mit.edu>)
List: pgsql-hackers
On 12/03/2013 03:30 PM, Greg Stark wrote:
> It means if your table is anywhere up to 240MB you're effectively
> doing a full table scan and then throwing out nearly all the data
> read.

There are lots of issues with our random sampling approach for ANALYZE.
This is why, back in our Greenplum days, Simon proposed changing to a
block-based sampling approach, where we would sample random *pages*
instead of random *rows*.  That would allow us to do things like sample
5% of the table while reading only 5% of it, although we might have to
tinker with OS/filesystem behavior to make sure of that.  In addition to
solving the issue you cite above, it would let us get MUCH more accurate
estimates for very large tables, where currently we sample only about
0.1% of the table.
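
To make the reading-cost point concrete, here's a toy sketch in Python
(made-up names like block_sample; an illustration of the idea, not the
actual proposal): pick random pages and keep every row on them, so I/O
scales with the number of pages read rather than with the rows wanted.

import math
import random

def block_sample(pages, fraction=0.05, seed=0):
    # Sample whole pages: read ~fraction of the table's pages and
    # keep every row on each sampled page.  I/O is proportional to
    # the number of pages read, unlike per-row sampling, which can
    # touch nearly every page of the table.
    rng = random.Random(seed)
    n_read = max(1, math.ceil(len(pages) * fraction))
    picked = rng.sample(range(len(pages)), n_read)
    rows = [row for p in picked for row in pages[p]]
    return rows, n_read

# Toy table: 10,000 pages of 100 rows each.
pages = [[(p, r) for r in range(100)] for p in range(10000)]
rows, pages_read = block_sample(pages)
print(len(rows), "rows from", pages_read, "pages")  # 50000 rows, 500 pages

Per-row sampling of the same 50,000 rows on a 10,000-page table would
touch nearly every page, which is exactly Greg's complaint above.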

There are fairly well-researched algorithms for block-based sampling
which correct for the skew introduced by looking at consecutive rows in
a block.  In general, a minimum sample size of 5% is required, and the
error is no worse than in our current system.  However, the idea was shot
down at the time, partly because I think other hackers didn't get the math.
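
The math is roughly this (again a toy Python sketch with a hypothetical
cluster_estimate; this is the textbook cluster-sampling estimator, not
necessarily the specific algorithms from the literature): compute one
mean per page and take the error bar from the variance *between* page
means, since rows within a page are correlated.

import math
import statistics

def cluster_estimate(sampled_pages):
    # sampled_pages: per-page lists of a numeric column's values.
    # Rows within a page are often correlated (inserted together,
    # sorted, etc.), so the error bar has to come from the variance
    # between page means, with n = pages sampled, not from treating
    # every row as an independent draw.
    page_means = [statistics.fmean(rows) for rows in sampled_pages]
    est = statistics.fmean(page_means)
    se = math.sqrt(statistics.variance(page_means) / len(page_means))
    return est, se

# E.g. three sampled pages of a hypothetical integer column:
print(cluster_estimate([[10, 11, 12], [50, 52, 51], [30, 29, 31]]))

The more clustered the data, the bigger the between-page variance, which
is exactly the skew those papers account for.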

I believe that both Oracle and MSSQL use block-based sampling, but of
course, I don't know which specific algo they use.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


