Re: Improving N-Distinct estimation by ANALYZE - Mailing list pgsql-hackers

From: Greg Stark
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date:
Msg-id: 871wzl50wm.fsf@stark.xeocode.com
In response to: Re: Improving N-Distinct estimation by ANALYZE (Josh Berkus <josh@agliodbs.com>)
Responses: Re: Improving N-Distinct estimation by ANALYZE
List: pgsql-hackers
Josh Berkus <josh@agliodbs.com> writes:

> Greg,
> 
> > We *currently* use a block based sampling algorithm that addresses this
> > issue by taking care to select rows within the selected blocks in an
> > unbiased way. You were proposing reading *all* the records from the
> > selected blocks, which throws away that feature.
> 
> The block-based algorithms have specific math to cope with this.   Stuff 
> which is better grounded in statistical analysis than our code.   Please 
> read the papers before you judge the solution.

I'm confused, since Postgres's current setup *was* based on papers on just this
topic. I haven't read the papers myself; I'm just repeating what was discussed
when the current block sampling code went in. There was extensive discussion of
the pros and cons of different algorithms taken from various papers and how
they related to Postgres. I don't recall any of them suggesting there was a way
to take a sample that included every row from a sampled block and then somehow
unbias the sample after the fact.
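
For what it's worth, here's a toy sketch of the kind of skew I mean (mine, not
from any of the papers, using a deliberately clustered made-up table):

    # Toy sketch, not Postgres code: compare "every row from a few random
    # blocks" against the same number of rows sampled uniformly, on a column
    # whose values are physically clustered (e.g. data loaded in key order).
    import random

    ROWS_PER_BLOCK = 100
    N_BLOCKS = 10000
    # Five consecutive blocks share one value, so 2000 distinct values total.
    table = [[b // 5] * ROWS_PER_BLOCK for b in range(N_BLOCKS)]
    SAMPLE_ROWS = 30000

    # Whole-block sample: all rows from enough random blocks to hit the target.
    blocks = random.sample(range(N_BLOCKS), SAMPLE_ROWS // ROWS_PER_BLOCK)
    block_sample = [v for b in blocks for v in table[b]]

    # Row sample: the same number of rows drawn uniformly from the whole table.
    row_sample = [table[random.randrange(N_BLOCKS)][random.randrange(ROWS_PER_BLOCK)]
                  for _ in range(SAMPLE_ROWS)]

    print("distinct values in table:      ", 2000)
    print("distinct seen in block sample: ", len(set(block_sample)))  # roughly 285
    print("distinct seen in row sample:   ", len(set(row_sample)))    # essentially all 2000

The duplicate counts an estimator sees are wildly different between the two
samples, and that's the skew I don't see how you undo after the fact without
the specific math Josh is referring to.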

> Right, which is why researchers are currently looking for something better.  
> The Brutlag & Richardson paper claims to be able to produce estimates which are 
> within +/- 3x 90% of the time using a 5% sample, which is far better than 
> our current accuracy.  Nobody claims to be able to estimate based on < 
> 0.1% of the table, which is what Postgres tries to do for large tables.
> 
> 5% based on block-based sampling is reasonable; that means a straight 5% of 
> the on-disk size of the table, so 5gb for a 100gb table.  With random-row 
> sampling, that would require as much as 25% of the table, making it easier 
> to just scan the whole thing.

Postgres's current sample sizes are clearly geared towards the histograms,
where they are entirely realistic. The distinct estimates are just ad hoc
attempts based on that existing sampling.
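
For reference, my understanding is that what analyze.c applies to that sample
is the Haas-Stokes "Duj1" estimator, something along these lines (a sketch with
my own variable names, not the actual C code):

    # Sketch of the n_distinct estimator I believe analyze.c uses
    # (Haas-Stokes "Duj1"), applied to whatever sample ANALYZE took.
    def estimate_ndistinct(n, N, d, f1):
        """n  = rows in the sample
           N  = rows in the whole table
           d  = distinct values seen in the sample
           f1 = values seen exactly once in the sample"""
        # Duj1: n*d / (n - f1 + f1*n/N)
        return n * d / (n - f1 + f1 * n / N)

    # e.g. a 30000-row sample of a 100M-row table, 2000 distinct, 500 singletons
    print(estimate_ndistinct(30000, 100000000, 2000, 500))  # ~2034, barely above d

Whatever its theoretical merits, it's being fed a sample that was sized for the
histograms, not for estimating distincts.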

Is a mechanism that is only about 5x faster than reading the whole table (a 5%
block sample read at random_page_cost 4 costs roughly 20% of a sequential scan)
and is still off by more than a factor of three 10% of the time really worth it?
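
Spelling out where I get 5x from, using the random_page_cost of 4 assumed above:

    # Back-of-the-envelope: read 5% of the blocks in random order at
    # random_page_cost = 4 versus a sequential scan at cost 1 per block.
    blocks = 100000                     # arbitrary table size in blocks
    random_page_cost = 4.0
    seq_scan_cost = blocks * 1.0
    sample_cost = 0.05 * blocks * random_page_cost
    print(seq_scan_cost / sample_cost)  # 5.0 -- only about 5x cheaper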

-- 
greg


