Greg,
> We *currently* use a block based sampling algorithm that addresses this
> issue by taking care to select rows within the selected blocks in an
> unbiased way. You were proposing reading *all* the records from the
> selected blocks, which throws away that feature.
The block-based algorithms have specific math to cope with this, math
which is better grounded in statistical analysis than our code is. Please
read the papers before you judge the solution.
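
To make the I/O pattern concrete, here is a minimal sketch (mine, not
taken from any of the papers) of whole-block sampling: pick a fraction of
the blocks at random and read every row in them. The bias-correction math
the papers describe is applied afterwards, in the estimator, not shown
here.

    import random

    def whole_block_sample(blocks, block_fraction):
        # blocks: the table as a list of blocks, each block a list of rows.
        # Pick block_fraction of the blocks at random and read every row in
        # each selected block.  Correcting for within-block correlation is
        # the estimator's job; that is what the papers' math is about.
        n_chosen = max(1, int(len(blocks) * block_fraction))
        chosen = random.sample(range(len(blocks)), n_chosen)
        sample = []
        for b in chosen:
            sample.extend(blocks[b])  # sequential read of one whole block
        return sample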
> Worse, my recollection from the paper I mentioned earlier was that
> sampling small percentages like 3-5% didn't get you an acceptable
> accuracy. Before you got anything reliable you found you were sampling
> very large percentages of the table. And note that if you have to sample
> anything over 10-20% you may as well just read the whole table. Random
> access reads are that much slower.
Right, which is why researchers are currently looking for something better.
The Brutlag & Richardson paper claims to be able to produce estimates
which are within a factor of three (in either direction) 90% of the time
using a 5% sample, which is far better than our current accuracy. Nobody
claims to be able to estimate based on less than 0.1% of the table, which
is what Postgres tries to do for large tables.
A 5% block-based sample is reasonable in I/O terms: it means reading a
straight 5% of the on-disk size of the table, so 5GB for a 100GB table.
Getting comparable accuracy with random-row sampling could require reading
as much as 25% of the table, at which point it is easier to just scan the
whole thing.
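
As a rough illustration of why row-level sampling costs so much more I/O
(my own back-of-the-envelope figuring, assuming a hypothetical 100 rows
per block): if rows are picked uniformly at random, the expected fraction
of blocks touched is 1 - (1 - f)^r for a row fraction f and r rows per
block, so even small row fractions scatter reads across a large share of
the table.

    def fraction_of_blocks_touched(row_fraction, rows_per_block=100):
        # Probability that a given block holds at least one sampled row
        # when rows are sampled independently at row_fraction.
        # rows_per_block=100 is an assumed, illustrative figure.
        return 1.0 - (1.0 - row_fraction) ** rows_per_block

    # A 5% block sample reads 5% of the blocks by definition.  A 5% row
    # sample, by contrast, touches nearly every block:
    print(fraction_of_blocks_touched(0.05))    # ~0.99
    # Even a 0.3% row sample touches roughly a quarter of the blocks:
    print(fraction_of_blocks_touched(0.003))   # ~0.26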
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco