Re: Improving N-Distinct estimation by ANALYZE - Mailing list pgsql-hackers

From Greg Stark
Subject Re: Improving N-Distinct estimation by ANALYZE
Date
Msg-id 87d5j64ski.fsf@stark.xeocode.com
In response to Re: Improving N-Distinct estimation by ANALYZE  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Improving N-Distinct estimation by ANALYZE  ("Jim C. Nasby" <jnasby@pervasive.com>)
List pgsql-hackers
Josh Berkus <josh@agliodbs.com> writes:

> > These statements are at odds with my admittedly basic understanding of
> > statistics.  Isn't the power of a sample more related to the absolute size of
> > the sample than the sample as fraction of the population?  Why not just pick
> > a smallish sample size, say about 3000, and apply it to all the tables, even
> > the ones with just a single row (modify appropriately from block sampling).
> 
> Nope, it's definitely proportional.   As a simple example, a sample of 500 rows
> in a table of 1000 rows should yield stats estimates with 90%+ accuracy.  But a
> sample of 500 rows in a 600,000,000 row table is so small as to be nearly
> useless; it's quite possible to get all the same value in a random sample of <
> 0.1% even on a column with a D/N of 0.001.   If you look at the papers cited,
> almost all researchers more recent than Chaudhuri use a proportional sample
> size.

To be clear, Josh is talking specifically about the estimate of how many
distinct values a query will see, not the more usual estimate of how many
records the query will see.

For estimating how many records a query like

 SELECT * FROM tab WHERE x BETWEEN ? AND ?

will return, the fixed-size sample is on fairly solid ground. A sample of 600
gives (iirc) +/- 2% 19 times out of 20. That's the same sample size most major
opinion polls use...
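
The error band for estimating a fraction depends only on the sample size, not
on the table size, which is why a fixed-size sample is enough here. A rough
illustration of the standard binomial margin of error (the sample size and
fractions below are just picked for illustration, not anything ANALYZE
actually computes):

    import math

    def margin_of_error(p, n, z=1.96):
        # 95% margin of error for estimating a fraction p from n sampled rows;
        # note there is no dependence on the total number of rows in the table.
        return z * math.sqrt(p * (1 - p) / n)

    n = 600  # fixed sample size, regardless of table size
    for p in (0.05, 0.10, 0.25, 0.50):
        print("true fraction %.2f: +/- %.3f" % (p, margin_of_error(p, n)))

Whether the table has a thousand rows or 600 million, the band stays in the
low single-digit percentages for n = 600.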

However this same logic doesn't work for estimating distinct values. A single
occurrence of a distinct value is just as important as hundreds of occurrences,
and your chance of finding that single occurrence is proportional to the
percentage of the overall table you sample. So to maintain a given accuracy you
have to sample a fixed percentage of the table, not a fixed number of rows.
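
To put a toy number on it: suppose a column has a pile of values that occur
exactly once. A fixed-size sample finds a share of them proportional to 600/N,
while a 5% sample finds about 5% of them no matter how big the table gets. A
quick Python simulation (the table shape and counts are entirely invented):

    import random

    # Toy column: 1,000,000 rows, 50,000 values occurring exactly once,
    # one common value filling the rest.
    N = 1_000_000
    singletons = set(range(1, 50_001))
    table = [0] * (N - len(singletons)) + list(singletons)
    random.shuffle(table)

    def singletons_found(k):
        # how many of the once-only values a k-row sample happens to hit
        return len(set(random.sample(table, k)) & singletons)

    print("600-row sample finds:", singletons_found(600))      # ~30 of 50,000
    print("5% sample finds:     ", singletons_found(N // 20))  # ~2,500 of 50,000

Double the table size and the fixed-size sample's hit rate on those rare
values halves; the proportional sample's doesn't.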

Worse, my recollection from the paper I mentioned earlier is that sampling
small percentages like 3-5% didn't get you acceptable accuracy; you found you
were sampling very large percentages of the table before you got anything
reliable. And note that if you have to sample anything over 10-20% you may as
well just read the whole table, since random access reads are that much slower.
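
The break-even point falls straight out of the cost ratio between random and
sequential reads. A back-of-the-envelope comparison, assuming a random page
read costs about 4x a sequential one (in the spirit of PostgreSQL's default
random_page_cost; the real ratio depends on your disks):

    SEQ_COST, RANDOM_COST = 1.0, 4.0

    def sample_cost(fraction, pages):
        return fraction * pages * RANDOM_COST   # scattered reads of sampled blocks

    def full_scan_cost(pages):
        return pages * SEQ_COST                 # one sequential pass

    pages = 100_000
    for f in (0.05, 0.10, 0.25, 0.50):
        print("sample %4.0f%%: %9.0f   full scan: %9.0f"
              % (f * 100, sample_cost(f, pages), full_scan_cost(pages)))

With that ratio the break-even is at SEQ_COST / RANDOM_COST = 25% of the
table; with a bigger random-access penalty it drops into the 10-20% range
above.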

-- 
greg


