Re: Improving N-Distinct estimation by ANALYZE - Mailing list pgsql-hackers

From Greg Stark
Subject Re: Improving N-Distinct estimation by ANALYZE
Date
Msg-id 87d5j64ski.fsf@stark.xeocode.com
In response to Re: Improving N-Distinct estimation by ANALYZE  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Improving N-Distinct estimation by ANALYZE  ("Jim C. Nasby" <jnasby@pervasive.com>)
List pgsql-hackers
Josh Berkus <josh@agliodbs.com> writes:

> > These statements are at odds with my admittedly basic understanding of
> > statistics.  Isn't the power of a sample more related to the absolute size of
> > the sample than the sample as fraction of the population?  Why not just pick
> > a smallish sample size, say about 3000, and apply it to all the tables, even
> > the ones with just a single row (modify appropriately from block sampling).
> 
> Nope, it's definitely proportional.   As a simple example, a sample of 500 rows
> in a table of 1000 rows should yield stats estimates with 90%+ accuracy.  But a
> sample of 500 rows in a 600,000,000 row table is so small as to be nearly
> useless; it's quite possible to get all the same value in a random sample of <
> 0.1% even on a column with a D/N of 0.001.   If you look at the papers cited,
> almost all researchers more recent than Chaudhuri use a proportional sample
> size.

To be clear, Josh is talking specifically about the estimate of how many
distinct values a query will see, not the more usual estimate of how many
records the query will see.

For estimating how many records a query like

 SELECT * FROM tab WHERE x BETWEEN ? AND ?

will return, the fixed-size sample is on fairly solid ground. A sample of 600
gives (iirc) +/- 2% 19 times out of 20. That's the same sample size most major
opinion polls use...
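
The error band for estimating a fraction depends only on the sample size, not
on the table size, which is why a fixed-size sample is enough here. A rough
illustration of the standard binomial margin of error (the sample size and
fractions below are just picked for illustration, not anything ANALYZE
actually computes):

    import math

    def margin_of_error(p, n, z=1.96):
        # 95% margin of error for estimating a fraction p from n sampled rows;
        # note there is no dependence on the total number of rows in the table.
        return z * math.sqrt(p * (1 - p) / n)

    n = 600  # fixed sample size, regardless of table size
    for p in (0.05, 0.10, 0.25, 0.50):
        print("true fraction %.2f: +/- %.3f" % (p, margin_of_error(p, n)))

Whether the table has a thousand rows or 600 million, the band stays in the
low single-digit percentages for n = 600.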

However this same logic doesn't work for estimating distinct values. A single
occurrence of a distinct value is just as important as hundreds of occurrences,
and your chance of finding that single occurrence is proportional to the
percentage of the overall table you sample. So to maintain a given accuracy you
have to sample a fixed percentage of the table, not a fixed number of rows.
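
To put a toy number on it: suppose a column has a pile of values that occur
exactly once. A fixed-size sample finds a share of them proportional to 600/N,
while a 5% sample finds about 5% of them no matter how big the table gets. A
quick Python simulation (the table shape and counts are entirely invented):

    import random

    # Toy column: 1,000,000 rows, 50,000 values occurring exactly once,
    # one common value filling the rest.
    N = 1_000_000
    singletons = set(range(1, 50_001))
    table = [0] * (N - len(singletons)) + list(singletons)
    random.shuffle(table)

    def singletons_found(k):
        # how many of the once-only values a k-row sample happens to hit
        return len(set(random.sample(table, k)) & singletons)

    print("600-row sample finds:", singletons_found(600))      # ~30 of 50,000
    print("5% sample finds:     ", singletons_found(N // 20))  # ~2,500 of 50,000

Double the table size and the fixed-size sample's hit rate on those rare
values halves; the proportional sample's doesn't.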

Worse, my recollection from the paper I mentioned earlier is that sampling
small percentages like 3-5% didn't get you acceptable accuracy; you found you
were sampling very large percentages of the table before you got anything
reliable. And note that if you have to sample anything over 10-20% you may as
well just read the whole table, since random access reads are that much slower.
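
The break-even point falls straight out of the cost ratio between random and
sequential reads. A back-of-the-envelope comparison, assuming a random page
read costs about 4x a sequential one (in the spirit of PostgreSQL's default
random_page_cost; the real ratio depends on your disks):

    SEQ_COST, RANDOM_COST = 1.0, 4.0

    def sample_cost(fraction, pages):
        return fraction * pages * RANDOM_COST   # scattered reads of sampled blocks

    def full_scan_cost(pages):
        return pages * SEQ_COST                 # one sequential pass

    pages = 100_000
    for f in (0.05, 0.10, 0.25, 0.50):
        print("sample %4.0f%%: %9.0f   full scan: %9.0f"
              % (f * 100, sample_cost(f, pages), full_scan_cost(pages)))

With that ratio the break-even is at SEQ_COST / RANDOM_COST = 25% of the
table; with a bigger random-access penalty it drops into the 10-20% range
above.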

-- 
greg


