Thread: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

From:
"Dave Held"
Date:

> -----Original Message-----
> From: Josh Berkus [mailto:]
> Sent: Wednesday, April 27, 2005 10:25 AM
> To: Andrew Dunstan
> Cc: Mischa Sandberg; pgsql-perform; 
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
>
> [...]
> Actually, it's more to characterize how large of a sample
> we need.  For example, if we sample 0.005 of disk pages, and
> get an estimate, and then sample another 0.005 of disk pages
> and get an estimate which is not even close to the first
> estimate, then we have an idea that this is a table which
> defies analysis based on small samples.

I buy that.

> Wheras if the two estimates are < 1.0 stdev apart, we can
> have good confidence that the table is easily estimated.

I don't buy that.  A negative indication is nothing more than
proof by contradiction.  A positive indication is mathematical
induction over the set, which in this type of context is
logically unsound.  There is no reason to believe that two
small samples with a small difference imply that a table is
easily estimated rather than that you got unlucky in your
samples.

> [...]
> Yes, actually.   We need 3 different estimation methods:
> 1 for tables where we can sample a large % of pages
> (say, >= 0.1)
> 1 for tables where we sample a small % of pages but are
> "easily estimated"
> 1 for tables which are not easily estimated by we can't
> afford to sample a large % of pages.

I don't buy that the first and second need to be different
estimation methods.  I think you can use the same block
sample estimator for both, and simply stop sampling at
different points.  If you set the default to be a fixed
number of blocks, you could get a large % of pages on
small tables and a small % of pages on large tables, which
is exactly how you define the first two cases.  However,
I think such a default should also be overridable to a
% of the table or a desired accuracy.

Of course, I would recommend the distinct sample technique
for the third case.

> If we're doing sampling-based estimation, I really don't
> want people to lose sight of the fact that page-based random
> sampling is much less expensive than row-based random
> sampling.   We should really be focusing on methods which
> are page-based.

Of course, that savings comes at the expense of having to
account for factors like clustering within blocks.  So block
sampling is more efficient, but can also be less accurate.
Nonetheless, I agree that of the sampling estimators, block
sampling is the better technique.

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129

From:
Greg Stark
Date:

"Dave Held" <> writes:

> > Actually, it's more to characterize how large of a sample
> > we need.  For example, if we sample 0.005 of disk pages, and
> > get an estimate, and then sample another 0.005 of disk pages
> > and get an estimate which is not even close to the first
> > estimate, then we have an idea that this is a table which
> > defies analysis based on small samples.
>
> I buy that.

Better yet is to use the entire sample you've gathered of .01 and then perform
analysis on that sample to see what the confidence interval is. Which is
effectively the same as what you're proposing except looking at every possible
partition.

Unfortunately the reality according to the papers that were sent earlier is
that you will always find the results disappointing. Until your sample is
nearly the entire table your estimates for n_distinct will be extremely
unreliable.

--
greg

From:
Marko Ristola
Date:

First I will comment my original idea.
Second I will give another improved suggestion (an idea).
I hope, that they will be useful for you.

(I don't know, wether the first one was useful at all because it showed,
that I and some others of us are not very good with statistics :( )

I haven't looked about the PostgreSQL code, so I don't know, that what
is possible
now, and what is not. I do know, that the full table scan and after that
incremental
statistics changes are a very big change, without looking at the code.



I meant the following  idea:
- compare two equal sized samples. Then redo the same thing with double
sized samples. So do lots of unnecessary work.
Check out the correlation of the two samples to try to guess the
distribution.

So I tried to give you an idea, not to give you a full answer into the
whole problem.


I did read some parts of the attached PDFs. They did convince me,
that it seems, that the heuristics for the hard cases would actually read
almost the whole table in many cases.

I did cover the "too little sample" problem by stating that the
user should be able to give the minimum size of samples. This way you would
avoid the too small sampling problem. My purpose was not to achieve at
most 5% wrong estimates, but to decrease the 2000% wrong estimates, that
are
seen now sometimes.

Conclusions:
- No heuristics or similar thing of small samples will grant excellent
results.
- If you need excellent estimates, you need to process the whole table!
- Some special cases, like primary keys and the unique indexes and special
case column types do give easy ways to make estimates:
For example, wether a boolean column has zero, one or two distinct
values, it does not matter
so much ??? Hashing seems the right choise for all of them.

If I have understund correctly, the full table scans are out of
questions for large tables at this time.

The percentage idea of taking 10% samples seems good.


So here is another suggestion:
1. Do a full percentage scan, starting at an arbitrary position. If the
user's data is not
homogenous, this hurts it, but this way it is faster.
During that scan, try to figure out all those columns, that have at most
100 distinct values.

Of course, with it you can't go into 100% accuracy, but if the full
table scan is out of question now,
it is better, if the accuracy is for example at most ten times wrong.

You could also improve accuracy by instead of doing a 10% partial table
scan, you could
do 20 pieces of 0,5 percent partial table scans: This would improve
accuracy a bit, but keep
the speed almost the same as the partial table scan.

Here are questions for the statisticians for distinct values calculation:

If we want at most 1000% tolerance, how big percentage of table's one
column must be processed?

If we want at most 500% tolerance, how big percentage of table's one
column must be processed?

If we want at most 250% tolerance, how big percentage of table's one
column must be processed?

Better to assume, that there are at most 100 distinct values on a table,
if it helps calculations.

If we try to get as much with one discontinuous partial table scan
(0,1-10% sample), here is the information, we can gather:

1. We could gather a histogram for max(100) distinct values for each
column for every column.
2. We could measure variance and average, and the number of rows for
these 100 distinct values.
3. We could count the number of rows, that didn't match with these 100
distinct values:
they were left out from the histogram.
4. We could get a minimum and a maximum value for each column.

=> We could get exact information about the sample with one 0,1-10% pass
for many columns.

What you statisticans can gather about these values?
My idea is programmatical combined with statistics:
+ Performance: scan for example 100 blocks each of size 100Mb, because
disc I/O
is much faster this way.
+ Enables larger table percentage. I hope it helps with the statistics
formula.
    Required because of more robust statistics: take those blocks at random
    (not over each other) places to decrease the effect from hitting
into statistically
    bad parts on the table.
+ Less table scan passes: scan all columns with limited hashing in the
first pass.
+ All easy columns are found here with one pass.
+- Harder columns need an own pass each, but we have some preliminary
    knoledge of them on the given sample after all (minimum and maximum
values
    and the histogram of the 100 distinct values).

Marko Ristola

Greg Stark wrote:

>"Dave Held" <> writes:
>
>
>
>>>Actually, it's more to characterize how large of a sample
>>>we need.  For example, if we sample 0.005 of disk pages, and
>>>get an estimate, and then sample another 0.005 of disk pages
>>>and get an estimate which is not even close to the first
>>>estimate, then we have an idea that this is a table which
>>>defies analysis based on small samples.
>>>
>>>
>>I buy that.
>>
>>
>
>Better yet is to use the entire sample you've gathered of .01 and then perform
>analysis on that sample to see what the confidence interval is. Which is
>effectively the same as what you're proposing except looking at every possible
>partition.
>
>Unfortunately the reality according to the papers that were sent earlier is
>that you will always find the results disappointing. Until your sample is
>nearly the entire table your estimates for n_distinct will be extremely
>unreliable.
>
>
>