Re: Odd statistics behaviour in 7.2 - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Odd statistics behaviour in 7.2
Date
Msg-id 29600.1013623356@sss.pgh.pa.us
In response to Odd statistics behaviour in 7.2  ("Gordon A. Runkle" <gar@integrated-dynamics.com>)
Responses Re: Odd statistics behaviour in 7.2  ("Gordon A. Runkle" <gar@integrated-dynamics.com>)
List pgsql-hackers
Gordon Runkle <gar@www.integrated-dynamics.com> writes:
> You can retrieve the dump of my data at: [snipped]

Thanks.  Indeed, the first time I did an ANALYZE I got:
 tablename         | tom_help
 attname           | tdnr
 null_frac         | 0
 avg_width         | 17
 n_distinct        | 56484
 most_common_vals  | {0557088000957,0557700369880}
 most_common_freqs | {0.000666667,0.000666667}
 histogram_bounds  | {0551000386411,0551504108858,0557011074656,0557050633939,0557111430036,0557151012769,0557179871119,0557698138188,0557750158740,0557783053444,0558980779763}
 correlation       | -0.199108

Now, I happen to know that the default size of ANALYZE's statistical
sample is 3000 rows.  What evidently happened here is that the
statistical sampling picked up two values appearing twice (namely,
0557088000957 and 0557700369880); the given frequencies for these two
values establish that they appeared twice in the 3000-row sample.
Since no other values are mentioned in most_common_vals, the remaining
2996 samples must have been values that appeared only once.
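The arithmetic here is easy to check directly (a quick sketch; the 3000-row sample size is the only input, taken from the message above):

```python
# Sanity-check the sample arithmetic behind the pg_stats output above.
sample_rows = 3000           # default ANALYZE sample size

# Each most-common value appeared twice, so its sample frequency is 2/3000.
mcv_freq = 2 / sample_rows
print(round(mcv_freq, 9))    # 0.000666667, matching most_common_freqs

# Two values took 2 rows each; every other sampled row was a singleton.
f2 = 2                       # distinct values seen exactly twice
f1 = sample_rows - 2 * f2    # distinct values seen exactly once
print(f1)                    # 2996
```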

Having computed these raw statistics, ANALYZE has to try to extrapolate the
number of distinct values in the whole table.  What it's using for the
purpose is an equation I found in
"Random sampling for histogram construction: how much is enough?" by Surajit
Chaudhuri, Rajeev Motwani and Vivek Narasayya, in Proceedings of the ACM SIGMOD
International Conference on Management of Data, 1998, pages 436-447,
 

namely

    sqrt(n/r) * max(f1,1) + f2 + f3 + ...

where fk is the number of distinct values that occurred exactly k times
in our sample of r rows (from a total of n).

And indeed you get 56484 when you plug in these numbers.  So the code is
operating as designed, and we can't complain that the sample is an
unreasonable sample given the true underlying distribution.  We have to
blame the equation: evidently this estimation equation doesn't apply
very well to nearly-unique columns.
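The estimator can be sketched in a few lines of Python. Note that the table's total row count n is not stated in the message; the value used below is back-solved from the reported 56484 estimate and is purely a hypothetical illustration:

```python
import math

def chaudhuri_estimate(n, r, f):
    """Chaudhuri/Motwani/Narasayya distinct-count estimator:
    sqrt(n/r) * max(f1,1) + f2 + f3 + ..., where f[k] is the number of
    distinct values seen exactly k times in an r-row sample of an
    n-row table."""
    f1 = f.get(1, 0)
    rest = sum(count for k, count in f.items() if k >= 2)
    return math.sqrt(n / r) * max(f1, 1) + rest

# Gordon's sample: 2996 singletons, 2 values seen twice, r = 3000.
# n is a hypothetical ~1.07M rows, back-solved, not from the message.
est = chaudhuri_estimate(n=1_066_244, r=3000, f={1: 2996, 2: 2})
print(round(est))  # close to the reported 56484

# With no duplicates at all (f1 = r), the formula reduces to sqrt(n*r).
```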

I had already modified the Chaudhuri approach slightly: if the ANALYZE
sample contains no duplicate values at all (ie, f1=r, f2=f3=f4=...=0)
then their equation reduces to sqrt(n*r), but actually ANALYZE assumes
the column is unique (ie, n distinct values, not sqrt(n*r)), which seems
a lot better assumption in practice.  The runs where you got n_distinct
equal to -1 are presumably those where the ANALYZE sample chanced to
contain no duplicates.

I am thinking that we need to find another estimator equation, or at
least shade away from their equation when f1 is close to r.  Ideally the
estimate for f1=r should be a logical extrapolation of the curve for f1
close to r, but right now it's quite discontinuous.
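The size of the discontinuity is easy to see numerically. This sketch assumes a hypothetical 1M-row table with the default 3000-row sample, using the sqrt(n/r)*f1 + f2 + ... formula quoted above:

```python
import math

n, r = 1_000_000, 3000   # hypothetical table size and ANALYZE sample size

def estimate(f1, f_rest):
    """Chaudhuri et al. estimator: sqrt(n/r)*f1 + (f2 + f3 + ...)."""
    return math.sqrt(n / r) * f1 + f_rest

# A single duplicate pair in the sample: f1 = 2998, f2 = 1.
almost_unique = estimate(2998, 1)   # roughly 55 thousand

# No duplicates at all (f1 = r): ANALYZE instead assumes the column
# is unique, i.e. n distinct values.
no_dups = n                         # 1,000,000

# One duplicate pair in the sample moves the estimate by a factor
# of roughly 18 -- the discontinuity described above.
print(round(almost_unique), no_dups)
```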

There was some previous discussion about this, cf the thread at
http://archives.postgresql.org/pgsql-general/2001-10/msg01032.php
but nothing really emerged on how to do better.  The Chaudhuri paper
points out that estimating total number of distinct values from a sample
is inherently a hard problem and subject to large estimation errors,
so it may be that we can't do a lot better.  I think we should be wary
of ad-hoc answers, anyhow.  Something with a little math behind it would
make me feel more comfortable.

Anyone have any thoughts on how to do better?
        regards, tom lane

