Re: Odd statistics behaviour in 7.2 - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Odd statistics behaviour in 7.2
Date
Msg-id 29600.1013623356@sss.pgh.pa.us
In response to Odd statistics behaviour in 7.2  ("Gordon A. Runkle" <gar@integrated-dynamics.com>)
Responses Re: Odd statistics behaviour in 7.2  ("Gordon A. Runkle" <gar@integrated-dynamics.com>)
List pgsql-hackers
Gordon Runkle <gar@www.integrated-dynamics.com> writes:
> You can retrieve the dump of my data at: [snipped]

Thanks.  Indeed, the first time I did an ANALYZE I got:
 tablename         | tom_help
 attname           | tdnr
 null_frac         | 0
 avg_width         | 17
 n_distinct        | 56484
 most_common_vals  | {0557088000957,0557700369880}
 most_common_freqs | {0.000666667,0.000666667}
 histogram_bounds  | {0551000386411,0551504108858,0557011074656,0557050633939,0557111430036,0557151012769,0557179871119,0557698138188,0557750158740,0557783053444,0558980779763}
 correlation       | -0.199108

Now, I happen to know that the default size of ANALYZE's statistical
sample is 3000 rows.  What evidently happened here is that the
statistical sampling picked up two values appearing twice (namely,
0557088000957 and 0557700369880); the given frequencies for these two
values establish that they appeared twice in the 3000-row sample.
Since no other values are mentioned in most_common_vals, the remaining
2996 samples must have been values that appeared only once.
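The arithmetic here is easy to check directly (a quick sketch; the 3000-row sample size is the only input, taken from the message above):

```python
# Sanity-check the sample arithmetic behind the pg_stats output above.
sample_rows = 3000           # default ANALYZE sample size

# Each most-common value appeared twice, so its sample frequency is 2/3000.
mcv_freq = 2 / sample_rows
print(round(mcv_freq, 9))    # 0.000666667, matching most_common_freqs

# Two values took 2 rows each; every other sampled row was a singleton.
f2 = 2                       # distinct values seen exactly twice
f1 = sample_rows - 2 * f2    # distinct values seen exactly once
print(f1)                    # 2996
```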

Having computed these raw statistics, ANALYZE has to try to extrapolate the
number of distinct values in the whole table.  What it's using for the
purpose is an equation I found in
"Random sampling for histogram construction: how much is enough?" by Surajit
Chaudhuri, Rajeev Motwani and Vivek Narasayya, in Proceedings of the ACM SIGMOD
International Conference on Management of Data, 1998, pages 436-447,
 

namely

    sqrt(n/r) * max(f1,1) + f2 + f3 + ...

where fk is the number of distinct values that occurred exactly k times
in our sample of r rows (from a total of n).

And indeed you get 56484 when you plug in these numbers.  So the code is
operating as designed, and we can't complain that the sample is an
unreasonable sample given the true underlying distribution.  We have to
blame the equation: evidently this estimation equation doesn't apply
very well to nearly-unique columns.
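The estimator can be sketched in a few lines of Python. Note that the table's total row count n is not stated in the message; the value used below is back-solved from the reported 56484 estimate and is purely a hypothetical illustration:

```python
import math

def chaudhuri_estimate(n, r, f):
    """Chaudhuri/Motwani/Narasayya distinct-count estimator:
    sqrt(n/r) * max(f1,1) + f2 + f3 + ..., where f[k] is the number of
    distinct values seen exactly k times in an r-row sample of an
    n-row table."""
    f1 = f.get(1, 0)
    rest = sum(count for k, count in f.items() if k >= 2)
    return math.sqrt(n / r) * max(f1, 1) + rest

# Gordon's sample: 2996 singletons, 2 values seen twice, r = 3000.
# n is a hypothetical ~1.07M rows, back-solved, not from the message.
est = chaudhuri_estimate(n=1_066_244, r=3000, f={1: 2996, 2: 2})
print(round(est))  # close to the reported 56484

# With no duplicates at all (f1 = r), the formula reduces to sqrt(n*r).
```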

I had already modified the Chaudhuri approach slightly: if the ANALYZE
sample contains no duplicate values at all (ie, f1=r, f2=f3=f4=...=0)
then their equation reduces to sqrt(n*r), but actually ANALYZE assumes
the column is unique (ie, n distinct values, not sqrt(n*r)), which seems
a lot better assumption in practice.  The runs where you got n_distinct
equal to -1 are presumably those where the ANALYZE sample chanced to
contain no duplicates.

I am thinking that we need to find another estimator equation, or at
least shade away from their equation when f1 is close to r.  Ideally the
estimate for f1=r should be a logical extrapolation of the curve for f1
close to r, but right now it's quite discontinuous.
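The size of the discontinuity is easy to see numerically. This sketch assumes a hypothetical 1M-row table with the default 3000-row sample, using the sqrt(n/r)*f1 + f2 + ... formula quoted above:

```python
import math

n, r = 1_000_000, 3000   # hypothetical table size and ANALYZE sample size

def estimate(f1, f_rest):
    """Chaudhuri et al. estimator: sqrt(n/r)*f1 + (f2 + f3 + ...)."""
    return math.sqrt(n / r) * f1 + f_rest

# A single duplicate pair in the sample: f1 = 2998, f2 = 1.
almost_unique = estimate(2998, 1)   # roughly 55 thousand

# No duplicates at all (f1 = r): ANALYZE instead assumes the column
# is unique, i.e. n distinct values.
no_dups = n                         # 1,000,000

# One duplicate pair in the sample moves the estimate by a factor
# of roughly 18 -- the discontinuity described above.
print(round(almost_unique), no_dups)
```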

There was some previous discussion about this, cf the thread at
http://archives.postgresql.org/pgsql-general/2001-10/msg01032.php
but nothing really emerged on how to do better.  The Chaudhuri paper
points out that estimating total number of distinct values from a sample
is inherently a hard problem and subject to large estimation errors,
so it may be that we can't do a lot better.  I think we should be wary
of ad-hoc answers, anyhow.  Something with a little math behind it would
make me feel more comfortable.

Anyone have any thoughts on how to do better?
        regards, tom lane

