Re: Odd statistics behaviour in 7.2 - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Odd statistics behaviour in 7.2
Date
Msg-id 16868.1013818182@sss.pgh.pa.us
Whole thread Raw
In response to Re: Odd statistics behaviour in 7.2  ("Gordon A. Runkle" <gar@integrated-dynamics.com>)
Responses Re: Odd statistics behaviour in 7.2  ("Gordon A. Runkle" <gar@integrated-dynamics.com>)
Re: Odd statistics behaviour in 7.2  (Bruno Wolff III <bruno@wolff.to>)
List pgsql-hackers
"Gordon A. Runkle" <gar@integrated-dynamics.com> writes:
> [ tale of poor statistical results in a near-unique column ]

After some further digging in the literature I have come across a
number-of-distinct-values estimator that I like better than Chaudhuri's.
It runs like this:
                   n*d / (n - f1 + f1*n/N)            where f1 is the number of distinct values that occurred
exactly once in our sample of n rows (from a total of N),            and d is the total number of distinct values in
thesample.
 

The nice properties this has include:

1. At f1=0 (all values in sample occurred more than once), the estimate
reduces to d, which I had already determined to be a sensible result.

2. At f1=n (all values were distinct, hence also d=n), the estimate
reduces to N, which again is the most reasonable estimate.

3. The equation is numerically stable even when n is much smaller than
N, because the only cancellation is in the term (n - f1) which we can
compute exactly.  A lot of the other equations I looked at depend on
series like (1 - n/N)**i which are going to be really nasty when n/N
is tiny.

In particular, point 2 means that this equation should perform
reasonably well for nearly-unique columns (f1 approaching n),
which was the case you were seeing bad results for.

Attached is a patch that implements this revised equation.  Would
appreciate any feedback...
        regards, tom lane

*** src/backend/commands/analyze.c.orig    Sat Jan  5 19:37:44 2002
--- src/backend/commands/analyze.c    Fri Feb 15 18:46:45 2002
***************
*** 1009,1018 ****         {             /*----------              * Estimate the number of distinct values using the
estimator
!              * proposed by Chaudhuri et al (see citation above).  This is
!              *        sqrt(n/r) * max(f1,1) + f2 + f3 + ...
!              * where fk is the number of distinct values that occurred
!              * exactly k times in our sample of r rows (from a total of n).              * We assume (not very
reliably!)that all the multiply-occurring              * values are reflected in the final track[] list, and the other
           * nonnull values all appeared but once.  (XXX this usually
 
--- 1009,1023 ----         {             /*----------              * Estimate the number of distinct values using the
estimator
!              * proposed by Haas and Stokes in IBM Research Report RJ 10025:
!              *        n*d / (n - f1 + f1*n/N)
!              * where f1 is the number of distinct values that occurred
!              * exactly once in our sample of n rows (from a total of N),
!              * and d is the total number of distinct values in the sample.
!              * This is their Duj1 estimator; the other estimators they
!              * recommend are considerably more complex, and are numerically
!              * very unstable when n is much smaller than N.
!              *              * We assume (not very reliably!) that all the multiply-occurring              * values
arereflected in the final track[] list, and the other              * nonnull values all appeared but once.  (XXX this
usually
***************
*** 1021,1032 ****              *----------              */             int            f1 = nonnull_cnt - summultiple;
!             double        term1; 
!             if (f1 < 1)
!                 f1 = 1;
!             term1 = sqrt(totalrows / (double) numrows) * f1;
!             stats->stadistinct = floor(term1 + nmultiple + 0.5);         }          /*
--- 1026,1044 ----              *----------              */             int            f1 = nonnull_cnt - summultiple;
!             int            d = f1 + nmultiple;
!             double        numer, denom, stadistinct; 
!             numer = (double) numrows * (double) d;
!             denom = (double) (numrows - f1) +
!                 (double) f1 * (double) numrows / totalrows;
!             stadistinct = numer / denom;
!             /* Clamp to sane range in case of roundoff error */
!             if (stadistinct < (double) d)
!                 stadistinct = (double) d;
!             if (stadistinct > totalrows)
!                 stadistinct = totalrows;
!             stats->stadistinct = floor(stadistinct + 0.5);         }          /*
***************
*** 1313,1332 ****         {             /*----------              * Estimate the number of distinct values using the
estimator
!              * proposed by Chaudhuri et al (see citation above).  This is
!              *        sqrt(n/r) * max(f1,1) + f2 + f3 + ...
!              * where fk is the number of distinct values that occurred
!              * exactly k times in our sample of r rows (from a total of n).              * Overwidth values are
assumedto have been distinct.              *----------              */             int            f1 = ndistinct -
nmultiple+ toowide_cnt;
 
!             double        term1; 
!             if (f1 < 1)
!                 f1 = 1;
!             term1 = sqrt(totalrows / (double) numrows) * f1;
!             stats->stadistinct = floor(term1 + nmultiple + 0.5);         }          /*
--- 1325,1356 ----         {             /*----------              * Estimate the number of distinct values using the
estimator
!              * proposed by Haas and Stokes in IBM Research Report RJ 10025:
!              *        n*d / (n - f1 + f1*n/N)
!              * where f1 is the number of distinct values that occurred
!              * exactly once in our sample of n rows (from a total of N),
!              * and d is the total number of distinct values in the sample.
!              * This is their Duj1 estimator; the other estimators they
!              * recommend are considerably more complex, and are numerically
!              * very unstable when n is much smaller than N.
!              *              * Overwidth values are assumed to have been distinct.              *----------
 */             int            f1 = ndistinct - nmultiple + toowide_cnt;
 
!             int            d = f1 + nmultiple;
!             double        numer, denom, stadistinct; 
!             numer = (double) numrows * (double) d;
!             denom = (double) (numrows - f1) +
!                 (double) f1 * (double) numrows / totalrows;
!             stadistinct = numer / denom;
!             /* Clamp to sane range in case of roundoff error */
!             if (stadistinct < (double) d)
!                 stadistinct = (double) d;
!             if (stadistinct > totalrows)
!                 stadistinct = totalrows;
!             stats->stadistinct = floor(stadistinct + 0.5);         }          /*


pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Parser conflicts in extended GRANT statement
Next
From: "Marc G. Fournier"
Date:
Subject: Re: Ready to branch 7.2/7.3 ?