Re: On Distributions In 7.2.1 - Mailing list pgsql-general

From: Tom Lane
Subject: Re: On Distributions In 7.2.1
Msg-id: 27010.1020524887@sss.pgh.pa.us
In response to: Re: On Distributions In 7.2.1 (Mark Kirkwood <markir@slingshot.co.nz>)
List: pgsql-general
Mark kirkwood <markir@slingshot.co.nz> writes:
> I wonder if it's worth a mention in the docs to the effect:

> "ANALYZE with the default (10 or so) is ok for most cases, but for big
> tables consider using ALTER ... SET STATISTICS 100 for commonly JOINed
> or WHEREed columns"

Could be.  So far we have very little experience with setting the
statistics target --- the default of 10 was just picked out of the
air, and might not be the best general-purpose target.  Or it might
be that 10 is fine, but the multiplier that's used to go from that
to the number of rows to sample isn't large enough.  The current
multiplier is 300, i.e., we sample 3000 rows by default:

        /*--------------------
         * The following choice of minrows is based on the paper
         * "Random sampling for histogram construction: how much is enough?"
         * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
         * Proceedings of ACM SIGMOD International Conference on Management
         * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
         * says that for table size n, histogram size k, maximum relative
         * error in bin size f, and error probability gamma, the minimum
         * random sample size is
         *        r = 4 * k * ln(2*n/gamma) / f^2
         * Taking f = 0.5, gamma = 0.01, n = 1 million rows, we obtain
         *        r = 305.82 * k
         * Note that because of the log function, the dependence on n is
         * quite weak; even at n = 1 billion, a 300*k sample gives <= 0.59
         * bin size error with probability 0.99.  So there's no real need to
         * scale for n, which is a good thing because we don't necessarily
         * know it at this point.
         *--------------------
         */
        stats->minrows = 300 * attr->attstattarget;

f = 0.5 isn't a very ambitious target, but because of the 1/f^2
dependency, pushing it down to say 0.1 would be expensive --- the
multiplier would become 7650 or so instead of 300.

It could be that we need two knobs here not just one --- perhaps people
want to control the accuracy of the stats without necessarily making
the number of values stored larger.  Or it could be that we don't
need to do anything.  The values you were quoting look plenty close
enough to reality for the planner's purposes.  (But this is a
particularly simple case; harder cases may show worse error...)

            regards, tom lane
