Thread: AW: Call for alpha testing: planner statistics revisions
First of all thanks for the great effort, it will surely be appreciated :-)

> * On large tables, ANALYZE uses a random sample of rows rather than
> examining every row, so that it should take a reasonably short time
> even on very large tables. Possible downside: inaccurate stats.
> We need to find out if the sample size is large enough.

Imho that is not optimal :-) ** ducks head, to evade flying hammer **

1. the random sample approach should be explicitly requested with some
   syntax extension
2. the sample size should also be tuneable with some analyze syntax
   extension (the dba chooses the tradeoff between accuracy and runtime)
3. if at all, an automatic analyze should do the samples on small tables,
   and accurate stats on large tables

The reasoning behind this is that when the optimizer makes a "mistake"
on small tables, the runtime penalty is small, and probably even beats
the cost of an accurate statistics lookup. (3 page table --> no stats
except table size needed) When, on the other hand, the optimizer makes
a "mistake" on a huge table, the difference is easily a matter of hours,
thus you want accurate stats.

Because we do not want the dba to decide which statistics are optimal,
there should probably be an analyze helper application that is invoked
with "vacuum analyze database optimal" or some such, and that also
decides whether a table was sufficiently altered to justify new stats
gathering or a vacuum. The decision of what to do may also be based on
a runtime limit that the dba specifies ("do the most important
stats/vacuums you can do within ~3 hours").

These points are also based on experience with huge SAP/R3 installations
and the way statistics are gathered there.

Andreas
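For illustration only, here is a minimal sketch of the kind of "analyze
helper" Andreas describes: rank tables by how much they have changed since
their last stats run and greedily fit the most important work into a
DBA-supplied time budget. The table names, change counts, cost estimates,
and the 5% "sufficiently altered" cutoff are all invented for this example;
nothing like this exists in the thread or in PostgreSQL itself.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-table bookkeeping, invented for this sketch. */
typedef struct
{
    const char *name;
    long        rows_changed_since_stats;   /* inserts + updates + deletes */
    long        rows_total;
    double      est_minutes;                 /* guessed ANALYZE/VACUUM cost */
} TableInfo;

static int
by_change_fraction(const void *a, const void *b)
{
    const TableInfo *ta = a;
    const TableInfo *tb = b;
    double fa = (double) ta->rows_changed_since_stats / ta->rows_total;
    double fb = (double) tb->rows_changed_since_stats / tb->rows_total;

    return (fa < fb) - (fa > fb);   /* sort descending by change fraction */
}

int
main(void)
{
    /* Invented example data -- not real catalog contents. */
    TableInfo tables[] = {
        {"orders",      5000000,  80000000, 55.0},
        {"order_line",  9000000, 400000000, 140.0},
        {"customer",     300000,   2000000, 4.0},
        {"ref_country",       3,       250, 0.1},
    };
    int    ntables = (int) (sizeof(tables) / sizeof(tables[0]));
    double budget_minutes = 180.0;  /* "do what you can within ~3 hours" */

    qsort(tables, ntables, sizeof(TableInfo), by_change_fraction);

    for (int i = 0; i < ntables; i++)
    {
        double frac = (double) tables[i].rows_changed_since_stats
            / tables[i].rows_total;

        if (frac < 0.05)
            continue;   /* not "sufficiently altered" (arbitrary 5% cutoff) */
        if (tables[i].est_minutes > budget_minutes)
            continue;   /* does not fit in the remaining time budget */

        printf("ANALYZE %s  (%.1f%% changed, ~%.0f min)\n",
               tables[i].name, 100.0 * frac, tables[i].est_minutes);
        budget_minutes -= tables[i].est_minutes;
    }
    return 0;
}

As Tom asks later in the thread, the hard part is choosing the
"sufficiently altered" threshold that this sketch simply hard-codes.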
On Mon, 18 Jun 2001, Zeugswetter Andreas SB wrote:

> First of all thanks for the great effort, it will surely be appreciated :-)
>
> > * On large tables, ANALYZE uses a random sample of rows rather than
> > examining every row, so that it should take a reasonably short time
> > even on very large tables. Possible downside: inaccurate stats.
> > We need to find out if the sample size is large enough.
>
> Imho that is not optimal :-) ** ducks head, to evade flying hammer **
> 1. the random sample approach should be explicitly requested with some
>    syntax extension
> 2. the sample size should also be tuneable with some analyze syntax
>    extension (the dba chooses the tradeoff between accuracy and runtime)
> 3. if at all, an automatic analyze should do the samples on small tables,
>    and accurate stats on large tables
>
> The reasoning behind this is that when the optimizer makes a "mistake"
> on small tables, the runtime penalty is small, and probably even beats
> the cost of an accurate statistics lookup. (3 page table --> no stats
> except table size needed)

I disagree. As the Monte Carlo method shows, _as long as you_ query
random rows, your result will be sufficiently close to the real
statistics. I'm not sure if I can find the math behind this, though...

-alex
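Alex's point can be illustrated with a toy experiment (not from the thread;
the synthetic column and the 25% selectivity are made up): a fixed-size
random sample estimates a selectivity about as accurately on a
ten-million-row "table" as on a hundred-thousand-row one, because the
sampling error depends on the sample size, not on the table size.

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    const long table_sizes[] = {100000L, 1000000L, 10000000L};
    const int  sample_rows   = 3000;  /* 300 * default SET STATISTICS target of 10 */

    srandom(42);
    for (int t = 0; t < 3; t++)
    {
        long           n = table_sizes[t];
        unsigned char *col = malloc(n);
        long           true_hits = 0,
                       sample_hits = 0;

        if (col == NULL)
            return 1;

        /* Fill a synthetic column with values 0..99 (about 25% below 25). */
        for (long i = 0; i < n; i++)
        {
            col[i] = (unsigned char) (random() % 100);
            if (col[i] < 25)
                true_hits++;
        }

        /* Estimate the selectivity of "value < 25" from a small random sample. */
        for (int i = 0; i < sample_rows; i++)
            if (col[random() % n] < 25)
                sample_hits++;

        printf("n = %8ld: true = %.4f, sampled = %.4f\n",
               n, (double) true_hits / n,
               (double) sample_hits / sample_rows);
        free(col);
    }
    return 0;
}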
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
> Imho that is not optimal :-) ** ducks head, to evade flying hammer **
> 1. the random sample approach should be explicitly requested with some
>    syntax extension

I don't think so ... with the current implementation you *must* do
approximate ANALYZE for large tables, or face memory overflow. We can
debate where the threshold should be, but you can't get around the fact
that approximation is essential with large tables.

> 2. the sample size should also be tuneable with some analyze syntax
>    extension (the dba chooses the tradeoff between accuracy and runtime)

The sample size is already driven by the largest SET STATISTICS value
for any of the columns of the table being analyzed. I'm not sure if we
need a user-tweakable multiplier or not. The current multiplier is 300
(ie, 3000 sample rows with the default SET STATISTICS target of 10).
This is not a random choice; there is some theory behind it:

 * The following choice of minrows is based on the paper
 * "Random sampling for histogram construction: how much is enough?"
 * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
 * Proceedings of ACM SIGMOD International Conference on Management
 * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
 * says that for table size n, histogram size k, maximum relative
 * error in bin size f, and error probability gamma, the minimum
 * random sample size is
 *      r = 4 * k * ln(2*n/gamma) / f^2
 * Taking f = 0.5, gamma = 0.01, n = 1 million rows, we obtain
 *      r = 305.82 * k
 * Note that because of the log function, the dependence on n is
 * quite weak; even at n = 1 billion, a 300*k sample gives <= 0.59
 * bin size error with probability 0.99.  So there's no real need to
 * scale for n, which is a good thing because we don't necessarily
 * know it at this point.

> 3. if at all, an automatic analyze should do the samples on small tables,
>    and accurate stats on large tables

Other way 'round, surely? It already does that: if your table has fewer
rows than the sampling target, they all get used.

> When, on the other hand, the optimizer makes a "mistake" on a huge table,
> the difference is easily a matter of hours, thus you want accurate stats.

Not if it takes hours to get the stats. I'm more interested in keeping
ANALYZE cheap and encouraging DBAs to run it frequently, so that the
stats stay up-to-date. It doesn't matter how perfect the stats were
when they were made, if the table has changed since then.

> Because we do not want the dba to decide which statistics are optimal,
> there should probably be an analyze helper application that is invoked
> with "vacuum analyze database optimal" or some such, and that also
> decides whether a table was sufficiently altered to justify new stats
> gathering or a vacuum.

And on what are you going to base "sufficiently altered"?

			regards, tom lane
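As a side note, the constants Tom quotes can be checked mechanically. The
sketch below (variable names are mine, not taken from analyze.c) just plugs
numbers into the paper's formula: with f = 0.5, gamma = 0.01 and n = 1e6 it
prints r = 305.82 per histogram bin, and solving the same formula for f at
n = 1e9 with a fixed 300*k sample gives the ~0.59 bin-size error mentioned
in the comment.

#include <math.h>
#include <stdio.h>

/* Minimum random sample size from Chaudhuri, Motwani and Narasayya,
 * Corollary 1 to Theorem 5:  r = 4 * k * ln(2*n/gamma) / f^2. */
static double
min_sample_rows(double n, double k, double f, double gamma)
{
    return 4.0 * k * log(2.0 * n / gamma) / (f * f);
}

int
main(void)
{
    double k = 1.0;     /* per histogram bin; scale by the SET STATISTICS target */

    /* f = 0.5, gamma = 0.01, n = 1 million rows  ->  prints 305.82 */
    printf("r/k at n = 1e6: %.2f\n", min_sample_rows(1e6, k, 0.5, 0.01));

    /* Invert the formula: with a fixed 300*k sample at n = 1 billion rows,
     * what relative bin-size error f holds with probability 0.99?  ~0.59 */
    printf("f at n = 1e9 with r = 300*k: %.2f\n",
           sqrt(4.0 * k * log(2.0 * 1e9 / 0.01) / (300.0 * k)));

    return 0;
}

The weak logarithmic dependence on n is what makes a fixed multiplier of
300 defensible even without knowing the table size in advance.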