Re: AW: Call for alpha testing: planner statistics revision s - Mailing list pgsql-hackers

From Tom Lane
Subject Re: AW: Call for alpha testing: planner statistics revision s
Msg-id 1731.992874911@sss.pgh.pa.us
In response to AW: Call for alpha testing: planner statistics revision s  (Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>)
List pgsql-hackers
Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at> writes:
> Imho that is not optimal :-) ** ducks head, to evade flying hammer **
> 1. the random sample approach should be explicitly requested with some 
> syntax extension

I don't think so ... with the current implementation you *must* do
approximate ANALYZE for large tables, or face memory overflow.
We can debate where the threshold should be, but you can't get around
the fact that approximation is essential with large tables.

> 2. the sample size should also be tuneable with some analyze syntax 
> extension (the dba chooses the tradeoff between accuracy and runtime)

The sample size is already driven by the largest SET STATISTICS value
for any of the columns of the table being analyzed.  I'm not sure if we
need a user-tweakable multiplier or not.  The current multiplier is 300
(ie, 3000 sample rows with the default SET STATISTICS target of 10).
This is not a random choice; there is some theory behind it:
    * The following choice of minrows is based on the paper
    * "Random sampling for histogram construction: how much is enough?"
    * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
    * Proceedings of ACM SIGMOD International Conference on Management
    * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
    * says that for table size n, histogram size k, maximum relative
    * error in bin size f, and error probability gamma, the minimum
    * random sample size is
    *    r = 4 * k * ln(2*n/gamma) / f^2
    * Taking f = 0.5, gamma = 0.01, n = 1 million rows, we obtain
    *    r = 305.82 * k
    * Note that because of the log function, the dependence on n is
    * quite weak; even at n = 1 billion, a 300*k sample gives <= 0.59
    * bin size error with probability 0.99.  So there's no real need to
    * scale for n, which is a good thing because we don't necessarily
    * know it at this point.
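
For anyone who wants to check the arithmetic, here is a quick standalone
sketch (my own throwaway C, not PostgreSQL source; the function name is made
up) that plugs the comment's numbers into the formula:

/*
 * Chaudhuri et al.'s bound: r = 4 * k * ln(2*n/gamma) / f^2
 */
#include <math.h>
#include <stdio.h>

/* Minimum random sample size for table size n, histogram size k,
 * relative bin-size error f, and error probability gamma. */
static double
min_sample_rows(double n, double k, double f, double gamma)
{
    return 4.0 * k * log(2.0 * n / gamma) / (f * f);
}

int
main(void)
{
    /* f = 0.5, gamma = 0.01, n = 1 million: r/k comes out to ~305.82,
     * hence the multiplier of 300; with the default SET STATISTICS
     * target of 10 that means roughly 3000 sample rows. */
    printf("r/k at n = 1e6: %.2f\n",
           min_sample_rows(1e6, 1.0, 0.5, 0.01));

    /* Holding the sample at 300*k rows while n grows to 1 billion, the
     * achievable bin-size error is sqrt(4 * ln(2*n/gamma) / 300). */
    printf("f at n = 1e9 with 300*k rows: %.2f\n",
           sqrt(4.0 * log(2e9 / 0.01) / 300.0));

    return 0;
}

Compiled with -lm, this should print 305.82 and 0.59, matching the figures
quoted above.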
 

> 3. if at all, an automatic analyze should do the samples on small tables,
> and accurate stats on large tables

Other way 'round, surely?  It already does that: if your table has fewer
rows than the sampling target, they all get used.

> When on the other hand the optimizer makes a "mistake" on a huge table,
> the difference is easily a matter of hours, thus you want accurate stats.

Not if it takes hours to get the stats.  I'm more interested in keeping
ANALYZE cheap and encouraging DBAs to run it frequently, so that the
stats stay up-to-date.  It doesn't matter how perfect the stats were
when they were made, if the table has changed since then.

> Because we do not want the dba to decide which statistics are optimal,
> there should probably be an analyze helper application that is invoked
> with "vacuum analyze database optimal" or some such, that also decides 
> whether a table was sufficiently altered to justify new stats gathering
> or vacuum.

And on what are you going to base "sufficiently altered"?
        regards, tom lane

