Thread: AW: Call for alpha testing: planner statistics revisions
First of all thanks for the great effort, it will surely be appreciated :-)

> * On large tables, ANALYZE uses a random sample of rows rather than
> examining every row, so that it should take a reasonably short time
> even on very large tables. Possible downside: inaccurate stats.
> We need to find out if the sample size is large enough.

Imho that is not optimal :-) ** ducks head, to evade flying hammer **

1. the random sample approach should be explicitly requested with some
   syntax extension
2. the sample size should also be tuneable with some analyze syntax
   extension (the dba chooses the tradeoff between accuracy and runtime)
3. if at all, an automatic analyze should do the samples on small tables,
   and accurate stats on large tables

The reasoning behind this is that when the optimizer makes a "mistake"
on small tables, the runtime penalty is small, and probably even beats
the cost of an accurate statistics lookup. (3 page table --> no stats
except table size needed) When, on the other hand, the optimizer makes
a "mistake" on a huge table, the difference is easily a matter of hours,
thus you want accurate stats.

Because we do not want the dba to decide which statistics are optimal,
there should probably be an analyze helper application that is invoked
with "vacuum analyze database optimal" or some such, and that also
decides whether a table was sufficiently altered to justify new stats
gathering or a vacuum. The decision of what to do may also be based on
a runtime limit that the dba specifies ("do the most important
stats/vacuums you can do within ~3 hours").

These points are also based on experience with huge SAP/R3 installations
and the way statistics are gathered there.

Andreas
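For illustration only, here is a minimal sketch of the kind of "analyze
helper" Andreas describes: rank tables by how much they have changed since
their last stats run and greedily fit the most important work into a
DBA-supplied time budget. The table names, change counts, cost estimates,
and the 5% "sufficiently altered" cutoff are all invented for this example;
nothing like this exists in the thread or in PostgreSQL itself.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-table bookkeeping, invented for this sketch. */
typedef struct
{
    const char *name;
    long        rows_changed_since_stats;   /* inserts + updates + deletes */
    long        rows_total;
    double      est_minutes;                 /* guessed ANALYZE/VACUUM cost */
} TableInfo;

static int
by_change_fraction(const void *a, const void *b)
{
    const TableInfo *ta = a;
    const TableInfo *tb = b;
    double fa = (double) ta->rows_changed_since_stats / ta->rows_total;
    double fb = (double) tb->rows_changed_since_stats / tb->rows_total;

    return (fa < fb) - (fa > fb);   /* sort descending by change fraction */
}

int
main(void)
{
    /* Invented example data -- not real catalog contents. */
    TableInfo tables[] = {
        {"orders",      5000000,  80000000, 55.0},
        {"order_line",  9000000, 400000000, 140.0},
        {"customer",     300000,   2000000, 4.0},
        {"ref_country",       3,       250, 0.1},
    };
    int    ntables = (int) (sizeof(tables) / sizeof(tables[0]));
    double budget_minutes = 180.0;  /* "do what you can within ~3 hours" */

    qsort(tables, ntables, sizeof(TableInfo), by_change_fraction);

    for (int i = 0; i < ntables; i++)
    {
        double frac = (double) tables[i].rows_changed_since_stats
            / tables[i].rows_total;

        if (frac < 0.05)
            continue;   /* not "sufficiently altered" (arbitrary 5% cutoff) */
        if (tables[i].est_minutes > budget_minutes)
            continue;   /* does not fit in the remaining time budget */

        printf("ANALYZE %s  (%.1f%% changed, ~%.0f min)\n",
               tables[i].name, 100.0 * frac, tables[i].est_minutes);
        budget_minutes -= tables[i].est_minutes;
    }
    return 0;
}

As Tom asks later in the thread, the hard part is choosing the
"sufficiently altered" threshold that this sketch simply hard-codes.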
On Mon, 18 Jun 2001, Zeugswetter Andreas SB wrote:

> First of all thanks for the great effort, it will surely be appreciated :-)
>
> > * On large tables, ANALYZE uses a random sample of rows rather than
> > examining every row, so that it should take a reasonably short time
> > even on very large tables. Possible downside: inaccurate stats.
> > We need to find out if the sample size is large enough.
>
> Imho that is not optimal :-) ** ducks head, to evade flying hammer **
> 1. the random sample approach should be explicitly requested with some
>    syntax extension
> 2. the sample size should also be tuneable with some analyze syntax
>    extension (the dba chooses the tradeoff between accuracy and runtime)
> 3. if at all, an automatic analyze should do the samples on small tables,
>    and accurate stats on large tables
>
> The reasoning behind this is that when the optimizer makes a "mistake"
> on small tables, the runtime penalty is small, and probably even beats
> the cost of an accurate statistics lookup. (3 page table --> no stats
> except table size needed)

I disagree. As the Monte Carlo method shows, _as long as you_ query
random rows, your result will be sufficiently close to the real
statistics. I'm not sure if I can find the math behind this, though...

-alex
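Alex's point can be illustrated with a toy experiment (not from the thread;
the synthetic column and the 25% selectivity are made up): a fixed-size
random sample estimates a selectivity about as accurately on a
ten-million-row "table" as on a hundred-thousand-row one, because the
sampling error depends on the sample size, not on the table size.

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    const long table_sizes[] = {100000L, 1000000L, 10000000L};
    const int  sample_rows   = 3000;  /* 300 * default SET STATISTICS target of 10 */

    srandom(42);
    for (int t = 0; t < 3; t++)
    {
        long           n = table_sizes[t];
        unsigned char *col = malloc(n);
        long           true_hits = 0,
                       sample_hits = 0;

        if (col == NULL)
            return 1;

        /* Fill a synthetic column with values 0..99 (about 25% below 25). */
        for (long i = 0; i < n; i++)
        {
            col[i] = (unsigned char) (random() % 100);
            if (col[i] < 25)
                true_hits++;
        }

        /* Estimate the selectivity of "value < 25" from a small random sample. */
        for (int i = 0; i < sample_rows; i++)
            if (col[random() % n] < 25)
                sample_hits++;

        printf("n = %8ld: true = %.4f, sampled = %.4f\n",
               n, (double) true_hits / n,
               (double) sample_hits / sample_rows);
        free(col);
    }
    return 0;
}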
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
> Imho that is not optimal :-) ** ducks head, to evade flying hammer **
> 1. the random sample approach should be explicitly requested with some
>    syntax extension

I don't think so ... with the current implementation you *must* do
approximate ANALYZE for large tables, or face memory overflow. We can
debate where the threshold should be, but you can't get around the fact
that approximation is essential with large tables.

> 2. the sample size should also be tuneable with some analyze syntax
>    extension (the dba chooses the tradeoff between accuracy and runtime)

The sample size is already driven by the largest SET STATISTICS value
for any of the columns of the table being analyzed. I'm not sure if we
need a user-tweakable multiplier or not. The current multiplier is 300
(ie, 3000 sample rows with the default SET STATISTICS target of 10).
This is not a random choice; there is some theory behind it:

 * The following choice of minrows is based on the paper
 * "Random sampling for histogram construction: how much is enough?"
 * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
 * Proceedings of ACM SIGMOD International Conference on Management
 * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
 * says that for table size n, histogram size k, maximum relative
 * error in bin size f, and error probability gamma, the minimum
 * random sample size is
 *      r = 4 * k * ln(2*n/gamma) / f^2
 * Taking f = 0.5, gamma = 0.01, n = 1 million rows, we obtain
 *      r = 305.82 * k
 * Note that because of the log function, the dependence on n is
 * quite weak; even at n = 1 billion, a 300*k sample gives <= 0.59
 * bin size error with probability 0.99.  So there's no real need to
 * scale for n, which is a good thing because we don't necessarily
 * know it at this point.

> 3. if at all, an automatic analyze should do the samples on small tables,
>    and accurate stats on large tables

Other way 'round, surely? It already does that: if your table has fewer
rows than the sampling target, they all get used.

> When, on the other hand, the optimizer makes a "mistake" on a huge table,
> the difference is easily a matter of hours, thus you want accurate stats.

Not if it takes hours to get the stats. I'm more interested in keeping
ANALYZE cheap and encouraging DBAs to run it frequently, so that the
stats stay up-to-date. It doesn't matter how perfect the stats were
when they were made, if the table has changed since then.

> Because we do not want the dba to decide which statistics are optimal,
> there should probably be an analyze helper application that is invoked
> with "vacuum analyze database optimal" or some such, and that also
> decides whether a table was sufficiently altered to justify new stats
> gathering or a vacuum.

And on what are you going to base "sufficiently altered"?

			regards, tom lane
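As a side note, the constants Tom quotes can be checked mechanically. The
sketch below (variable names are mine, not taken from analyze.c) just plugs
numbers into the paper's formula: with f = 0.5, gamma = 0.01 and n = 1e6 it
prints r = 305.82 per histogram bin, and solving the same formula for f at
n = 1e9 with a fixed 300*k sample gives the ~0.59 bin-size error mentioned
in the comment.

#include <math.h>
#include <stdio.h>

/* Minimum random sample size from Chaudhuri, Motwani and Narasayya,
 * Corollary 1 to Theorem 5:  r = 4 * k * ln(2*n/gamma) / f^2. */
static double
min_sample_rows(double n, double k, double f, double gamma)
{
    return 4.0 * k * log(2.0 * n / gamma) / (f * f);
}

int
main(void)
{
    double k = 1.0;     /* per histogram bin; scale by the SET STATISTICS target */

    /* f = 0.5, gamma = 0.01, n = 1 million rows  ->  prints 305.82 */
    printf("r/k at n = 1e6: %.2f\n", min_sample_rows(1e6, k, 0.5, 0.01));

    /* Invert the formula: with a fixed 300*k sample at n = 1 billion rows,
     * what relative bin-size error f holds with probability 0.99?  ~0.59 */
    printf("f at n = 1e9 with r = 300*k: %.2f\n",
           sqrt(4.0 * k * log(2.0 * 1e9 / 0.01) / (300.0 * k)));

    return 0;
}

The weak logarithmic dependence on n is what makes a fixed multiplier of
300 defensible even without knowing the table size in advance.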