Re: AW: Call for alpha testing: planner statistics revision s - Mailing list pgsql-hackers
| From | Tom Lane |
|---|---|
| Subject | Re: AW: Call for alpha testing: planner statistics revision s |
| Date | |
| Msg-id | 1731.992874911@sss.pgh.pa.us |
| In response to | AW: Call for alpha testing: planner statistics revision s (Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>) |
| List | pgsql-hackers |
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
> Imho that is not optimal :-)  ** ducks head, to evade flying hammer **

> 1. the random sample approach should be explicitly requested with some
> syntax extension

I don't think so ... with the current implementation you *must* do
approximate ANALYZE for large tables, or face memory overflow.  We can
debate where the threshold should be, but you can't get around the fact
that approximation is essential with large tables.

> 2. the sample size should also be tuneable with some analyze syntax
> extension (the dba chooses the tradeoff between accuracy and runtime)

The sample size is already driven by the largest SET STATISTICS value
for any of the columns of the table being analyzed.  I'm not sure if we
need a user-tweakable multiplier or not.  The current multiplier is 300
(ie, 3000 sample rows with the default SET STATISTICS target of 10).
This is not a random choice; there is some theory behind it:

 * The following choice of minrows is based on the paper
 * "Random sampling for histogram construction: how much is enough?"
 * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
 * Proceedings of ACM SIGMOD International Conference on Management
 * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
 * says that for table size n, histogram size k, maximum relative
 * error in bin size f, and error probability gamma, the minimum
 * random sample size is
 *        r = 4 * k * ln(2*n/gamma) / f^2
 * Taking f = 0.5, gamma = 0.01, n = 1 million rows, we obtain
 *        r = 305.82 * k
 * Note that because of the log function, the dependence on n is
 * quite weak; even at n = 1 billion, a 300*k sample gives <= 0.59
 * bin size error with probability 0.99.  So there's no real need to
 * scale for n, which is a good thing because we don't necessarily
 * know it at this point.

> 3. if at all, an automatic analyze should do the samples on small tables,
> and accurate stats on large tables

Other way 'round, surely?  It already does that: if your table has
fewer rows than the sampling target, they all get used.

> When on the other hand the optimizer does a "mistake" on a huge table
> the difference is easily a matter of hours, thus you want accurate stats.

Not if it takes hours to get the stats.  I'm more interested in keeping
ANALYZE cheap and encouraging DBAs to run it frequently, so that the
stats stay up-to-date.  It doesn't matter how perfect the stats were
when they were made, if the table has changed since then.

> Because we do not want the dba to decide which statistics are optimal,
> there should probably be an analyze helper application that is invoked
> with "vacuum analyze database optimal" or some such, that also decides
> whether a table was sufficiently altered to justify new stats gathering
> or vacuum.

And on what are you going to base "sufficiently altered"?

			regards, tom lane
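To make the arithmetic in the quoted comment easy to reproduce, here is a small standalone C sketch that plugs the stated constants into the bound r = 4 * k * ln(2*n/gamma) / f^2. The helper names `min_sample_size` and `max_bin_error` are illustrative, not taken from analyze.c.

```c
#include <math.h>
#include <stdio.h>

/*
 * Minimum random sample size from Chaudhuri, Motwani & Narasayya
 * (SIGMOD 1998), Corollary 1 to Theorem 5:
 *     r = 4 * k * ln(2*n/gamma) / f^2
 * where n = table size, k = histogram size, f = maximum relative
 * bin-size error, gamma = error probability.
 */
static double
min_sample_size(double n, double k, double f, double gamma)
{
	return 4.0 * k * log(2.0 * n / gamma) / (f * f);
}

/* The same bound solved for f, given a fixed sample size r. */
static double
max_bin_error(double n, double k, double r, double gamma)
{
	return sqrt(4.0 * k * log(2.0 * n / gamma) / r);
}

int
main(void)
{
	double		k = 10.0;	/* default SET STATISTICS target */

	/* f = 0.5, gamma = 0.01, n = 1 million rows */
	printf("r(n = 1e6) = %.2f rows (%.2f * k)\n",
		   min_sample_size(1e6, k, 0.5, 0.01),
		   min_sample_size(1e6, k, 0.5, 0.01) / k);

	/* with a fixed 300*k sample, how bad does f get at n = 1 billion? */
	printf("f(n = 1e9, r = 300*k) = %.3f\n",
		   max_bin_error(1e9, k, 300.0 * k, 0.01));

	return 0;
}
```

Compiled with -lm, this prints roughly 3058 rows (305.82 * k) for the one-million-row case and f of about 0.589 for a 300*k sample at one billion rows, matching the 305.82 * k and <= 0.59 figures quoted above.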