Re: Improving N-Distinct estimation by ANALYZE - Mailing list pgsql-hackers
| From | Josh Berkus |
|---|---|
| Subject | Re: Improving N-Distinct estimation by ANALYZE |
| Date | |
| Msg-id | 200601041607.37590.josh@agliodbs.com |
| In response to | Improving N-Distinct estimation by ANALYZE (Simon Riggs <simon@2ndquadrant.com>) |
| Responses | Re: Improving N-Distinct estimation by ANALYZE |
| List | pgsql-hackers |
Simon,

> - Are there any performance issues that can be directly attributed to
> mis-estimation of N-Distinct ("D") by the ANALYZE command?

Yes. There's at least one query (maybe two) from TPC-H which bombs because of
bad N-distinct estimation, even with stats_target = 1000. Based on my
experience with data warehouses, this also occurs in the field.

> - If so, can we do better than we currently achieve? How?

Replace the current algorithm and broaden the sample.

> - Would altering the estimate of D cause problems in other places?

Unlike index cost estimation, I wouldn't expect it to. We make pretty "proper"
use of D right now; it's just that for some common cases our estimates of D
are bad.

> The estimation of D is difficult and imprecise. The current method works
> well in many cases, yet breaks down *badly* in one particular very
> common use case of database design: a large dependent table with a
> multi-column Primary Key. In some cases the estimate of D *decreases* as
> the size of the table increases, though the estimate of D is an
> underestimate in almost all cases, whatever the table design.

Actually, the current estimator underestimates D for *any* large table's
high-cardinality columns, primary key, multi-column, or not.

Chaudhuri's calculation seems designed to yield the smallest number of
distinct values that could reasonably be expected to have produced the
observed sample. That is, if the estimate range within two standard deviations
runs from 50,000 to 500,000 distinct values, Chaudhuri picks 50,000.

This conservative approach makes sense when you're only considering join
strategies: given an unreliable estimate, you want to estimate D low so that
you don't wrongly choose a nested loop, because that mistake costs far more
than performing an unnecessary hash join. It's "conservative" in that sense.

However, PostgreSQL now has a whole set of hash operations and other query
types for which a too-low estimate of D causes query lockup. For those
operations, Chaudhuri ceases to be conservative and becomes high-risk. FWIW,
my testing with TPC-H showed that estimation error is usually OK within
+/- 5x; beyond that you start to get bad query plans. (Yes, I know all of the
above begs for examples. I'm swamped. I believe I posted examples when I first
started talking about n-distinct estimation a year ago.)

So I think it's vital that we look at algorithms designed to deliver the
median estimate of D, not the lowest reasonable one, in addition to increasing
the sample size. The block-based estimator functions which Andrew and I looked
at seem designed to do that, provided a sample of between 1% and 10%.
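To make the underestimation concrete, here is a minimal sketch (Python) of an
estimator of the Haas/Stokes "Duj1" form that ANALYZE applies,
n*d / (n - f1 + f1*n/N). The table size, sample size, and per-sample counts
below are invented purely for illustration; they are not TPC-H measurements.

```python
def ndistinct_estimate(n, N, d, f1):
    """Duj1-style estimate of the number of distinct values.
       n  = rows in the sample          N  = rows in the table
       d  = distinct values in sample   f1 = values seen exactly once"""
    if f1 >= n:
        # every sampled value was unique: treat the column as unique
        return N
    return (n * d) / ((n - f1) + f1 * n / N)

N = 600_000_000   # a large table (illustrative)
n = 30_000        # fixed-size sample (300 * stats target; 30,000 at target 100)

# A skewed column: a few hot values soak up repeated sample rows, so f1
# drops well below n, even though the unsampled tail may hold millions
# of rare distinct values.
print(ndistinct_estimate(n, N, d=25_000, f1=20_000))   # ~75,000
```

Because the sample fraction n/N is tiny, the denominator is dominated by
(n - f1), so the estimate stays within a small multiple of the distinct count
actually seen in the sample -- which is why it behaves like the low-end,
"least risk" pick rather than something near the middle of the plausible
range.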
> 1. *All* methods of statistical analysis are improved by larger sample
> fractions. The D estimator method currently in use shows an optimum of
> accuracy and sample fraction at around 5% of a table, as shown in the
> author's original paper [Haas Stokes (1998)]. The current
> implementation's error rates climb higher as table size increases.

I read 5 different papers on ACM about sampling for D. All of them were united
in saying that you can't get even slightly accurate estimates with less than
3% sampling.

> 2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a
> row sampling technique rather than a block sampling technique. This
> would translate directly into a performance drop from large sample
> ratios, but since we currently use a fixed sample size this problem is
> not yet visible for larger tables. With a 2GB table, we would typically
> sample 1% of the blocks, yet around 0.025 - 0.05% of the rows.

This would be a reason to use block sampling ONLY, rather than hybrid
sampling.

> 3. Large values of statistics target (up to 1000) could cause a number
> of problems with statistics catalog table growth (mentioned on
> -perform). Setting these values appropriately can take significant
> effort. Automatic scaling of such parameters is desirable.

Well, decoupling the n-distinct sample size from the # of histogram rows
would fix that.

> 4. ANALYZE doesn't use more memory if maintenance_work_mem is set high,
> nor does it use less if it is set low. Actual memory usage isn't
> measured at all. With very long rows this is max 24 MB with default
> stats target settings and BLCKSZ, or 2.4 GB with highest stats target
> (1000). This probably is of lower importance since stats targets are
> only usually set higher on larger databases, which typically have larger
> memory configurations anyway - and very long rows are uncommon because
> of TOAST. Typical memory usage by ANALYZE would be < 1 MB with default
> settings i.e. maybe as low as a 0.01% sample for a very large table.

Yeah, I think I pointed out a year ago that ANALYZE doesn't seem to be using
maintenance_work_mem. It should, although there should be an option to use
less than maintenance_work_mem if the user wants a low-impact ANALYZE.

> There is a further effect of concern here. We currently apply a "lift"
> to the D estimate when we think that the number of distinct values is
> high enough to imply that it should be scaled according to N. In the
> above example, when sample ratio is too small, D will hit the point
> where it is too low to be scaled and we suddenly "bomb out" to a much
> lower value.

Yes, this is another problem with the current algorithm. This kind of
thresholding is, well, hackish. More importantly, it leads to unpredictable
query behavior, since a query on a table only 10 rows larger yields a
radically different plan at the edge of the threshold. (A small sketch of this
edge effect follows at the end of this message.)

> The test results don't seem too bad if you view the estimate of D as at
> most a factor of 10 wrong. However, since the error scales up with the
> size of the table, we can imagine very large estimation errors.

Yes. My tests showed that for a 100GB TPC-H database, with 600 million rows in
Lineitem, D was on average 30x too low and could not be less than 10x low even
with the luckiest sample. This misestimate gets worse as the table gets
larger.

> Chaudhuri's estimator is based on a least risk approach, rather than a
> greatest accuracy approach, which does sound appealing should we not be
> able to apply an improved estimator.

As I point out above, though, Chaudhuri's understanding of "least risk" is
flawed.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
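A minimal sketch of the thresholding edge effect mentioned above, assuming the
rule that an estimate exceeding 10% of the row count is recorded as a negative
fraction of N (so it scales as the table grows), while anything below that is
stored as an absolute count. The figures are illustrative only.

```python
def store_stadistinct(d_estimate, total_rows, threshold=0.10):
    """Sketch of how the D estimate is recorded (pg_statistic.stadistinct):
       negative -> a fraction of the current row count, scales with N
       positive -> an absolute count, frozen at ANALYZE time"""
    if d_estimate > threshold * total_rows:
        return -(d_estimate / total_rows)
    return float(d_estimate)

N = 1_000_000
print(store_stadistinct(100_100, N))   # -0.1001   (scales as the table grows)
print(store_stadistinct( 99_900, N))   #  99900.0  (stays fixed forever)
# A sample that nudges the estimate across this edge flips the stored form,
# which is the sudden "bomb out" and plan instability described above.
```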