Thread: Odd statistics behaviour in 7.2
Hello, all,

I'm having a strange problem with v7.2 relating to statistics collection
and plan calculation.  I'm not sure if this relates to the problems Marc
was seeing, but here goes.

I have a table with 1,066,673 rows.  The column I'm interested in has
this distribution of values:

 tdnr_ct |   ct
---------+--------
      16 |      1
       4 |      1
       3 |     58
       2 |  68904
       1 | 928171

This means that 'ct' records have 'tdnr_ct' duplicate values.  As you
can see, the index I have on this column is highly selective, and should
be used to look up records based on this column.  In v7.1.3, it always
is.  Under v7.2, it is only sometimes used.

I've looked at the statistics, thanks to what I learned from Tom and
Marc's discussion, and I see that sometimes when I VACUUM ANALYZE the
table, 'n_distinct' for this column gets a value of '-1' (desirable),
and other times a value such as 56596 or something.  This is with the
default setting for the statistics.

Doing a 'SET STATISTICS 40' on the column got me to '-0.106047', which
is better.  But even so, the values do change somewhat over subsequent
runs of VACUUM ANALYZE.  And sometimes I get the coveted '-1'.

The query I'm running is fairly complex.  The difference between getting
the index lookup versus the sequential scan causes an order of magnitude
difference in run time.  The query plans are below.  Same query, no
changes, just the difference in statistics.
The desirable query plan:

Unique  (cost=176572.08..177673.89 rows=3673 width=176)
  ->  Sort  (cost=176572.08..176572.08 rows=36727 width=176)
        ->  Merge Join  (cost=172982.30..173787.35 rows=36727 width=176)
              ->  Sort  (cost=169436.41..169436.41 rows=27883 width=142)
                    ->  Nested Loop  (cost=0.00..167377.66 rows=27883 width=142)
                          ->  Seq Scan on pprv_ticket ptk  (cost=0.00..3345.83 rows=27883 width=125)
                          ->  Index Scan using xie01_cat24 on cat24_ticket_doc_id c24  (cost=0.00..5.87 rows=1 width=17)
              ->  Sort  (cost=3545.89..3545.89 rows=37048 width=34)
                    ->  Seq Scan on pprv_violation pv  (cost=0.00..734.48 rows=37048 width=34)
        SubPlan
          ->  Aggregate  (cost=5.87..5.87 rows=1 width=17)
                ->  Index Scan using xie01_cat24 on cat24_ticket_doc_id  (cost=0.00..5.87 rows=1 width=17)
          ->  Aggregate  (cost=5.88..5.88 rows=1 width=17)
                ->  Index Scan using xie01_cat24 on cat24_ticket_doc_id  (cost=0.00..5.88 rows=1 width=17)

The undesirable query plan:

Unique  (cost=1129322.57..1187392.58 rows=193567 width=176)
  ->  Sort  (cost=1129322.57..1129322.57 rows=1935667 width=176)
        ->  Merge Join  (cost=204226.57..249046.32 rows=1935667 width=176)
              ->  Merge Join  (cost=200135.91..209436.90 rows=525268 width=142)
                    ->  Sort  (cost=6435.89..6435.89 rows=27883 width=125)
                          ->  Seq Scan on pprv_ticket ptk  (cost=0.00..3335.83 rows=27883 width=125)
                    ->  Sort  (cost=193700.02..193700.02 rows=1066173 width=17)
                          ->  Seq Scan on cat24_ticket_doc_id c24  (cost=0.00..50164.73 rows=1066173 width=17)
              ->  Sort  (cost=4090.66..4090.66 rows=37048 width=34)
                    ->  Seq Scan on pprv_violation pv  (cost=0.00..734.48 rows=37048 width=34)
        SubPlan
          ->  Aggregate  (cost=74.72..74.72 rows=1 width=17)
                ->  Index Scan using xie01_cat24 on cat24_ticket_doc_id  (cost=0.00..74.67 rows=19 width=17)
          ->  Aggregate  (cost=29.12..29.12 rows=1 width=17)
                ->  Index Scan using xie07_cat24 on cat24_ticket_doc_id  (cost=0.00..29.12 rows=1 width=17)

I hope I've given enough information that it makes sense.  If there's
anything I can do my end to help figure this out, let me know.

Thanks,

Gordon.
-- "Far and away the best prize that life has to offer is the chance to work hard at work worth doing." -- Theodore Roosevelt
> ... sometimes ... 'n_distinct' for this column
> gets a value of '-1' (desirable), and other times a value such as 56596
> or something.
> This is with the default setting for the statistics.
> Doing a 'SET STATISTICS 40' on the column got me to '-0.106047', which
> is better.  But even so, the values do change somewhat over subsequent
> runs of VACUUM ANALYZE.  And sometimes I get the coveted '-1'.

I'm guessing that your table is not randomly populated, so that the new
"statistically sampled ANALYZE" is sometimes getting a good sample (or
at least one with the result you want) and sometimes getting a terrible
one.

Tom?

- Thomas
Gordon Runkle <gar@www.integrated-dynamics.com> writes:
> You can retrieve the dump of my data at: [snipped]

Thanks.  Indeed, the first time I did an ANALYZE I got:

tablename         | tom_help
attname           | tdnr
null_frac         | 0
avg_width         | 17
n_distinct        | 56484
most_common_vals  | {0557088000957,0557700369880}
most_common_freqs | {0.000666667,0.000666667}
histogram_bounds  | {0551000386411,0551504108858,0557011074656,0557050633939,0557111430036,0557151012769,0557179871119,0557698138188,0557750158740,0557783053444,0558980779763}
correlation       | -0.199108

Now, I happen to know that the default size of ANALYZE's statistical
sample is 3000 rows.  What evidently happened here is that the
statistical sampling picked up two values appearing twice (namely,
0557088000957 and 0557700369880); the given frequencies for these two
values establish that they appeared twice in the 3000-row sample.  Since
no other values are mentioned in most_common_vals, the remaining 2996
samples must have been values that appeared only once.

Having computed these raw statistics, ANALYZE has to try to extrapolate
the number of distinct values in the whole table.  What it's using for
the purpose is an equation I found in "Random sampling for histogram
construction: how much is enough?" by Surajit Chaudhuri, Rajeev Motwani
and Vivek Narasayya, in Proceedings of ACM SIGMOD International
Conference on Management of Data, 1998, pages 436-447; namely

	sqrt(n/r) * max(f1,1) + f2 + f3 + ...

where fk is the number of distinct values that occurred exactly k times
in our sample of r rows (from a total of n).  And indeed you get 56484
when you plug in these numbers.
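[Editorial aside: for concreteness, here is that plug-in worked out as a
short Python sketch.  This is an illustration of the equation, not
PostgreSQL code; the figures are the ones from the ANALYZE sample just
described.]

```python
import math

def chaudhuri_ndistinct(r, n, freq_counts):
    """Chaudhuri et al estimator: sqrt(n/r) * max(f1,1) + f2 + f3 + ...
    freq_counts[k] = number of distinct values seen exactly k times
    in a sample of r rows drawn from n total rows."""
    f1 = freq_counts.get(1, 0)
    rest = sum(fk for k, fk in freq_counts.items() if k >= 2)
    return math.sqrt(n / r) * max(f1, 1) + rest

# The 3000-row sample: 2996 values seen once, 2 values seen twice.
est = chaudhuri_ndistinct(3000, 1066673, {1: 2996, 2: 2})
print(round(est))  # lands right around the 56484 that ANALYZE reported
```

The table actually has on the order of 997,000 distinct values (per the
distribution posted earlier), so the estimate is low by more than a
factor of 10 even though the sample itself is perfectly reasonable.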
So the code is operating as designed, and we can't complain that the
sample is an unreasonable sample given the true underlying distribution.
We have to blame the equation: evidently this estimation equation
doesn't apply very well to nearly-unique columns.

I had already modified the Chaudhuri approach slightly: if the ANALYZE
sample contains no duplicate values at all (ie, f1=r, f2=f3=f4=...=0)
then their equation reduces to sqrt(n*r), but actually ANALYZE assumes
the column is unique (ie, n distinct values, not sqrt(n*r)), which seems
a lot better assumption in practice.  The runs where you got n_distinct
equal to -1 are presumably those where the ANALYZE sample chanced to
contain no duplicates.

I am thinking that we need to find another estimator equation, or at
least shade away from their equation when f1 is close to r.  Ideally the
estimate for f1=r should be a logical extrapolation of the curve for f1
close to r, but right now it's quite discontinuous.

There was some previous discussion about this, cf the thread at
http://archives.postgresql.org/pgsql-general/2001-10/msg01032.php
but nothing really emerged on how to do better.  The Chaudhuri paper
points out that estimating the total number of distinct values from a
sample is inherently a hard problem and subject to large estimation
errors, so it may be that we can't do a lot better.  I think we should
be wary of ad-hoc answers, anyhow.  Something with a little math behind
it would make me feel more comfortable.

Anyone have any thoughts on how to do better?

			regards, tom lane
On Wed, 2002-02-13 at 13:02, Tom Lane wrote:
[ Lots of enlightening info ]

Thanks, Tom!

Would it be fair to say that the correct workaround for now would be to
use ALTER TABLE SET STATISTICS on columns of interest which have this
near-unique characteristic?

Does ALTER TABLE SET STATISTICS only increase the histogram size, or
does it also cause more rows to be sampled?  If not, how would one
increase the sample size (assuming this would be desirable)?

I have quite a few tables that have columns of interest with near-unique
values, so I'm keenly interested.

I'm sorry I didn't have time to do all this testing during the beta
phase, there just wasn't time.  :-(

Thanks again, Gordon.
--
"Far and away the best prize that life has to offer is the chance to
work hard at work worth doing." -- Theodore Roosevelt
"Gordon A. Runkle" <gar@integrated-dynamics.com> writes:
> Would it be fair to say that the correct workaround for now would
> be to use ALTER TABLE SET STATISTICS on columns of interest which have
> this near-unique characteristic?

Yeah, that's probably the best we can do until we can think of a better
estimation equation.

> Does ALTER TABLE SET STATISTICS only increase the histogram size, or
> does it also cause more rows to be sampled?

Both.  The Chaudhuri paper I referred to has some math purporting to
prove that the required sample size is directly proportional to the
histogram size, for fixed relative error in the histogram boundaries.
So I made the same parameter control both.

Actually, the sample size is driven by the largest SET STATISTICS value
for any column of the table.  So you can pick which one you think a
larger histogram would be most useful for; it doesn't have to be the
same column that's got the bad-number-of-distinct-values problem.  Which
columns, if any, do you do range queries on?  Those would be the ones
where a bigger histogram would be useful.

			regards, tom lane
"Gordon A. Runkle" <gar@integrated-dynamics.com> writes:
> [ tale of poor statistical results in a near-unique column ]

After some further digging in the literature, I have come across a
number-of-distinct-values estimator that I like better than
Chaudhuri's.  It runs like this:

	n*d / (n - f1 + f1*n/N)

where f1 is the number of distinct values that occurred exactly once in
our sample of n rows (from a total of N), and d is the total number of
distinct values in the sample.

The nice properties this has include:

1. At f1=0 (all values in the sample occurred more than once), the
estimate reduces to d, which I had already determined to be a sensible
result.

2. At f1=n (all values were distinct, hence also d=n), the estimate
reduces to N, which again is the most reasonable estimate.

3. The equation is numerically stable even when n is much smaller than
N, because the only cancellation is in the term (n - f1), which we can
compute exactly.  A lot of the other equations I looked at depend on
series like (1 - n/N)**i which are going to be really nasty when n/N
is tiny.

In particular, point 2 means that this equation should perform
reasonably well for nearly-unique columns (f1 approaching n), which was
the case you were seeing bad results for.

Attached is a patch that implements this revised equation.  Would
appreciate any feedback...

			regards, tom lane

*** src/backend/commands/analyze.c.orig	Sat Jan  5 19:37:44 2002
--- src/backend/commands/analyze.c	Fri Feb 15 18:46:45 2002
***************
*** 1009,1018 ****
  	{
  		/*----------
  		 * Estimate the number of distinct values using the estimator
! 		 * proposed by Chaudhuri et al (see citation above).  This is
! 		 *		sqrt(n/r) * max(f1,1) + f2 + f3 + ...
! 		 * where fk is the number of distinct values that occurred
! 		 * exactly k times in our sample of r rows (from a total of n).
  		 * We assume (not very reliably!) that all the multiply-occurring
  		 * values are reflected in the final track[] list, and the other
  		 * nonnull values all appeared but once.  (XXX this usually
--- 1009,1023 ----
  	{
  		/*----------
  		 * Estimate the number of distinct values using the estimator
! 		 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
! 		 *		n*d / (n - f1 + f1*n/N)
! 		 * where f1 is the number of distinct values that occurred
! 		 * exactly once in our sample of n rows (from a total of N),
! 		 * and d is the total number of distinct values in the sample.
! 		 * This is their Duj1 estimator; the other estimators they
! 		 * recommend are considerably more complex, and are numerically
! 		 * very unstable when n is much smaller than N.
! 		 *
  		 * We assume (not very reliably!) that all the multiply-occurring
  		 * values are reflected in the final track[] list, and the other
  		 * nonnull values all appeared but once.  (XXX this usually
***************
*** 1021,1032 ****
  		 *----------
  		 */
  		int			f1 = nonnull_cnt - summultiple;
! 		double		term1;
! 
! 		if (f1 < 1)
! 			f1 = 1;
! 		term1 = sqrt(totalrows / (double) numrows) * f1;
! 		stats->stadistinct = floor(term1 + nmultiple + 0.5);
  	}
  
  	/*
--- 1026,1044 ----
  		 *----------
  		 */
  		int			f1 = nonnull_cnt - summultiple;
! 		int			d = f1 + nmultiple;
! 		double		numer,
! 					denom,
! 					stadistinct;
! 
! 		numer = (double) numrows * (double) d;
! 		denom = (double) (numrows - f1) +
! 			(double) f1 * (double) numrows / totalrows;
! 		stadistinct = numer / denom;
! 		/* Clamp to sane range in case of roundoff error */
! 		if (stadistinct < (double) d)
! 			stadistinct = (double) d;
! 		if (stadistinct > totalrows)
! 			stadistinct = totalrows;
! 		stats->stadistinct = floor(stadistinct + 0.5);
  	}
  
  	/*
***************
*** 1313,1332 ****
  	{
  		/*----------
  		 * Estimate the number of distinct values using the estimator
! 		 * proposed by Chaudhuri et al (see citation above).  This is
! 		 *		sqrt(n/r) * max(f1,1) + f2 + f3 + ...
! 		 * where fk is the number of distinct values that occurred
! 		 * exactly k times in our sample of r rows (from a total of n).
  		 * Overwidth values are assumed to have been distinct.
  		 *----------
  		 */
  		int			f1 = ndistinct - nmultiple + toowide_cnt;
! 		double		term1;
! 
! 		if (f1 < 1)
! 			f1 = 1;
! 		term1 = sqrt(totalrows / (double) numrows) * f1;
! 		stats->stadistinct = floor(term1 + nmultiple + 0.5);
  	}
  
  	/*
--- 1325,1356 ----
  	{
  		/*----------
  		 * Estimate the number of distinct values using the estimator
! 		 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
! 		 *		n*d / (n - f1 + f1*n/N)
! 		 * where f1 is the number of distinct values that occurred
! 		 * exactly once in our sample of n rows (from a total of N),
! 		 * and d is the total number of distinct values in the sample.
! 		 * This is their Duj1 estimator; the other estimators they
! 		 * recommend are considerably more complex, and are numerically
! 		 * very unstable when n is much smaller than N.
! 		 *
  		 * Overwidth values are assumed to have been distinct.
  		 *----------
  		 */
  		int			f1 = ndistinct - nmultiple + toowide_cnt;
! 		int			d = f1 + nmultiple;
! 		double		numer,
! 					denom,
! 					stadistinct;
! 
! 		numer = (double) numrows * (double) d;
! 		denom = (double) (numrows - f1) +
! 			(double) f1 * (double) numrows / totalrows;
! 		stadistinct = numer / denom;
! 		/* Clamp to sane range in case of roundoff error */
! 		if (stadistinct < (double) d)
! 			stadistinct = (double) d;
! 		if (stadistinct > totalrows)
! 			stadistinct = totalrows;
! 		stats->stadistinct = floor(stadistinct + 0.5);
  	}
  
  	/*
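[Editorial aside: the Duj1 arithmetic in the patch can be cross-checked
outside the C code with a short Python sketch.  Variable names follow
the patch; this is an illustration, not the actual implementation.]

```python
def duj1_ndistinct(numrows, totalrows, f1, nmultiple):
    """Haas/Stokes Duj1 estimator: n*d / (n - f1 + f1*n/N),
    clamped to the range [d, N] as the patch does."""
    d = f1 + nmultiple
    numer = numrows * d
    denom = (numrows - f1) + f1 * numrows / totalrows
    stadistinct = numer / denom
    stadistinct = max(stadistinct, d)          # at least the d values seen
    stadistinct = min(stadistinct, totalrows)  # at most the table size
    return int(stadistinct + 0.5)              # floor(x + 0.5)

# The earlier sample: 3000 rows, two duplicated values (f1=2996, nmultiple=2).
# The estimate now lands in the hundreds of thousands, in the right ballpark,
# where the Chaudhuri equation gave ~56000.
print(duj1_ndistinct(3000, 1066673, 2996, 2))

# A sample with no duplicates at all (f1=n, d=n): the estimate is exactly N,
# matching property 2 above.
print(duj1_ndistinct(3000, 1066673, 3000, 0))
```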
On Fri, 2002-02-15 at 19:09, Tom Lane wrote:
> "Gordon A. Runkle" <gar@integrated-dynamics.com> writes:
> > [ tale of poor statistical results in a near-unique column ]
> [ new algorithm and patch ]

Hi Tom,

This works much better (I did SET STATISTICS back to 10).  Not always
'-1', but almost always negative.  Is "-0.503824" the same as "503824
with a predicted increase in the number of distinct values" (as opposed
to using "-503824")?

I checked a couple of other tables for which my field is a bit less
nearly-unique, and they have better stats now also.  I've attached a
file with distributions and stats, if you're interested.

And, getting to where the rubber meets the road, my queries are running
well again!  I'm running a batch of other queries (which were unaffected
by this problem) overnight to see if they still work well.

Are you planning to include this patch in v7.2.1, or would it require
too much testing by others?

Thanks for all your hard work, it is greatly appreciated,

Gordon.
--
"Far and away the best prize that life has to offer is the chance to
work hard at work worth doing." -- Theodore Roosevelt
"Gordon A. Runkle" <gar@integrated-dynamics.com> writes:
> Is "-0.503824" the same as "503824 with a predicted increase in the
> number of distinct values" (as opposed to using "-503824")?

No, it means "0.503824 times the number of rows in the table".  Although
your table was ~1 million rows, so that's approximately right in your
case.

Given the stats you cited, the exactly correct stadistinct value would
be -0.9348085.  In testing I got -1, -0.808612, -0.678641, or once
-0.584611 from your data, depending on whether the sample chanced to
find none, one, two, or three repeated values.  Any of these strike me
as plenty close enough for statistical purposes.  But the Chaudhuri
estimator was off by more than a factor of 10.

> Are you planning to include this patch in v7.2.1, or would it require
> too much testing by others?

I'm going to put it in 7.2.1 unless there are objections.

			regards, tom lane
BTW, while we're thinking about this, there's another aspect of the
number-of-distinct-values estimator that could use some peer review.
That's the decision whether to assume that the number of distinct values
in a column is fixed, or will vary with the size of the table.  (For
example, in a boolean column, ndistinct should clearly be 2 no matter
how large the table gets; but in any unique column ndistinct should
equal the table size.)  This is important since there are times when we
update the table size estimate (pg_class.reltuples) without recomputing
the statistics in pg_statistic.  The "negative stadistinct" convention
in pg_statistic is used to signal which case ANALYZE thinks applies.

Presently the decision is pretty simplistic: if the estimated number of
distinct values is more than 10% of the number of rows, then assume the
number of distinct values scales with the number of rows.  I believe
that some rule of this form is reasonable, but the 10% threshold was
just picked out of the air.  Can anyone suggest an argument in favor of
some other value, or a better way to look at it?

			regards, tom lane
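[Editorial aside: the sign convention and the 10% cutoff described above
amount to something like the following Python sketch.  It is
illustrative only, not the actual ANALYZE code; the function names are
made up.]

```python
def encode_stadistinct(est_distinct, totalrows, threshold=0.1):
    """Encode an ndistinct estimate the way pg_statistic does:
    positive = a fixed count of distinct values;
    negative = minus the fraction of rows, meaning ndistinct is assumed
    to scale with the table size.  The 10% threshold is the arbitrary
    cutoff under discussion."""
    if est_distinct > threshold * totalrows:
        return -est_distinct / totalrows
    return est_distinct

def decode_stadistinct(stadistinct, reltuples):
    """Recover an ndistinct estimate using the current table size."""
    if stadistinct < 0:
        return -stadistinct * reltuples
    return stadistinct

# A boolean-like column: 2 distinct values stays fixed as the table grows.
print(encode_stadistinct(2, 1_000_000))

# A near-unique column: stored as a negative fraction; -0.503824 means
# "0.503824 times the number of rows in the table", so the estimate
# tracks reltuples as the table grows.
print(encode_stadistinct(503_824, 1_000_000))
print(decode_stadistinct(-0.503824, 2_000_000))
```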
On Fri, Feb 15, 2002 at 07:09:42PM -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> 3. The equation is numerically stable even when n is much smaller than
> N, because the only cancellation is in the term (n - f1) which we can
> compute exactly.  A lot of the other equations I looked at depend on
> series like (1 - n/N)**i which are going to be really nasty when n/N
> is tiny.

You can work with the above for n << N by using power series.  For
n << N, (1 - n/N)**i ~= 1 - i*n/N.
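[Editorial aside: a quick numeric check of that first-order truncation,
with made-up figures in the n << N regime it assumes.]

```python
def exact(n, N, i):
    # the troublesome series term as written: (1 - n/N)**i
    return (1 - n / N) ** i

def first_order(n, N, i):
    # power-series truncation, valid for n << N: (1 - n/N)**i ~= 1 - i*n/N
    return 1 - i * n / N

# With n/N tiny the truncation tracks the exact value closely.
n, N = 3000, 1_000_000
for i in (1, 2, 5):
    print(i, exact(n, N, i), first_order(n, N, i))
```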
Bruno Wolff III <bruno@wolff.to> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A lot of the other equations I looked at depend on
>> series like (1 - n/N)**i which are going to be really nasty when n/N
>> is tiny.

> You can work with the above for n << N by using power series.  For
> n << N, (1 - n/N)**i ~= 1 - i*n/N.

Yeah, but the pain-in-the-neck aspect comes from the fact that n/N isn't
necessarily tiny --- it could approach 1.  So you'd need code smart
enough to handle both cases with accuracy.  We could probably do this if
it proves necessary, but I'd like to stick to simpler equations if at
all possible.

			regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> It would seem that if you could determine if the number of distinct
> values is _increasing_ as you scan more rows, that an increase in table
> size would also cause an increase, e.g. if you have X distinct values
> looking at N rows, and 2X distinct values looking at 2N rows, that
> clearly would show a scale.

[ thinks for awhile... ]  I don't think that'll help.  You could not
expect an exact 2:1 increase, except in the case of a simple unique
column, which isn't the problem anyway.  So the above would really have
to be coded as "count the number of distinct values in the sample (d1)
and the number in half of the sample (d2); then if d1/d2 >= X assume the
number of distinct values scales".  X is a constant somewhere between 1
and 2, but where?  I think you've only managed to trade one arbitrary
threshold for another one.

A more serious problem is that the above could easily be fooled by a
distribution that contains a few very-popular values and a larger number
of seldom-seen ones.  Consider for example a column "number of children"
over a database of families.  In a sample of a thousand or so, you might
well see only values 0..4 (or so); if you double the size of the sample,
and find a few rows with 5 to 10 kids, are you then correct to label the
column as scaling with the size of the database?

			regards, tom lane
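[Editorial aside: the d1/d2 heuristic and the "number of children"
objection are easy to see in a toy simulation.  The distributions and
the helper below are made up for illustration; nothing here is from
ANALYZE.]

```python
import random

def distinct_growth_ratio(population, sample_size, seed=42):
    """d1/d2: distinct values in a full sample vs. in its first half."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    sample = [rng.choice(population) for _ in range(sample_size)]
    d2 = len(set(sample[: sample_size // 2]))
    d1 = len(set(sample))
    return d1 / d2

# "Number of children" style column: a few very popular values
# (0..4) plus a handful of rare ones (5..10).
children = ([0] * 300 + [1] * 300 + [2] * 250 + [3] * 100 + [4] * 40
            + [5, 6, 7, 8, 9, 10])

# A near-unique column, where d1/d2 should approach 2.
unique_ids = list(range(1_000_000))

# The children column's ratio stays well below the unique column's,
# yet doubling the sample does keep turning up "new" rare values --
# exactly the ambiguity Tom describes above.
print(distinct_growth_ratio(children, 2000))
print(distinct_growth_ratio(unique_ids, 2000))
```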