Thread: hist boundary duplicates bug in head and 8.3
For heavy tailed distributions, it is possible for analyze to duplicate histogram boundaries. Here is the output from a test against HEAD; I've attached the test data. =# create table bug(f float); COPY 100000 =# copy bug from '/tmp/test_data.txt'; COPY 100000 =# analyze bug; ANALYZE =# select histogram_bounds from pg_stats where tablename='bug'; histogram_bounds ------------------------------------------------------------------------------------------------------------------------------------------------------------ ------------------------------------------------------------------------------------------------------------------------------------------------------------ -------- {34,34,34,34,34,34,34,36,36,36,36,36,37,37,37,37,37,38,38,38,38,39,39,39,39,40,40,40,41,41,41,41,42,42,42,43,43,44,44,44,45,45,45,46,46,46,47,47,48,48,48,4 9,49,50,50,51,51,52,53,53,54,55,55,56,57,58,58,59,60,61,62,63,64,65,67,68,70,72,74,76,79,81,83,86,89,91,94,98,101,105,110,117,122,131,139,152,171,202,236,30 8,1457} (1 row) Analyze assumes that if a value has a sample count less than samplesize/num_buckets, ( maxmincount near line 2170 in commands/analyze.c) then it is safe to include it in a histogram. However, because the histogram only contains non mcv's, this is incorrect. As far as I can see, there are 4 solutions: 1) track all of the distinct values This wouldn't be *too* expensive in analyze, especially considering we are tracking all of the sampled values as it is. However, this opens up the possibility of having huge mcv's lists in the worst case.To see this, consider a distribution such that the most common value was 20% of the table, the next mcv was 20% of the remaining entries, etc. Clearly, for stats targets greater than 5, every value in the table would overrun a histogram boundary, leading to an mcv list that contained every distinct value in the sample. 2) reduce number_of_bins if values exist with frequency greater than 1/nbins This would fix the bug, but at the cost of reducing the utility of the histogram ( it would introduce a large skew to the ndistinct distribution, which is assumed to be uniform over non-mcvs ). 3) use variable width histogram bins over all values. This is probably the cleanest solution, but the most invasive. 4) Fix the binary search in ineqsel to correctly find the boundaries, even with duplicates This would also be relatively clean, but are the hist boundaries assumption of being strictly increasing being satisfied anywhere else besides ineqsel? I've attached a patch that is a compromise between 1 and 2. It puts a hard limit on the number of mcv's at 2x the stats target, and then, if there are still values with too high a frequency, it reduces the number of histogram buckets. -Nathan
Attachment
"Nathan Boley" <npboley@gmail.com> writes: > For heavy tailed distributions, it is possible for analyze to > duplicate histogram boundaries. I don't think this is a bug. You've got values that didn't make it into the MCV list, but nonetheless occupy multiple buckets' worth of space in the remainder of the distribution. They *should* appear multiple times in the histogram. If they didn't, the histogram would be understating their frequency. regards, tom lane
>> For heavy tailed distributions, it is possible for analyze to >> duplicate histogram boundaries. > > I don't think this is a bug. hmmm... Well, I assumed it was a bug from a comment in analyze. From ( near ) line 2130 in analyze.c * least 2 instances in the sample. Also, we won't suppress values* that have a frequency of at least 1/K where K is theintended* number of histogram bins; such values might otherwise cause us to* emit duplicate histogram bin boundaries.*/ If this is expected, I'm also not sure what the use of maxmincount in analyze is... Thanks for the response, Nathan
"Nathan Boley" <npboley@gmail.com> writes: >> I don't think this is a bug. > hmmm... Well, I assumed it was a bug from a comment in analyze. > From ( near ) line 2130 in analyze.c > * least 2 instances in the sample. Also, we won't suppress values > * that have a frequency of at least 1/K where K is the intended > * number of histogram bins; such values might otherwise cause us to > * emit duplicate histogram bin boundaries. That's talking about a case where we have a choice whether to include a value in the MCV list or not. Once the MCV list is maxed out, we can't do anything to avoid duplicates. regards, tom lane
On Tue, 2009-01-06 at 18:40 -0500, Tom Lane wrote: > "Nathan Boley" <npboley@gmail.com> writes: > >> I don't think this is a bug. > > > hmmm... Well, I assumed it was a bug from a comment in analyze. > > > From ( near ) line 2130 in analyze.c > > > * least 2 instances in the sample. Also, we won't suppress values > > * that have a frequency of at least 1/K where K is the intended > > * number of histogram bins; such values might otherwise cause us to > > * emit duplicate histogram bin boundaries. > > That's talking about a case where we have a choice whether to include a > value in the MCV list or not. Once the MCV list is maxed out, we can't > do anything to avoid duplicates. Surely the most important point in the OP was that ineqsel does not correctly binary search in the presence of duplicates. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
> Surely the most important point in the OP was that ineqsel does not > correctly binary search in the presence of duplicates. > It would have been if I were correct :-( . Looking at it again, that was from a bug in my code. Thanks for your time, and sorry about the noise. -Nathan