Re: Statistics and selectivity estimation for ranges - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Statistics and selectivity estimation for ranges |
Date | |
Msg-id | 513F51E6.6040004@vmware.com Whole thread Raw |
In response to | Re: Statistics and selectivity estimation for ranges (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses |
Re: Statistics and selectivity estimation for ranges
|
List | pgsql-hackers |
On 01.03.2013 16:22, Alexander Korotkov wrote: > I've been staring at this code for a very long time now, trying to > understand how the math in calc_hist_selectivity_contained works. I > think I understand it now, but it probably needs a lot more comments and > perhaps some refactoring, so that the next reader won't need to spend > hours deciphering it. > > I'll walk through an example of a calc_hist_selectivity_contained > invocation, to verify that my understanding is correct. This isn't 100% > identical to how the function works; I explain it as if it holds some > temporary bins in memory and modifies them in steps, but in reality, it > keeps the bin distances and some other information only for the > current/previous bin it's processing, in local variables. > > Assume that the query is "col <@ int4range(15, 50)", and the lower > bounds histogram is (10, 20, 40, 100, 120). Visually, the histogram > looks like this: > > > Boundary 10 20 40 100 120 > -+------+------+------+------+- > Fraction 0.25 0.25 0.25 0.25 > > Each bin, 10-20, 20-40, 40-100 and 100-120, contains the same > proportion, 25%, of all the tuples in the table. The function first > finds the bins containing the lower and upper bounds, 15 and 55. All the > bins outside those bounds are ignored, as there are no matching tuples > there. The fractions and the bounds of first and last bin, ie. those > containing the lower and upper bounds, are adjusted according to the > boundary values, using linear interpolation. The lower bound, 15, falls > in the middle of the bin 10-20, and the upper bound, 55, splits the > 40-100 bin at ratio of 1/5. The adjusted bins look like this: > > Boundary 15 20 40 55 > -+------+------+------+ > Fraction 0.125 0.25 0.05 > > Next, we need to calculate what proportion of tuples in each bin has a > small enough length to be contained withing the query range. For that, > the distance of each bin boundary to the upper bound is calculated: > > Distance 40 35 15 0 > -+------+------+------+ > Fraction 0.125 0.25 0.05 > > The bins are walked starting from the highest bin, ie. starting from > distance 0, walking up in increasing order of distance. For each bin, > the proportion of tuples within that range that have a suitable length > is calculated. The calc_length_hist_frac function does that. That > calculation is more complicated than it sounds: for example, for the > middle bin above, calc_length_hist_frac is passed both distance > boundaries, 15 and 35. calc_length_hist frac calculates the average of > P(x), when x slides linearly from 15 to 35, where P(x) is the fraction > of tuples with length <= x. > > Now, here's a question, on something I didn't quite understand: > >> * Returns average fraction of histogram which is greater than given >> length. >> * While this length is increasing from length1 to *length2. If histogram >> * ends up before *length2 then set covered fraction of (length1, >> *length2) >> * interval to *fraction and set end of histogram to *length2. >> */ >> static double >> calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues, >> double length1, double *length2, double *fraction) >> { > > Why the behavior explained in the last sentence in the above comment? It > seems that the abstraction provided by calc_length_hist_frac() is leaky; > the caller shouldn't need to know anything about the boundaries of the > length bins. > > Ignoring that, I believe that calc_length_hist_frac can also be > explained like this: > >> /* >> * Let P(x) be the fraction of tuples with length <= x. >> * >> * calc_length_hist_frac calculates the average of P(x), in the >> interval [A, B]. >> * >> * This can be calculated by the formula: >> * >> * B >> * 1 / >> * ------- | P(x)dx >> * B - A / >> * A >> */ >> static double >> calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues, >> double A, double B) > > Am I correct this far? The function doesn't use the above formula as is, > but it could.. > > I'll continue trying to understand this and add comments.. So, after some hacking, I ended up with this version. I find it more readable, I hope I didn't miss anything. This seems to produce results that are close, but not identical, to the original patch. I'm not sure where the discrepancy is coming from, or which patch is more correct in that respect. I'll continue from this tomorrow, but if you have the time, please take a look and let me know what you think. - Heikki
Attachment
pgsql-hackers by date: