Re: Statistics and selectivity estimation for ranges - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Statistics and selectivity estimation for ranges |
Date | |
Msg-id | 513F13FE.3020509@vmware.com Whole thread Raw |
In response to | Re: Statistics and selectivity estimation for ranges (Alexander Korotkov <aekorotkov@gmail.com>) |
Responses |
Re: Statistics and selectivity estimation for ranges
|
List | pgsql-hackers |
On 01.03.2013 16:22, Alexander Korotkov wrote: > These changes were made in attached patch. Thanks. I've been staring at this code for a very long time now, trying to understand how the math in calc_hist_selectivity_contained works. I think I understand it now, but it probably needs a lot more comments and perhaps some refactoring, so that the next reader won't need to spend hours deciphering it. I'll walk through an example of a calc_hist_selectivity_contained invocation, to verify that my understanding is correct. This isn't 100% identical to how the function works; I explain it as if it holds some temporary bins in memory and modifies them in steps, but in reality, it keeps the bin distances and some other information only for the current/previous bin it's processing, in local variables. Assume that the query is "col <@ int4range(15, 50)", and the lower bounds histogram is (10, 20, 40, 100, 120). Visually, the histogram looks like this: Boundary 10 20 40 100 120 -+------+------+------+------+- Fraction 0.25 0.25 0.25 0.25 Each bin, 10-20, 20-40, 40-100 and 100-120, contains the same proportion, 25%, of all the tuples in the table. The function first finds the bins containing the lower and upper bounds, 15 and 55. All the bins outside those bounds are ignored, as there are no matching tuples there. The fractions and the bounds of first and last bin, ie. those containing the lower and upper bounds, are adjusted according to the boundary values, using linear interpolation. The lower bound, 15, falls in the middle of the bin 10-20, and the upper bound, 55, splits the 40-100 bin at ratio of 1/5. The adjusted bins look like this: Boundary 15 20 40 55 -+------+------+------+ Fraction 0.125 0.25 0.05 Next, we need to calculate what proportion of tuples in each bin has a small enough length to be contained withing the query range. For that, the distance of each bin boundary to the upper bound is calculated: Distance 40 35 15 0 -+------+------+------+ Fraction 0.125 0.25 0.05 The bins are walked starting from the highest bin, ie. starting from distance 0, walking up in increasing order of distance. For each bin, the proportion of tuples within that range that have a suitable length is calculated. The calc_length_hist_frac function does that. That calculation is more complicated than it sounds: for example, for the middle bin above, calc_length_hist_frac is passed both distance boundaries, 15 and 35. calc_length_hist frac calculates the average of P(x), when x slides linearly from 15 to 35, where P(x) is the fraction of tuples with length <= x. Now, here's a question, on something I didn't quite understand: > * Returns average fraction of histogram which is greater than given length. > * While this length is increasing from length1 to *length2. If histogram > * ends up before *length2 then set covered fraction of (length1, *length2) > * interval to *fraction and set end of histogram to *length2. > */ > static double > calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues, > double length1, double *length2, double *fraction) > { Why the behavior explained in the last sentence in the above comment? It seems that the abstraction provided by calc_length_hist_frac() is leaky; the caller shouldn't need to know anything about the boundaries of the length bins. Ignoring that, I believe that calc_length_hist_frac can also be explained like this: > /* > * Let P(x) be the fraction of tuples with length <= x. > * > * calc_length_hist_frac calculates the average of P(x), in the interval [A, B]. > * > * This can be calculated by the formula: > * > * B > * 1 / > * ------- | P(x)dx > * B - A / > * A > */ > static double > calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues, > double A, double B) Am I correct this far? The function doesn't use the above formula as is, but it could.. I'll continue trying to understand this and add comments.. - Heikki
pgsql-hackers by date: