Re: Statistics and selectivity estimation for ranges - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Statistics and selectivity estimation for ranges
Date
Msg-id 513F13FE.3020509@vmware.com
Whole thread Raw
In response to Re: Statistics and selectivity estimation for ranges  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: Statistics and selectivity estimation for ranges
List pgsql-hackers
On 01.03.2013 16:22, Alexander Korotkov wrote:
> These changes were made in attached patch.

Thanks.

I've been staring at this code for a very long time now, trying to 
understand how the math in calc_hist_selectivity_contained works. I 
think I understand it now, but it probably needs a lot more comments and 
perhaps some refactoring, so that the next reader won't need to spend 
hours deciphering it.

I'll walk through an example of a calc_hist_selectivity_contained 
invocation, to verify that my understanding is correct. This isn't 100% 
identical to how the function works; I explain it as if it holds some 
temporary bins in memory and modifies them in steps, but in reality, it 
keeps the bin distances and some other information only for the 
current/previous bin it's processing, in local variables.

Assume that the query is "col <@ int4range(15, 50)", and the lower 
bounds histogram is (10, 20, 40, 100, 120). Visually, the histogram 
looks like this:


Boundary   10     20     40    100    120          -+------+------+------+------+-
Fraction     0.25   0.25   0.25   0.25

Each bin, 10-20, 20-40, 40-100 and 100-120, contains the same 
proportion, 25%, of all the tuples in the table. The function first 
finds the bins containing the lower and upper bounds, 15 and 55. All the 
bins outside those bounds are ignored, as there are no matching tuples 
there. The fractions and the bounds of first and last bin, ie. those 
containing the lower and upper bounds, are adjusted according to the 
boundary values, using linear interpolation. The lower bound, 15, falls 
in the middle of the bin 10-20, and the upper bound, 55, splits the 
40-100 bin at ratio of 1/5. The adjusted bins look like this:

Boundary   15     20     40     55          -+------+------+------+
Fraction     0.125  0.25   0.05

Next, we need to calculate what proportion of tuples in each bin has a 
small enough length to be contained withing the query range. For that, 
the distance of each bin boundary to the upper bound is calculated:

Distance   40     35     15     0          -+------+------+------+
Fraction     0.125  0.25   0.05

The bins are walked starting from the highest bin, ie. starting from 
distance 0, walking up in increasing order of distance. For each bin, 
the proportion of tuples within that range that have a suitable length 
is calculated. The calc_length_hist_frac function does that. That 
calculation is more complicated than it sounds: for example, for the 
middle bin above, calc_length_hist_frac is passed both distance 
boundaries, 15 and 35. calc_length_hist frac calculates the average of 
P(x), when x slides linearly from 15 to 35, where P(x) is the fraction 
of tuples with length <= x.

Now, here's a question, on something I didn't quite understand:

>  * Returns average fraction of histogram which is greater than given length.
>  * While this length is increasing from length1 to *length2. If histogram
>  * ends up before *length2 then set covered fraction of (length1, *length2)
>  * interval to *fraction and set end of histogram to *length2.
>  */
> static double
> calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
>                       double length1, double *length2, double *fraction)
> {

Why the behavior explained in the last sentence in the above comment? It 
seems that the abstraction provided by calc_length_hist_frac() is leaky; 
the caller shouldn't need to know anything about the boundaries of the 
length bins.

Ignoring that, I believe that calc_length_hist_frac can also be 
explained like this:

> /*
>  * Let P(x) be the fraction of tuples with length <= x.
>  *
>  * calc_length_hist_frac calculates the average of P(x), in the interval [A, B].
>  *
>  * This can be calculated by the formula:
>  *
>  *          B
>  *    1     /
>  * -------  | P(x)dx
>  *  B - A   /
>  *          A
>  */
> static double
> calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
>             double A, double B)

Am I correct this far? The function doesn't use the above formula as is, 
but it could..

I'll continue trying to understand this and add comments..

- Heikki



pgsql-hackers by date:

Previous
From: Albe Laurenz
Date:
Subject: Re: Column defaults for foreign tables (was Re: [v9.3] writable foreign tables)
Next
From: Andres Freund
Date:
Subject: Re: transforms