On Mon, Aug 6, 2012 at 6:09 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 04.08.2012 12:31, Alexander Korotkov wrote:
Hackers,
attached patch is for collecting statistics and selectivity estimation for
ranges.
In order to make our estimations accurate for every distribution of
ranges, we would collect 2d-distribution of lower and upper bounds of range
into some kind of 2d-histogram. However, this patch use some simplification
and assume distribution of lower bound and distribution of length to be
independent.
Sounds reasonable. Another possibility would be to calculate the average length for each lower-bound bin. So you would e.g know the average length of values with lower bound between 1-10, and the average length of values with lower bound between 10-20, and so forth. Within a bin, you would have to assume that the distribution of the lengths is fixed.
Interesting idea. AFAICS, if we store average length for each lower-bound bin, we still have to assume some kind of distribution of range length in order to do estimates. For example, assume that range length have exponential distribution. Correspondingly, we've following trade off: we don't have to assume lower bound distribution to be independent from length distribution, but we have to assume kind of length distribution. Actually, I don't know what is better.
Ideally, we would have range length histogram for each lower-bound bin, or upper-bound histogram for each lower-bound bin. But, storing such amount of data seems too expensive.
------
With best regards,
Alexander Korotkov.