Re: Statistics and selectivity estimation for ranges - Mailing list pgsql-hackers
From: Heikki Linnakangas
Subject: Re: Statistics and selectivity estimation for ranges
Msg-id: 502A72CA.80704@enterprisedb.com
In response to: Re: Statistics and selectivity estimation for ranges (Alexander Korotkov <aekorotkov@gmail.com>)
Responses: Re: Statistics and selectivity estimation for ranges
List: pgsql-hackers
On 14.08.2012 09:45, Alexander Korotkov wrote:
> After fixing few more bugs, I've a version with much more reasonable
> accuracy.

Great!

One little thing just occurred to me: you're relying on the regular scalar selectivity estimators for the <<, >>, &< and &> operators. That seems bogus, in particular for << and &<, because ineq_histogram_selectivity then performs a binary search of the histogram using those operators. << and &< compare the *upper* bound of the value in the table against the lower bound of the constant, but the histogram is constructed using the regular < operator, which sorts the entries by lower bound. I think the estimates you now get for those operators are quite bogus if there is a non-trivial amount of overlap between ranges. For example:

postgres=# create table range_test as
  select int4range(-a, a) as r from generate_series(1,1000000) a;
SELECT 1000000
postgres=# analyze range_test;
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r << int4range(200000, 200001);
                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on range_test  (cost=0.00..17906.00 rows=100 width=14) (actual time=0.060..1340.147 rows=200000 loops=1)
   Filter: (r << '[200000,200001)'::int4range)
   Rows Removed by Filter: 800000
 Total runtime: 1371.865 ms
(4 rows)

It would be quite easy to provide reasonable estimates for those operators if we had a separate histogram of upper bounds. I also note that the estimation of overlap selectivity could be implemented using separate histograms of lower bounds and upper bounds, without requiring a histogram of range lengths, because a && b == NOT (a << b OR a >> b). I'm not sure if the estimates we'd get that way would be better or worse than your current method, but I think it would be easier to understand.
I don't think the &< and &> operators could be implemented in terms of a lower and upper bound histogram, though, so you'd still need the current length histogram method for those.

The code that traverses the lower bound and length histograms in lockstep looks quite scary. Any ideas on how to simplify it? My first thought is that there should be a helper function that takes a range length as argument and returns the fraction of tuples with length >= argument. It would do the lookup in the length histogram to find the right histogram bin, and do the linear interpolation within the bin. You're assuming that length is independent of the lower/upper bound, so you shouldn't need any parameters other than the range length for that estimation. You could then loop through only the lower bounds, and call the helper function for each bin to get the fraction of ranges long enough in that bin, instead of dealing with both histograms in the same loop. I think a helper function like that might simplify those scary loops significantly, but I wasn't sure whether there's some more intelligence in the way you combine values from the length and lower bound histograms that couldn't be done with such a helper function.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com