Re: Statistics and selectivity estimation for ranges - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Statistics and selectivity estimation for ranges
Date
Msg-id 502A72CA.80704@enterprisedb.com
In response to Re: Statistics and selectivity estimation for ranges  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: Statistics and selectivity estimation for ranges
List pgsql-hackers
On 14.08.2012 09:45, Alexander Korotkov wrote:
> After fixing a few more bugs, I have a version with much more
> reasonable accuracy.

Great! One little thing just occurred to me:

You're relying on the regular scalar selectivity estimators for the <<, 
>>, &< and &> operators. That seems bogus, in particular for << and &<, 
because ineq_histogram_selectivity then performs a binary search of the 
histogram using those operators. << and &< compare the *upper* bound of 
the value in the table against the lower bound of the constant, but the 
histogram is constructed using the regular < operator, which sorts the 
entries by lower bound. I think the estimates you now get for those 
operators are quite bogus if there is a non-trivial amount of overlap 
between ranges. For example:

postgres=# create table range_test as
  select int4range(-a, a) as r from generate_series(1,1000000) a;
SELECT 1000000
postgres=# analyze range_test;
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r << int4range(200000, 200001);
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on range_test  (cost=0.00..17906.00 rows=100 width=14) (actual time=0.060..1340.147 rows=200000 loops=1)
   Filter: (r << '[200000,200001)'::int4range)
   Rows Removed by Filter: 800000
 Total runtime: 1371.865 ms
(4 rows)

It would be quite easy to provide reasonable estimates for those 
operators, if we had a separate histogram of upper bounds. I also note 
that the estimation of overlap selectivity could be implemented using 
separate histograms of lower bounds and upper bounds, without requiring 
a histogram of range lengths, because a && b == NOT (a << b OR a >> b). 
I'm not sure if the estimates we'd get that way would be better or worse 
than your current method, but I think it would be easier to understand.
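To illustrate the idea, here is a hypothetical sketch (in Python, not the patch's actual C code) of deriving overlap selectivity from separate lower- and upper-bound histograms via that complement identity. The representation of an equi-depth histogram as a sorted list of bin boundaries, and the function names, are assumptions for the sketch, not what the patch does:

```python
import bisect

def hist_fraction_below(hist, x):
    # Fraction of values < x under an equi-depth histogram given as a
    # sorted list of bin boundaries, with linear interpolation in-bin.
    nbins = len(hist) - 1
    if x <= hist[0]:
        return 0.0
    if x >= hist[-1]:
        return 1.0
    i = bisect.bisect_right(hist, x) - 1          # bin containing x
    frac = (x - hist[i]) / (hist[i + 1] - hist[i])
    return (i + frac) / nbins

def overlap_selectivity(lower_hist, upper_hist, c_lower, c_upper):
    # sel(a && c) = 1 - sel(a << c) - sel(a >> c), where
    #   a << c means upper(a) < lower(c)  -> use upper-bound histogram
    #   a >> c means lower(a) > upper(c)  -> use lower-bound histogram
    sel_before = hist_fraction_below(upper_hist, c_lower)
    sel_after = 1.0 - hist_fraction_below(lower_hist, c_upper)
    return 1.0 - sel_before - sel_after
```

Strictly speaking, << and >> on ranges also take bound inclusivity into account; treating the bounds as plain scalars here is a simplification.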

I don't think the &< and &> operators could be implemented in terms of 
lower and upper bound histograms, though, so you'd still need the 
current length histogram method for those.

The code that traverses the lower bound and length histograms in 
lockstep looks quite scary. Any ideas on how to simplify that? My first 
thought is that there should be a helper function that gets a range 
length as argument, and returns the fraction of tuples with length >= 
argument. It would do the lookup in the length histogram to find the 
right histogram bin, and do the linear interpolation within the bin. 
You're assuming that length is independent of the lower/upper bound, so 
you shouldn't need any parameters other than the range length for that 
estimation.
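Such a helper might look like the following Python sketch (the name and the representation of the length histogram as a sorted list of equi-depth bin boundaries are hypothetical):

```python
import bisect

def frac_longer_than(length_hist, min_len):
    # Fraction of ranges with length >= min_len, estimated from an
    # equi-depth histogram of range lengths (sorted bin boundaries),
    # with linear interpolation within the containing bin.
    nbins = len(length_hist) - 1
    if min_len <= length_hist[0]:
        return 1.0
    if min_len >= length_hist[-1]:
        return 0.0
    i = bisect.bisect_right(length_hist, min_len) - 1
    frac = (min_len - length_hist[i]) / (length_hist[i + 1] - length_hist[i])
    return 1.0 - (i + frac) / nbins
```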

You could then loop through only the lower bounds, and call the helper 
function for each bin to get the fraction of ranges long enough in that 
bin, instead of dealing with both histograms in the same loop. I think a 
helper function like that might simplify those scary loops 
significantly, but I wasn't sure if there's some more intelligence in 
the way you combine values from the length and lower bound histograms 
that couldn't be expressed with such a helper function.
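The simplified loop could then look roughly like this, again as a hypothetical Python sketch rather than the patch's C: each lower-bound bin is represented by its midpoint, and length is assumed independent of the lower bound, as the post says the patch assumes:

```python
import bisect

def frac_longer_than(length_hist, min_len):
    # Helper as suggested above: fraction of ranges with length >=
    # min_len, with linear interpolation within the containing bin.
    nbins = len(length_hist) - 1
    if min_len <= length_hist[0]:
        return 1.0
    if min_len >= length_hist[-1]:
        return 0.0
    i = bisect.bisect_right(length_hist, min_len) - 1
    frac = (min_len - length_hist[i]) / (length_hist[i + 1] - length_hist[i])
    return 1.0 - (i + frac) / nbins

def overlap_sel_lockstep(lower_hist, length_hist, c_lower, c_upper):
    # Walk only the lower-bound bins; for each, ask the helper what
    # fraction of ranges is long enough to reach the constant.  A range
    # with lower bound lo overlaps c when lo < c_upper and its length
    # exceeds c_lower - lo.
    nbins = len(lower_hist) - 1
    total = 0.0
    for i in range(nbins):
        lo = 0.5 * (lower_hist[i] + lower_hist[i + 1])  # bin midpoint
        if lo >= c_upper:
            continue  # whole bin lies to the right of the constant
        total += frac_longer_than(length_hist, c_lower - lo) / nbins
    return total
```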

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


