Thread: patternsel() and histogram_selectivity() and the hard cutoff of 100
So I had a thought about how to soften the controversial hard cutoff of 100
for the use of the histogram selectivity. Instead of switching 100% one way or
the other between the two heuristics why not calculate both and combine them.
The larger the sample size from the histogram the more we can weight the
histogram calculation. The smaller the histogram size the more we weight the
heuristic.

My first thought was to scale it linearly so we use 10% of the histogram
sample + 90% of the heuristic for default statistic sizes of 10 samples. That
degenerates to the status quo for 100 samples and up.

But actually I wonder if we can't get some solid statistics behind the
percentages. The lower the sample size, the larger the 95% confidence
interval, and we want to use the heuristic to adjust the result within that
confidence interval. I think there are even ways of calculating pessimistic
confidence intervals for unrepresentative samples.

This would allow the results to degrade smoothly. A sample size of 50 would be
expected to give results somewhere between those of 10 and 100, instead of the
current behaviour where the results are exactly the same until you hit 100 and
then suddenly jump to a different value.

I would do it by just making histogram_selectivity() never fail unless the
histogram is smaller than 2 * the ignored values. There would be an additional
double* parameter where the function would store the percentage weight to give
the result. The caller would be responsible for combining the result, just as
it is with the MCV estimates.

--
Gregory Stark
EnterpriseDB  http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
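[For illustration, here is a minimal standalone sketch of the linear blend
described above, assuming a toy estimator that reports its weight through a
double* out parameter. The function names and the 0.11/0.20 selectivities are
made up for this example; this is not the actual selfuncs.c code.]

    #include <stdio.h>

    /*
     * Toy estimator following the proposal: never "fail", just return the
     * histogram-based selectivity and store in *weight how much the caller
     * should trust it (0.1 at 10 histogram entries, ramping linearly to
     * 1.0 at 100 entries and above).
     */
    static double
    toy_histogram_selectivity(int nhist, double *weight)
    {
        *weight = (nhist >= 100) ? 1.0 : nhist / 100.0;
        return 0.11;            /* pretend the histogram says 11% match */
    }

    int
    main(void)
    {
        double  heuristic_sel = 0.20;   /* pretend the heuristic says 20% */
        int     sizes[] = {10, 20, 50, 70, 90, 100, 200};
        int     i;

        for (i = 0; i < 7; i++)
        {
            double  w;
            double  hist_sel = toy_histogram_selectivity(sizes[i], &w);
            double  selec = w * hist_sel + (1.0 - w) * heuristic_sel;

            printf("histogram entries = %3d -> blended selectivity %.4f\n",
                   sizes[i], selec);
        }
        return 0;
    }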
"Gregory Stark" <stark@enterprisedb.com> writes: > So I had a thought about how to soften the controversial hard cutoff of 100 > for the use of the histogram selectivity. Instead of switching 100% one way or > the other between the two heuristics why not calculate both and combine them. > The larger the sample size from the histogram the more we can weight the > histogram calculation. The smaller the histogram size the more we weight the > heuristic. > > My first thought was to scale it linearly so we use 10% of the histogram > sample + 90% of the heuristic for default statistic sizes of 10 samples. That > degenerates to the status quo for 100 samples and up. Incidentally I hacked up a patch to do this: postgres=# create table xx as (select i||'x'||i from generate_series(1,10000) as i(i)); SELECT postgres=# alter table xx alter x set statistics 1; ANALYZE postgres=# analyze xx; ALTER TABLE postgres=# explain analyze select * from xx where x like '%x1%'; QUERY PLAN ----------------------------------------------------------------------------------------------------- Seq Scan on xx (cost=0.00..174.00 rows=2000 width=9) (actual time=0.095..11.814 rows=1112 loops=1) Filter: (x ~~ '%x1%'::text) Total runtime: 13.957 ms (3 rows) postgres=# alter table xx alter x set statistics 10; analyze xx; explain analyze select * from xx where x like '%x1%'; ... Seq Scan on xx (cost=0.00..174.00 rows=1920 width=9) (actual time=0.036..11.454 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 20; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=1820 width=9) (actual time=0.036..11.446 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 50; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=1520 width=9) (actual time=0.036..11.406 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 70; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=1320 width=9) (actual time=0.036..10.725 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 90; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=1220 width=9) (actual time=0.036..10.326 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 100; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=1120 width=9) (actual time=0.037..11.411 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 200; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=1106 width=9) (actual time=0.037..11.328 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 1; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=2000 width=9) (actual time=0.037..11.810 rows=1112 loops=1) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Hi Greg,

>> So I had a thought about how to soften the controversial hard cutoff of 100
>> for the use of the histogram selectivity. Instead of switching 100% one way or
>> the other between the two heuristics why not calculate both and combine them.
>> The larger the sample size from the histogram the more we can weight the
>> histogram calculation. The smaller the histogram size the more we weight the
>> heuristic.
>>
>> My first thought was to scale it linearly so we use 10% of the histogram
>> sample + 90% of the heuristic for default statistic sizes of 10 samples. That
>> degenerates to the status quo for 100 samples and up.
>
> Incidentally I hacked up a patch to do this:

Sounds sensible to me, at least much more so than a hardcoded magic number
that only a few people know about...


Cheers
--
Matteo Beccati

Openads - http://www.openads.org
Gregory Stark <stark@enterprisedb.com> writes:
> "Gregory Stark" <stark@enterprisedb.com> writes:
>> So I had a thought about how to soften the controversial hard cutoff of 100
>> for the use of the histogram selectivity. Instead of switching 100% one way or
>> the other between the two heuristics why not calculate both and combine them.
>> The larger the sample size from the histogram the more we can weight the
>> histogram calculation. The smaller the histogram size the more we weight the
>> heuristic.

> Incidentally I hacked up a patch to do this:

Applied with revisions --- I thought it was better to let the caller of
histogram_selectivity make the decision about how to combine the estimates,
instead of hard-wiring the choice into that subroutine.

			regards, tom lane
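[For contrast with the earlier sketch, here is a hypothetical caller-side
version along the lines Tom describes: the histogram routine only reports its
raw estimate plus how many histogram entries it examined, and the caller
decides whether and how to combine that with the heuristic. The names,
cutoffs, and numbers are illustrative, not the committed selfuncs.c code.]

    #include <stdio.h>

    /* Toy stand-in for the histogram estimator: returns a selectivity and
     * reports, via *hist_size, how many histogram entries it inspected. */
    static double
    toy_histogram_selectivity(int *hist_size)
    {
        *hist_size = 50;        /* pretend the histogram has 50 entries */
        return 0.11;            /* pretend 11% of them matched the pattern */
    }

    int
    main(void)
    {
        double  heuristic_sel = 0.20;   /* pretend the heuristic says 20% */
        int     hist_size;
        double  hist_sel = toy_histogram_selectivity(&hist_size);
        double  selec;

        if (hist_size >= 100)
            selec = hist_sel;           /* big histogram: trust it fully */
        else if (hist_size < 10)
            selec = heuristic_sel;      /* too small: ignore it entirely */
        else
        {
            /* In between: linear blend, weight = hist_size / 100. */
            double  w = hist_size / 100.0;

            selec = w * hist_sel + (1.0 - w) * heuristic_sel;
        }

        printf("hist_size = %d -> selectivity %.4f\n", hist_size, selec);
        return 0;
    }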