Thread: patternsel() and histogram_selectivity() and the hard cutoff of 100
So I had a thought about how to soften the controversial hard cutoff of 100
for the use of the histogram selectivity. Instead of switching 100% one way or
the other between the two heuristics why not calculate both and combine them.
The larger the sample size from the histogram the more we can weight the
histogram calculation. The smaller the histogram size the more we weight the
heuristic.

My first thought was to scale it linearly so we use 10% of the histogram
sample + 90% of the heuristic for default statistic sizes of 10 samples. That
degenerates to the status quo for 100 samples and up.

But actually I wonder if we can't get some solid statistics behind the
percentages. The lower the sample size, the larger the 95% confidence
interval, and we want to use the heuristic to adjust the result within that
confidence interval. I think there are even ways of calculating pessimistic
confidence intervals for unrepresentative samples.

This would allow the results to degrade smoothly. A sample size of 50 would be
expected to give results somewhere between those of 10 and 100, instead of the
current behaviour where the results are exactly the same until you hit 100 and
then suddenly jump to a different value.

I would do it by just making histogram_selectivity() never fail unless the
histogram is smaller than 2 * the ignored values. There would be an additional
double* parameter where the function would store the percentage weight to give
the result. The caller would be responsible for combining the result, just as
it is with the MCV estimates.

--
Gregory Stark
EnterpriseDB  http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
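[For illustration, here is a minimal standalone sketch of the linear blend
described above, assuming a toy estimator that reports its weight through a
double* out parameter. The function names and the 0.11/0.20 selectivities are
made up for this example; this is not the actual selfuncs.c code.]

    #include <stdio.h>

    /*
     * Toy estimator following the proposal: never "fail", just return the
     * histogram-based selectivity and store in *weight how much the caller
     * should trust it (0.1 at 10 histogram entries, ramping linearly to
     * 1.0 at 100 entries and above).
     */
    static double
    toy_histogram_selectivity(int nhist, double *weight)
    {
        *weight = (nhist >= 100) ? 1.0 : nhist / 100.0;
        return 0.11;            /* pretend the histogram says 11% match */
    }

    int
    main(void)
    {
        double  heuristic_sel = 0.20;   /* pretend the heuristic says 20% */
        int     sizes[] = {10, 20, 50, 70, 90, 100, 200};
        int     i;

        for (i = 0; i < 7; i++)
        {
            double  w;
            double  hist_sel = toy_histogram_selectivity(sizes[i], &w);
            double  selec = w * hist_sel + (1.0 - w) * heuristic_sel;

            printf("histogram entries = %3d -> blended selectivity %.4f\n",
                   sizes[i], selec);
        }
        return 0;
    }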
"Gregory Stark" <stark@enterprisedb.com> writes: > So I had a thought about how to soften the controversial hard cutoff of 100 > for the use of the histogram selectivity. Instead of switching 100% one way or > the other between the two heuristics why not calculate both and combine them. > The larger the sample size from the histogram the more we can weight the > histogram calculation. The smaller the histogram size the more we weight the > heuristic. > > My first thought was to scale it linearly so we use 10% of the histogram > sample + 90% of the heuristic for default statistic sizes of 10 samples. That > degenerates to the status quo for 100 samples and up. Incidentally I hacked up a patch to do this: postgres=# create table xx as (select i||'x'||i from generate_series(1,10000) as i(i)); SELECT postgres=# alter table xx alter x set statistics 1; ANALYZE postgres=# analyze xx; ALTER TABLE postgres=# explain analyze select * from xx where x like '%x1%'; QUERY PLAN ----------------------------------------------------------------------------------------------------- Seq Scan on xx (cost=0.00..174.00 rows=2000 width=9) (actual time=0.095..11.814 rows=1112 loops=1) Filter: (x ~~ '%x1%'::text) Total runtime: 13.957 ms (3 rows) postgres=# alter table xx alter x set statistics 10; analyze xx; explain analyze select * from xx where x like '%x1%'; ... Seq Scan on xx (cost=0.00..174.00 rows=1920 width=9) (actual time=0.036..11.454 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 20; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=1820 width=9) (actual time=0.036..11.446 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 50; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=1520 width=9) (actual time=0.036..11.406 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 70; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=1320 width=9) (actual time=0.036..10.725 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 90; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=1220 width=9) (actual time=0.036..10.326 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 100; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=1120 width=9) (actual time=0.037..11.411 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 200; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=1106 width=9) (actual time=0.037..11.328 rows=1112 loops=1) ... postgres=# alter table xx alter x set statistics 1; analyze xx; explain analyze select * from xx where x like '%x1%'; Seq Scan on xx (cost=0.00..174.00 rows=2000 width=9) (actual time=0.037..11.810 rows=1112 loops=1) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Hi Greg,

>> So I had a thought about how to soften the controversial hard cutoff of 100
>> for the use of the histogram selectivity. Instead of switching 100% one way or
>> the other between the two heuristics why not calculate both and combine them.
>> The larger the sample size from the histogram the more we can weight the
>> histogram calculation. The smaller the histogram size the more we weight the
>> heuristic.
>>
>> My first thought was to scale it linearly so we use 10% of the histogram
>> sample + 90% of the heuristic for default statistic sizes of 10 samples. That
>> degenerates to the status quo for 100 samples and up.
>
> Incidentally I hacked up a patch to do this:

Sounds sensible to me, at least much more so than a hardcoded magic number
that only a few people know about...


Cheers
--
Matteo Beccati

Openads - http://www.openads.org
Gregory Stark <stark@enterprisedb.com> writes:
> "Gregory Stark" <stark@enterprisedb.com> writes:
>> So I had a thought about how to soften the controversial hard cutoff of 100
>> for the use of the histogram selectivity. Instead of switching 100% one way or
>> the other between the two heuristics why not calculate both and combine them.
>> The larger the sample size from the histogram the more we can weight the
>> histogram calculation. The smaller the histogram size the more we weight the
>> heuristic.

> Incidentally I hacked up a patch to do this:

Applied with revisions --- I thought it was better to let the caller of
histogram_selectivity make the decision about how to combine the estimates,
instead of hard-wiring the choice into that subroutine.

			regards, tom lane
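[For contrast with the earlier sketch, here is a hypothetical caller-side
version along the lines Tom describes: the histogram routine only reports its
raw estimate plus how many histogram entries it examined, and the caller
decides whether and how to combine that with the heuristic. The names,
cutoffs, and numbers are illustrative, not the committed selfuncs.c code.]

    #include <stdio.h>

    /* Toy stand-in for the histogram estimator: returns a selectivity and
     * reports, via *hist_size, how many histogram entries it inspected. */
    static double
    toy_histogram_selectivity(int *hist_size)
    {
        *hist_size = 50;        /* pretend the histogram has 50 entries */
        return 0.11;            /* pretend 11% of them matched the pattern */
    }

    int
    main(void)
    {
        double  heuristic_sel = 0.20;   /* pretend the heuristic says 20% */
        int     hist_size;
        double  hist_sel = toy_histogram_selectivity(&hist_size);
        double  selec;

        if (hist_size >= 100)
            selec = hist_sel;           /* big histogram: trust it fully */
        else if (hist_size < 10)
            selec = heuristic_sel;      /* too small: ignore it entirely */
        else
        {
            /* In between: linear blend, weight = hist_size / 100. */
            double  w = hist_size / 100.0;

            selec = w * hist_sel + (1.0 - w) * heuristic_sel;
        }

        printf("hist_size = %d -> selectivity %.4f\n", hist_size, selec);
        return 0;
    }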