Thread: Row number estimation...
Hi, Yesterday I was trying to calculate by hand the row number estimates using the examples from http://www.postgresql.org/docs/9.3/static/row-estimation-examples.html It is usually correct, except when you try to get an estimate near the first (or last) histogram bound. Let me demonstrate with an example: create table a( id serial not null ); insert into a(id) (select i from generate_series(1,1000000) i); analyse a; select histogram_bounds from pg_stats where tablename = 'a'; -- "{38,9399,19852,...}" explain analyse select * from a where id <= 2000; -- Expected 2095.929, got 2096. This is OK. explain analyse select * from a where id <= 50; -- Expected 12.819, got 100. explain analyse select * from a where id <= 10; -- Expected -29.911, or better 1, got 100. The estimate should not be < 0, so I was expecting at least 1. It seems that when the selectivity calculated using the histogram is lower than 0.0001 or higher than 0.9999 it is capped to this values. Looking at the code in selfuncs.c it seems that is doing just that. The thing that intrigued me was what happened when I added an index (primary key) to the column. alter table a add primary key (id); explain analyse select * from a where id <= 2000; -- expected 2096, got 2127 explain analyse select * from a where id <= 50; -- expected 100, got 52 explain analyse select * from a where id <= 10; -- expected 100, got 10 As no statistics are collected for the index (as stated in the pg_statistic documentation "No entry is made for an ordinary non-expression index column, however, since it would be redundant with the entry for the underlying table column.") it is mystery to me how these estimates are calculated. I tried to look at the code (my guess was that I could find it in btcostestimate) but I wasn't able to figure it out. Can somebody explain how are the estimates calculated when there is an index involved? Regards, Mladen Marinović
<marin@kset.org> writes: > Yesterday I was trying to calculate by hand the row number estimates using > the examples from > http://www.postgresql.org/docs/9.3/static/row-estimation-examples.html > It is usually correct, except when you try to get an estimate near the > first (or last) histogram bound. Let me demonstrate with an example: > ... > As no statistics are collected for the index (as stated in the > pg_statistic documentation "No entry is made for an ordinary non-expression > index column, however, since it would be redundant with the entry for the > underlying table column.") it is mystery to me how these estimates are > calculated. I tried to look at the code (my guess was that I could find it > in btcostestimate) but I wasn't able to figure it out. scalarineqsel() is where to look. > Can somebody explain how are the estimates calculated when there is an > index involved? If the comparison value is outside the range recorded in the histogram, and there's a suitable index available, the planner uses the index to find out the actual column min or max rather than believing the histogram completely. See get_actual_variable_range(). Possibly this behavior ought to be mentioned in the docs ... regards, tom lane
On Sat, 18 Oct 2014 11:34:18 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote: > If the comparison value is outside the range recorded in the histogram, > and there's a suitable index available, the planner uses the index to > find out the actual column min or max rather than believing the histogram > completely. See get_actual_variable_range(). Substituting the first histogram value with 1 (the lowest value in the column/index) and redoing the calculation give the same numbers as the EXPLAIN ANALISE does. But how so that the selectivity is now not capped to 0.0001? Without the index I couldn't get an estimate lower than 100 rows, but with the index the estimates go all the way down to 1. > Possibly this behavior ought to be mentioned in the docs ... It would help a lot :) Regards, Mladen Marinović