Thread: Row count estimation bug in BETWEEN?
Hi. I've noticed strange row count estimations in BETWEEN predicate: ------ -- I'm using this: SELECT version(); -->> PostgreSQL 9.4.1, compiled by Visual C++ build 1800, 32-bit -- Prepare table: CREATE TABLE t1 ( id bigserial NOT NULL, f0 bigint NOT NULL, f1 bigint NOT NULL, CONSTRAINT t1_pkey PRIMARY KEY (id) ); INSERT INTO t1(f0, f1) SELECT f0.n, f1.n FROM generate_series(1,1000) AS f0(n), generate_series(1,1000) AS f1(n), generate_series(1,11); CREATE INDEX ON t1(f0); CREATE INDEX ON t1(f1); VACUUM ANALYZE t1; -- Real count: SELECT COUNT(*) FROM t1 WHERE f1 = 42; -->> 11000 ---- Estimation 1: EXPLAIN SELECT * FROM t1 WHERE f1 = 42; -->> Bitmap Heap Scan on t1 (cost=203.07..28792.94 rows=10662 width=24) <skipped> EXPLAIN SELECT * FROM t1 WHERE f1 BETWEEN 42 AND 42; -->> Index Scan using table1_field1_idx on t1 (cost=0.44..8.46 rows=1 width=24) <skipped> ------ Why row count estimations for two logically equivalent queries are so different?
=?koi8-r?B?/cXLyc4g8dLP08zB1w==?= <ladayaroslav@yandex.ru> writes: > I've noticed strange row count estimations in BETWEEN predicate: > EXPLAIN > SELECT * > FROM t1 > WHERE f1 = 42; > -->> Bitmap Heap Scan on t1 (cost=203.07..28792.94 rows=10662 width=24) <skipped> > EXPLAIN > SELECT * > FROM t1 > WHERE f1 BETWEEN 42 AND 42; > -->> Index Scan using table1_field1_idx on t1 (cost=0.44..8.46 rows=1 width=24) <skipped> > Why row count estimations for two logically equivalent queries are so > different? PG doesn't try to estimate inequalities exactly, because it usually doesn't make enough of a difference to matter. Currently we don't even bother to distinguish say ">" from ">=" for estimation purposes, though certainly we would need to in order to deal with zero-width ranges with any great amount of precision. regards, tom lane
Tom Lane-2 wrote > PG doesn't try to estimate inequalities exactly, because it usually > doesn't make enough of a difference to matter. Currently we don't > even bother to distinguish say ">" from ">=" for estimation purposes, > though certainly we would need to in order to deal with zero-width ranges > with any great amount of precision. Thank you for your answer! I'm sorry, but after looking into documentation and sources (scalarineqsel function in selfuncs.c, clauselist_selectivity and addRangeClause functions in clausesel.c) and experimenting a little I've got an impression that PostgreSQL actually bothers to distinguish ">" from ">=" for estimation purposes sometimes (probably, when MCV is used), but in my example it uses histogram and indeed doesn't distinguish them. My simple test (using MCVs) is below: ----- CREATE TABLE t2(n int); INSERT INTO t2(n) VALUES (0),(0),(0),(0),(1),(1),(1),(1),(2),(2),(2),(2); ANALYZE t2; EXPLAIN SELECT * FROM t2 WHERE n > 0 AND n < 2 -- rows=4 EXPLAIN SELECT * FROM t2 WHERE n > 0 AND n < 2 -- rows=12 ------ Looking further, I found ineq_histogram_selectivity function in selfuncs.c, and this fragment seems relevant: ----- /* * We have values[i-1] <= constant <= values[i]. * * Convert the constant and the two nearest bin boundary * values to a uniform comparison scale, and do a linear * interpolation within this bin. */ <skip> binfrac = (val - low) / (high - low); ----- And now I'm stuck. Can ">" operators can be distinguished from ">=" operators at this point? ----- WBR, Yaroslav Schekin. -- View this message in context: http://postgresql.nabble.com/Row-count-estimation-bug-in-BETWEEN-tp5853687p5853725.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
> My simple test (using MCVs) is below ... I've posted wrong second query in the test, should have been: EXPLAIN SELECT * FROM t2 WHERE n >= 0 AND n <= 2 -- rows=12 ----- WBR, Yaroslav Schekin. -- View this message in context: http://postgresql.nabble.com/Row-count-estimation-bug-in-BETWEEN-tp5853687p5853771.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
Yaroslav <ladayaroslav@yandex.ru> writes: > Tom Lane-2 wrote >> PG doesn't try to estimate inequalities exactly, because it usually >> doesn't make enough of a difference to matter. Currently we don't >> even bother to distinguish say ">" from ">=" for estimation purposes, >> though certainly we would need to in order to deal with zero-width ranges >> with any great amount of precision. > Thank you for your answer! > I'm sorry, but after looking into documentation and sources > (scalarineqsel function in selfuncs.c, clauselist_selectivity and > addRangeClause functions in clausesel.c) and experimenting a little I've > got an impression that PostgreSQL actually bothers to distinguish ">" > from ">=" for estimation purposes sometimes (probably, when MCV is > used), but in my example it uses histogram and indeed doesn't > distinguish them. Well, I was oversimplifying a bit. When testing the MCV list we use the original operator, so that if the comparison constant is equal to some MCV entry, it will indeed matter whether you said ">" or ">=". When dealing with the histogram, however, we don't pay attention to the difference. The assumption is that the histogram represents a continuous distribution of values in which no one value occurs often enough to be interesting (if it did, it would be in the MCV list...). Therefore it does not matter much whether any specific histogram entry is exactly "=". And of course, for comparison values that are between histogram entries, we have no idea whatsoever whether there are any "=" entries in the table; so even if the code did distinguish ">" from ">=", it would be unclear what to do with the knowledge. regards, tom lane
Tom Lane-2 wrote > The assumption is that the histogram represents a > continuous distribution of values in which no one value occurs often > enough to be interesting (if it did, it would be in the MCV list...). > Therefore it does not matter much whether any specific histogram entry > is exactly "=". And of course, for comparison values that are between > histogram entries, we have no idea whatsoever whether there are any > "=" entries in the table; This assumption is correct for continuous types, but in my example the type (bigint) is discrete (as some other types like date, numerics (with defined scale) and even varchar/text are), so the assumption is wrong for it. Tom Lane-2 wrote > so even if the code did distinguish ">" from > ">=", it would be unclear what to do with the knowledge. The vague idea that popped into my head is: As the code in convert_to_scalar already switches on the value type, a flag to distinguish ">=" operators from ">" operators could be added there. It would use the equality of "a > const::sometype" to "a >= next_value(const::sometype)", i.e. that "a > 2::int" equals "a >= 3::int". So, corresponding convert_to... functions would use "value+1" instead of "value" in my case, next date if the type is date, "value+0.01" if type is numeric(n, 2), etc. IMHO, the problem with these estimations is that they are horribly off. I've searched the archives, and it seems that PostgreSQL's users are bitten by it sometimes, like: http://www.postgresql.org/message-id/4583.1358289018@sss.pgh.pa.us. ----- WBR, Yaroslav Schekin. -- View this message in context: http://postgresql.nabble.com/Row-count-estimation-bug-in-BETWEEN-tp5853687p5853938.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.