Thread: Row count estimation bug in BETWEEN?

Row count estimation bug in BETWEEN?

From

Щекин Ярослав

Date:

13 June 2015, 09:20:34

Hi.

I've noticed strange row count estimations in BETWEEN predicate:
------
-- I'm using this:
SELECT version();
-->> PostgreSQL 9.4.1, compiled by Visual C++ build 1800, 32-bit
-- Prepare table:
CREATE TABLE t1 (
id bigserial NOT NULL,
f0 bigint NOT NULL,
f1 bigint NOT NULL,
CONSTRAINT t1_pkey PRIMARY KEY (id)
);

INSERT INTO t1(f0, f1)
SELECT f0.n, f1.n
  FROM generate_series(1,1000) AS f0(n),
       generate_series(1,1000) AS f1(n),
       generate_series(1,11);
CREATE INDEX ON t1(f0);
CREATE INDEX ON t1(f1);
VACUUM ANALYZE t1;
-- Real count:
SELECT COUNT(*)
  FROM t1
 WHERE f1 = 42;
-->> 11000
---- Estimation 1:
EXPLAIN
SELECT *
  FROM t1
 WHERE f1 = 42;
-->> Bitmap Heap Scan on t1  (cost=203.07..28792.94 rows=10662 width=24) <skipped>

EXPLAIN
SELECT *
  FROM t1
 WHERE f1 BETWEEN 42 AND 42;
-->> Index Scan using table1_field1_idx on t1  (cost=0.44..8.46 rows=1 width=24) <skipped>

------
Why row count estimations for two logically equivalent queries are so different?

Re: Row count estimation bug in BETWEEN?

From

Tom Lane

Date:

13 June 2015, 16:35:28

=?koi8-r?B?/cXLyc4g8dLP08zB1w==?= <ladayaroslav@yandex.ru> writes:
> I've noticed strange row count estimations in BETWEEN predicate:

> EXPLAIN
> SELECT *
>   FROM t1
>  WHERE f1 = 42;
> -->> Bitmap Heap Scan on t1  (cost=203.07..28792.94 rows=10662 width=24) <skipped>

> EXPLAIN
> SELECT *
>   FROM t1
>  WHERE f1 BETWEEN 42 AND 42;
> -->> Index Scan using table1_field1_idx on t1  (cost=0.44..8.46 rows=1 width=24) <skipped>

> Why row count estimations for two logically equivalent queries are so
> different?

PG doesn't try to estimate inequalities exactly, because it usually
doesn't make enough of a difference to matter.  Currently we don't
even bother to distinguish say ">" from ">=" for estimation purposes,
though certainly we would need to in order to deal with zero-width ranges
with any great amount of precision.

            regards, tom lane

Re: Row count estimation bug in BETWEEN?

From

Yaroslav

Date:

13 June 2015, 20:51:58

Tom Lane-2 wrote
> PG doesn't try to estimate inequalities exactly, because it usually
> doesn't make enough of a difference to matter.  Currently we don't
> even bother to distinguish say ">" from ">=" for estimation purposes,
> though certainly we would need to in order to deal with zero-width ranges
> with any great amount of precision.

Thank you for your answer!

I'm sorry, but after looking into documentation and sources (scalarineqsel
function in selfuncs.c, clauselist_selectivity and addRangeClause functions
in
clausesel.c) and experimenting a little I've got an impression that
PostgreSQL
actually bothers to distinguish ">" from ">=" for estimation purposes
sometimes (probably, when MCV is used), but in my example it uses histogram
and indeed doesn't distinguish them.

My simple test (using MCVs) is below:
-----
CREATE TABLE t2(n int);
INSERT INTO t2(n) VALUES (0),(0),(0),(0),(1),(1),(1),(1),(2),(2),(2),(2);
ANALYZE t2;

EXPLAIN SELECT * FROM t2 WHERE n > 0 AND n < 2
-- rows=4
EXPLAIN SELECT * FROM t2 WHERE n > 0 AND n < 2
-- rows=12
------

Looking further, I found ineq_histogram_selectivity function in selfuncs.c,
and this fragment seems relevant:
-----
/*
 * We have values[i-1] <= constant <= values[i].
 *
 * Convert the constant and the two nearest bin boundary
 * values to a uniform comparison scale, and do a linear
 * interpolation within this bin.
 */
<skip>
binfrac = (val - low) / (high - low);
-----
And now I'm stuck. Can ">" operators can be distinguished from ">="
operators at this point?




-----
WBR, Yaroslav Schekin.
--
View this message in context: http://postgresql.nabble.com/Row-count-estimation-bug-in-BETWEEN-tp5853687p5853725.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: Row count estimation bug in BETWEEN?

From

Yaroslav

Date:

14 June 2015, 10:59:13

> My simple test (using MCVs) is below ...
I've posted wrong second query in the test, should have been:
EXPLAIN SELECT * FROM t2 WHERE n >= 0 AND n <= 2
-- rows=12



-----
WBR, Yaroslav Schekin.
--
View this message in context: http://postgresql.nabble.com/Row-count-estimation-bug-in-BETWEEN-tp5853687p5853771.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: Row count estimation bug in BETWEEN?

From

Tom Lane

Date:

14 June 2015, 16:09:12

Yaroslav <ladayaroslav@yandex.ru> writes:
> Tom Lane-2 wrote
>> PG doesn't try to estimate inequalities exactly, because it usually
>> doesn't make enough of a difference to matter.  Currently we don't
>> even bother to distinguish say ">" from ">=" for estimation purposes,
>> though certainly we would need to in order to deal with zero-width ranges
>> with any great amount of precision.

> Thank you for your answer!

> I'm sorry, but after looking into documentation and sources
> (scalarineqsel function in selfuncs.c, clauselist_selectivity and
> addRangeClause functions in clausesel.c) and experimenting a little I've
> got an impression that PostgreSQL actually bothers to distinguish ">"
> from ">=" for estimation purposes sometimes (probably, when MCV is
> used), but in my example it uses histogram and indeed doesn't
> distinguish them.

Well, I was oversimplifying a bit.  When testing the MCV list we use the
original operator, so that if the comparison constant is equal to some
MCV entry, it will indeed matter whether you said ">" or ">=".  When
dealing with the histogram, however, we don't pay attention to the
difference.  The assumption is that the histogram represents a
continuous distribution of values in which no one value occurs often
enough to be interesting (if it did, it would be in the MCV list...).
Therefore it does not matter much whether any specific histogram entry
is exactly "=".  And of course, for comparison values that are between
histogram entries, we have no idea whatsoever whether there are any
"=" entries in the table; so even if the code did distinguish ">" from
">=", it would be unclear what to do with the knowledge.

            regards, tom lane

Re: Row count estimation bug in BETWEEN?

From

Yaroslav

Date:

15 June 2015, 18:56:04

Tom Lane-2 wrote
> The assumption is that the histogram represents a
> continuous distribution of values in which no one value occurs often
> enough to be interesting (if it did, it would be in the MCV list...).
> Therefore it does not matter much whether any specific histogram entry
> is exactly "=".  And of course, for comparison values that are between
> histogram entries, we have no idea whatsoever whether there are any
> "=" entries in the table;

This assumption is correct for continuous types,
but in my example the type (bigint) is discrete (as some other types like
date,
numerics (with defined scale) and even varchar/text are), so the assumption
is wrong for it.


Tom Lane-2 wrote
> so even if the code did distinguish ">" from
> ">=", it would be unclear what to do with the knowledge.

The vague idea that popped into my head is:

As the code in convert_to_scalar already switches on the value type, a flag
to
distinguish ">=" operators from ">" operators could be added there. It would
use the equality of "a > const::sometype" to "a >=
next_value(const::sometype)",
i.e. that "a > 2::int" equals "a >= 3::int". So, corresponding convert_to...
functions would use "value+1" instead of "value" in my case, next date if
the type is date, "value+0.01" if type is numeric(n, 2), etc.

IMHO, the problem with these estimations is that they are horribly off.
I've searched the archives, and it seems that PostgreSQL's users are bitten
by it sometimes,
like: http://www.postgresql.org/message-id/4583.1358289018@sss.pgh.pa.us.







-----
WBR, Yaroslav Schekin.
--
View this message in context: http://postgresql.nabble.com/Row-count-estimation-bug-in-BETWEEN-tp5853687p5853938.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.