Thread: BUG #18643: EXPLAIN estimated rows mismatch
The following bug has been logged on the website:

Bug reference:      18643
Logged by:          ming wei tan
Email address:      mingwei.tc@gmail.com
PostgreSQL version: 12.20
Operating system:   Debian 12.20-1.pgdg120+1 (Docker)
Description:

Given predicates A and B, it is expected that
size (SELECT X WHERE A) <= size (SELECT X WHERE A OR B)

However, `EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0)` returns
rows=2537:
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on t2  (cost=0.00..35.50 rows=2537 width=4)
   Filter: (c0 IS NOT NULL)

While `EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 > 4)`
returns rows=858:
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on t2  (cost=0.00..48.25 rows=858 width=4)
   Filter: ((c0 = c0) OR (c0 > 4))

Running on docker image postgres:12.

Test cases:

DROP DATABASE IF EXISTS database4;
CREATE DATABASE database4 WITH ENCODING 'UTF8' TEMPLATE template0;
\c database4
CREATE TABLE t2(c0 int);
INSERT INTO t2(c0) VALUES(1);
INSERT INTO t2(c0) VALUES(2);
EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0);
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 > 4);

DROP DATABASE IF EXISTS database4;
CREATE DATABASE database4 WITH ENCODING 'UTF8' TEMPLATE template0;
\c database4
CREATE TABLE t2(c0 int);
INSERT INTO t2(c0) VALUES(1);
INSERT INTO t2(c0) VALUES(2);
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0) = (t2.c0);
EXPLAIN SELECT t2.c0 FROM t2 WHERE ((t2.c0) = (t2.c0)) OR (t2.c0 > 4);

PG Bug reporting form <noreply@postgresql.org> writes:
> Given predicates A and B, it is expected that
> size (SELECT X WHERE A) <= size (SELECT X WHERE A OR B)
> However, `EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0)` returns
> rows=2537

I don't see any particular bug here. If you look closely at the
EXPLAIN output, you'll see that "t2.c0 IN (t2.c0)" is transformed
to "c0 IS NOT NULL" --- but only if it's at top level. So we're
estimating selectivities for two quite different conditions in
this example.

The NOT NULL bit happens because a top-level equality clause
is transformed into an "EquivalenceClass", and then when we
notice the class has only one member, we prefer to spit out
"x IS NOT NULL" rather than "x = x". That has the same effect
(at top level of WHERE, anyway) and tends to be estimated
more accurately.

In any case, in this toy example that lacks an ANALYZE step,
the selectivity estimates are mostly going to be garbage.

			regards, tom lane

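For illustration, a minimal sketch of that difference, assuming a scratch
table t3 with some NULL values and fresh statistics (the table and data are
invented for the example; the exact estimates will vary):

CREATE TABLE t3(c0 int);
-- half of the rows are NULL
INSERT INTO t3
SELECT CASE WHEN i % 2 = 0 THEN i END
FROM generate_series(1, 1000) AS i;
ANALYZE t3;

-- top level: rewritten to "c0 IS NOT NULL" and estimated from the
-- null fraction, so roughly half the table
EXPLAIN SELECT c0 FROM t3 WHERE c0 = c0;

-- inside an OR: stays "c0 = c0" and is estimated like an ordinary
-- equality condition, giving a much smaller row count
EXPLAIN SELECT c0 FROM t3 WHERE (c0 = c0) OR (c0 > 1000000);
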
On 1/10/2024 18:43, Tom Lane wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
>> Given predicates A and B, it is expected that
>> size (SELECT X WHERE A) <= size (SELECT X WHERE A OR B)
>> However, `EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0)` returns
>> rows=2537
>
> I don't see any particular bug here. If you look closely at the
> EXPLAIN output, you'll see that "t2.c0 IN (t2.c0)" is transformed
> to "c0 IS NOT NULL" --- but only if it's at top level. So we're
> estimating selectivities for two quite different conditions in
> this example.
>
> The NOT NULL bit happens because a top-level equality clause
> is transformed into an "EquivalenceClass", and then when we
> notice the class has only one member, we prefer to spit out
> "x IS NOT NULL" rather than "x = x". That has the same effect
> (at top level of WHERE, anyway) and tends to be estimated
> more accurately.

I think their question was about why 'x IN (x)' transforms differently
at the top level and inside the OR clause. It is a pretty typical
question.

-- 
regards, Andrei Lepikhov

On 1/10/2024 18:43, Tom Lane wrote:
> In any case, in this toy example that lacks an ANALYZE step,
> the selectivity estimates are mostly going to be garbage.
Thanks for the replies. I'm just checking whether what I'm seeing here
is a bug. Even with ANALYZE, the first EXPLAIN estimates more rows
compared to the second, even though the second WHERE clause is
less restrictive.
ANALYZE;
ANALYZE
EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0);
QUERY PLAN
--------------------------------------------------
Seq Scan on t2 (cost=0.00..1.02 rows=2 width=4)
Filter: (c0 IS NOT NULL)
(2 rows)
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 > 4);
QUERY PLAN
--------------------------------------------------
Seq Scan on t2 (cost=0.00..1.03 rows=1 width=4)
Filter: ((c0 = c0) OR (c0 > 4))
(2 rows)
DROP DATABASE IF EXISTS database4;
CREATE DATABASE database4 WITH ENCODING 'UTF8' TEMPLATE template0;
\c database4
CREATE TABLE t2(c0 int);
INSERT INTO t2(c0) VALUES(1);
INSERT INTO t2(c0) VALUES(2);
ANALYZE;
EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0);
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 > 4);
Regards,
Ming Wei
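(A rough sketch of where those two estimates plausibly come from, assuming
the usual selectivity arithmetic; the numbers below are a reconstruction,
not taken from the thread. "c0 IS NOT NULL" is estimated as
reltuples * (1 - null_frac) = 2 * 1.0 = 2 rows, while the untransformed
"c0 = c0" inside the OR is estimated like an ordinary equality, roughly
1/ndistinct = 0.5, and "c0 > 4" contributes almost nothing, so the OR works
out to about 2 * 0.5 = 1 row.)
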
On 1/10/2024 09:47, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      18643
> Logged by:          ming wei tan
> Email address:      mingwei.tc@gmail.com
> PostgreSQL version: 12.20
> Operating system:   Debian 12.20-1.pgdg120+1 (Docker)
> Description:
>
> Given predicates A and B, it is expected that
> size (SELECT X WHERE A) <= size (SELECT X WHERE A OR B)
>
> However, `EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0)` returns
> rows=2537
>                       QUERY PLAN
> ------------------------------------------------------
>  Seq Scan on t2  (cost=0.00..35.50 rows=2537 width=4)
>    Filter: (c0 IS NOT NULL)
>
> While `EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 >
> 4)` returns rows=858
>                      QUERY PLAN
> -----------------------------------------------------
>  Seq Scan on t2  (cost=0.00..48.25 rows=858 width=4)
>    Filter: ((c0 = c0) OR (c0 > 4))

I found a link that is relevant to your case:
https://www.postgresql.org/message-id/flat/CAMjNa7cC4X9YR-vAJS-jSYCajhRDvJQnN7m2sLH1wLh-_Z2bsw%40mail.gmail.com
There you can find an answer to your question. In general, it is too
expensive to test each clause (especially a non-top-level one) for
possible evaluation into a constant expression.

-- 
regards, Andrei Lepikhov

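A small illustration of that cost distinction, using the reporter's toy
table (the expected plans are sketched from the behaviour described in this
thread, not copied from real output): a genuinely constant sub-clause is
folded anywhere in the expression tree, because that only requires
evaluating an immutable operator over constants, whereas "c0 = c0" would
need the extra checks discussed later in the thread.

-- the constant comparison is simplified even inside the OR,
-- so the filter should disappear entirely
EXPLAIN SELECT t2.c0 FROM t2 WHERE (1 = 1) OR (t2.c0 > 4);

-- the column-to-itself comparison is left alone inside the OR
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 = t2.c0) OR (t2.c0 > 4);
--   Filter: ((c0 = c0) OR (c0 > 4))
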
On Thu, 3 Oct 2024 at 07:17, ming wei tan <mingwei.tc@gmail.com> wrote:
>
> On 1/10/2024 18:43, Tom Lane wrote:
> > In any case, in this toy example that lacks an ANALYZE step,
> > the selectivity estimates are mostly going to be garbage.
>
> Thanks for the replies. I'm just checking whether what I'm seeing here
> is a bug. Even with ANALYZE, the first EXPLAIN estimates more rows
> compared to the second, even though the second WHERE clause is
> less restrictive.

I think you already checked that, and Tom answered with the reason
that this happens.

We certainly could do better here but, as Tom mentioned, your example
does not seem convincing enough to warrant much effort.

You should read the commit message in [1], as that might help you
understand the project's point of view on this sort of optimisation.
See in particular the first sentence of the second paragraph.

David

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=8ec5429e2f422f4d570d4909507db0d4ca83bbac

On 10/3/24 03:57, David Rowley wrote:
> On Thu, 3 Oct 2024 at 07:17, ming wei tan <mingwei.tc@gmail.com> wrote:
>>
>> On 1/10/2024 18:43, Tom Lane wrote:
>>> In any case, in this toy example that lacks an ANALYZE step,
>>> the selectivity estimates are mostly going to be garbage.
>>
>> Thanks for the replies. I'm just checking whether what I'm seeing here
>> is a bug. Even with ANALYZE, the first EXPLAIN estimates more rows
>> compared to the second, even though the second WHERE clause is
>> less restrictive.
>
> I think you already checked that, and Tom answered with the reason
> that this happens.
>
> We certainly could do better here but, as Tom mentioned, your example
> does not seem convincing enough to warrant much effort.
>
> You should read the commit message in [1], as that might help you
> understand the project's point of view on this sort of optimisation.
> See in particular the first sentence of the second paragraph.

I can agree with the underlying reasoning. But the presence of prepqual.c
makes it less clear-cut: the optimiser already attempts to remove
duplicated OR arms in the find_duplicate_ors function, and it does so in
a somewhat surprising way:

explain SELECT oid,relname FROM pg_class WHERE oid=1 OR oid=1;
 Index Scan using pg_class_oid_index on pg_class
   Index Cond: (oid = '1'::oid)

explain SELECT oid,relname FROM pg_class WHERE oid=1 OR oid=1 OR oid=2;
 Seq Scan on pg_class
   Filter: ((oid = '1'::oid) OR (oid = '1'::oid) OR (oid = 2))

So we already walk through the OR clauses. Why not check for semi-equal
clauses there and remove duplicates even when not all of the clauses are
such duplicates? At the very least, the current behaviour continually
raises users' questions.

-- 
regards, Andrei Lepikhov

Andrei Lepikhov <lepihov@gmail.com> writes:
> So we already walk through the OR clauses. Why not check for semi-equal
> clauses there and remove duplicates even when not all of the clauses are
> such duplicates? At the very least, the current behaviour continually
> raises users' questions.

It's difficult to justify spending extra planner cycles to optimize
what are fundamentally stupidly-written queries. Who writes "X=X"
in the first place? (Other than ORM authors who need to spend some
time in a re-education camp.) And it would not be a trivial number
of extra cycles, either. As pointed out in the commit message
David mentioned, it's basically free to make this improvement
when we're looking at a potential EquivalenceClass clause.
We've already paid the cost of checking that the operator is a btree
equality operator, and we know that the clause is at top level of
WHERE (else we couldn't fuzz over the difference between false and
null results), and besides we have to check whether it's "X=X" because
not doing so causes some semantic problems for the EquivalenceClass
machinery. In the case of a random clause-underneath-OR, we would
have to make a brand new check whether it's btree equality, and we
would have to somehow track whether we had descended to an expression
level where "false and null are known equivalent" is no longer true.
So I really doubt that a case can be made that that is worth doing.

			regards, tom lane

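For what it's worth, a sketch of a query-side workaround, under the
assumption that the redundant branch is really meant as a NULL test:
spelling it as IS NOT NULL explicitly sends both forms through the same
estimation path, and the OR estimate can then no longer come out below the
single-branch estimate (an OR's selectivity is computed as s1 + s2 - s1*s2,
which is never less than s1):

EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IS NOT NULL;
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IS NOT NULL) OR (t2.c0 > 4);
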
On 10/3/24 11:40, Tom Lane wrote:
> Andrei Lepikhov <lepihov@gmail.com> writes:
>> So we already walk through the OR clauses. Why not check for semi-equal
>> clauses there and remove duplicates even when not all of the clauses are
>> such duplicates? At the very least, the current behaviour continually
>> raises users' questions.
>
> It's difficult to justify spending extra planner cycles to optimize
> what are fundamentally stupidly-written queries. Who writes "X=X"
> in the first place? (Other than ORM authors who need to spend some
> time in a re-education camp.)

I don't dispute those arguments as such. My point is that ORMs and other
query generators (I sometimes see data analysts using an AI to query a
database) are already essential to development, at least in startups. It
may not be appropriate to handle such cases in the core, but it is too
costly to do it in an extension right now. In that sense 'silicon' users
aren't on the same footing as the smarter carbon-based ones. Here I see
two options:
1. A canonicalize_qual_hook - the bluntest approach. Of course, with this
approach we would need more hooks in the near future.
2. An 'ORM' GUC that enables multiple optimisations in the core,
'smoothing' over users' mistakes.
It would be great to discuss other options.

> And it would not be a trivial number
> of extra cycles, either. As pointed out in the commit message
> David mentioned, it's basically free to make this improvement
> when we're looking at a potential EquivalenceClass clause.
> We've already paid the cost of checking that the operator is a btree
> equality operator, and we know that the clause is at top level of
> WHERE (else we couldn't fuzz over the difference between false and
> null results), and besides we have to check whether it's "X=X" because
> not doing so causes some semantic problems for the EquivalenceClass
> machinery. In the case of a random clause-underneath-OR, we would
> have to make a brand new check whether it's btree equality, and we
> would have to somehow track whether we had descended to an expression
> level where "false and null are known equivalent" is no longer true.
> So I really doubt that a case can be made that that is worth doing.

Thanks, this explanation is quite valuable to me.

-- 
regards, Andrei Lepikhov