Thread: BUG #18643: EXPLAIN estimated rows mismatch
The following bug has been logged on the website:

Bug reference:      18643
Logged by:          ming wei tan
Email address:      mingwei.tc@gmail.com
PostgreSQL version: 12.20
Operating system:   Debian 12.20-1.pgdg120+1 (Docker)
Description:

Given predicates A and B, it is expected that
size (SELECT X WHERE A) <= size (SELECT X WHERE A OR B)

However, `EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0)` returns
rows=2537:
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on t2  (cost=0.00..35.50 rows=2537 width=4)
   Filter: (c0 IS NOT NULL)

While `EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 > 4)`
returns rows=858:
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on t2  (cost=0.00..48.25 rows=858 width=4)
   Filter: ((c0 = c0) OR (c0 > 4))

Running on docker image postgres:12.

Test cases:

DROP DATABASE IF EXISTS database4;
CREATE DATABASE database4 WITH ENCODING 'UTF8' TEMPLATE template0;
\c database4
CREATE TABLE t2(c0 int);
INSERT INTO t2(c0) VALUES(1);
INSERT INTO t2(c0) VALUES(2);
EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0);
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 > 4);

DROP DATABASE IF EXISTS database4;
CREATE DATABASE database4 WITH ENCODING 'UTF8' TEMPLATE template0;
\c database4
CREATE TABLE t2(c0 int);
INSERT INTO t2(c0) VALUES(1);
INSERT INTO t2(c0) VALUES(2);
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0) = (t2.c0);
EXPLAIN SELECT t2.c0 FROM t2 WHERE ((t2.c0) = (t2.c0)) OR (t2.c0 > 4);

PG Bug reporting form <noreply@postgresql.org> writes:
> Given predicates A and B, it is expected that
> size (SELECT X WHERE A) <= size (SELECT X WHERE A OR B)
> However, `EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0)` returns
> rows=2537

I don't see any particular bug here. If you look closely at the
EXPLAIN output, you'll see that "t2.c0 IN (t2.c0)" is transformed
to "c0 IS NOT NULL" --- but only if it's at top level. So we're
estimating selectivities for two quite different conditions in
this example.

The NOT NULL bit happens because a top-level equality clause
is transformed into an "EquivalenceClass", and then when we
notice the class has only one member, we prefer to spit out
"x IS NOT NULL" rather than "x = x". That has the same effect
(at top level of WHERE, anyway) and tends to be estimated
more accurately.

In any case, in this toy example that lacks an ANALYZE step,
the selectivity estimates are mostly going to be garbage.

			regards, tom lane

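For illustration, a minimal sketch of that difference, assuming a scratch
table t3 with some NULL values and fresh statistics (the table and data are
invented for the example; the exact estimates will vary):

CREATE TABLE t3(c0 int);
-- half of the rows are NULL
INSERT INTO t3
SELECT CASE WHEN i % 2 = 0 THEN i END
FROM generate_series(1, 1000) AS i;
ANALYZE t3;

-- top level: rewritten to "c0 IS NOT NULL" and estimated from the
-- null fraction, so roughly half the table
EXPLAIN SELECT c0 FROM t3 WHERE c0 = c0;

-- inside an OR: stays "c0 = c0" and is estimated like an ordinary
-- equality condition, giving a much smaller row count
EXPLAIN SELECT c0 FROM t3 WHERE (c0 = c0) OR (c0 > 1000000);
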
On 1/10/2024 18:43, Tom Lane wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
>> Given predicates A and B, it is expected that
>> size (SELECT X WHERE A) <= size (SELECT X WHERE A OR B)
>> However, `EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0)` returns
>> rows=2537
>
> I don't see any particular bug here. If you look closely at the
> EXPLAIN output, you'll see that "t2.c0 IN (t2.c0)" is transformed
> to "c0 IS NOT NULL" --- but only if it's at top level. So we're
> estimating selectivities for two quite different conditions in
> this example.
>
> The NOT NULL bit happens because a top-level equality clause
> is transformed into an "EquivalenceClass", and then when we
> notice the class has only one member, we prefer to spit out
> "x IS NOT NULL" rather than "x = x". That has the same effect
> (at top level of WHERE, anyway) and tends to be estimated
> more accurately.

I think their question was about why 'x IN (x)' transforms differently
at the top level and inside the OR clause. It is a pretty typical
question.

-- 
regards, Andrei Lepikhov

On 1/10/2024 18:43, Tom Lane wrote:
> In any case, in this toy example that lacks an ANALYZE step,
> the selectivity estimates are mostly going to be garbage.
Thanks for the replies. I'm just checking whether what I'm seeing here
is a bug. Even with ANALYZE, the first EXPLAIN estimates more rows
compared to the second, even though the second WHERE clause is
less restrictive.
ANALYZE;
ANALYZE
EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0);
QUERY PLAN
--------------------------------------------------
Seq Scan on t2 (cost=0.00..1.02 rows=2 width=4)
Filter: (c0 IS NOT NULL)
(2 rows)
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 > 4);
QUERY PLAN
--------------------------------------------------
Seq Scan on t2 (cost=0.00..1.03 rows=1 width=4)
Filter: ((c0 = c0) OR (c0 > 4))
(2 rows)
DROP DATABASE IF EXISTS database4;
CREATE DATABASE database4 WITH ENCODING 'UTF8' TEMPLATE template0;
\c database4
CREATE TABLE t2(c0 int);
INSERT INTO t2(c0) VALUES(1);
INSERT INTO t2(c0) VALUES(2);
ANALYZE;
EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0);
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 > 4);
Regards,
Ming Wei
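(A rough sketch of where those two estimates plausibly come from, assuming
the usual selectivity arithmetic; the numbers below are a reconstruction,
not taken from the thread. "c0 IS NOT NULL" is estimated as
reltuples * (1 - null_frac) = 2 * 1.0 = 2 rows, while the untransformed
"c0 = c0" inside the OR is estimated like an ordinary equality, roughly
1/ndistinct = 0.5, and "c0 > 4" contributes almost nothing, so the OR works
out to about 2 * 0.5 = 1 row.)
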
On 1/10/2024 09:47, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      18643
> Logged by:          ming wei tan
> Email address:      mingwei.tc@gmail.com
> PostgreSQL version: 12.20
> Operating system:   Debian 12.20-1.pgdg120+1 (Docker)
> Description:
>
> Given predicates A and B, it is expected that
> size (SELECT X WHERE A) <= size (SELECT X WHERE A OR B)
>
> However, `EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0)` returns
> rows=2537
>                       QUERY PLAN
> ------------------------------------------------------
>  Seq Scan on t2  (cost=0.00..35.50 rows=2537 width=4)
>    Filter: (c0 IS NOT NULL)
>
> While `EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 >
> 4)` returns rows=858
>                      QUERY PLAN
> -----------------------------------------------------
>  Seq Scan on t2  (cost=0.00..48.25 rows=858 width=4)
>    Filter: ((c0 = c0) OR (c0 > 4))

I found a link that is relevant to your case:
https://www.postgresql.org/message-id/flat/CAMjNa7cC4X9YR-vAJS-jSYCajhRDvJQnN7m2sLH1wLh-_Z2bsw%40mail.gmail.com
There you can find an answer to your question. In general, it is too
expensive to test each clause (especially a non-top-level one) for
possible evaluation into a constant expression.

-- 
regards, Andrei Lepikhov

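A small illustration of that cost distinction, using the reporter's toy
table (the expected plans are sketched from the behaviour described in this
thread, not copied from real output): a genuinely constant sub-clause is
folded anywhere in the expression tree, because that only requires
evaluating an immutable operator over constants, whereas "c0 = c0" would
need the extra checks discussed later in the thread.

-- the constant comparison is simplified even inside the OR,
-- so the filter should disappear entirely
EXPLAIN SELECT t2.c0 FROM t2 WHERE (1 = 1) OR (t2.c0 > 4);

-- the column-to-itself comparison is left alone inside the OR
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 = t2.c0) OR (t2.c0 > 4);
--   Filter: ((c0 = c0) OR (c0 > 4))
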
On Thu, 3 Oct 2024 at 07:17, ming wei tan <mingwei.tc@gmail.com> wrote:
>
> On 1/10/2024 18:43, Tom Lane wrote:
> > In any case, in this toy example that lacks an ANALYZE step,
> > the selectivity estimates are mostly going to be garbage.
>
> Thanks for the replies. I'm just checking whether what I'm seeing here
> is a bug. Even with ANALYZE, the first EXPLAIN estimates more rows
> compared to the second, even though the second WHERE clause is
> less restrictive.

I think you already checked that, and Tom answered with the reason
that this happens.

We certainly could do better here but, as Tom mentioned, your example
does not seem convincing enough to warrant much effort.

You should read the commit message in [1], as that might help you
understand the project's point of view on this sort of optimisation.
See in particular the first sentence of the second paragraph.

David

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=8ec5429e2f422f4d570d4909507db0d4ca83bbac

On 10/3/24 03:57, David Rowley wrote:
> On Thu, 3 Oct 2024 at 07:17, ming wei tan <mingwei.tc@gmail.com> wrote:
>>
>> On 1/10/2024 18:43, Tom Lane wrote:
>>> In any case, in this toy example that lacks an ANALYZE step,
>>> the selectivity estimates are mostly going to be garbage.
>>
>> Thanks for the replies. I'm just checking whether what I'm seeing here
>> is a bug. Even with ANALYZE, the first EXPLAIN estimates more rows
>> compared to the second, even though the second WHERE clause is
>> less restrictive.
>
> I think you already checked that, and Tom answered with the reason
> that this happens.
>
> We certainly could do better here but, as Tom mentioned, your example
> does not seem convincing enough to warrant much effort.
>
> You should read the commit message in [1], as that might help you
> understand the project's point of view on this sort of optimisation.
> See in particular the first sentence of the second paragraph.

I can agree with the underlying reasoning. But the presence of prepqual.c
makes it less clear-cut: the optimiser already attempts to remove
duplicated OR arms in the find_duplicate_ors function, and it does so in
a somewhat surprising way:

explain SELECT oid,relname FROM pg_class WHERE oid=1 OR oid=1;
 Index Scan using pg_class_oid_index on pg_class
   Index Cond: (oid = '1'::oid)

explain SELECT oid,relname FROM pg_class WHERE oid=1 OR oid=1 OR oid=2;
 Seq Scan on pg_class
   Filter: ((oid = '1'::oid) OR (oid = '1'::oid) OR (oid = 2))

So we already walk through the OR clauses. Why not check for semi-equal
clauses there and remove duplicates even when not all of the clauses are
such duplicates? At the very least, the current behaviour continually
raises users' questions.

-- 
regards, Andrei Lepikhov

Andrei Lepikhov <lepihov@gmail.com> writes:
> So we already walk through the OR clauses. Why not check for semi-equal
> clauses there and remove duplicates even when not all of the clauses are
> such duplicates? At the very least, the current behaviour continually
> raises users' questions.

It's difficult to justify spending extra planner cycles to optimize
what are fundamentally stupidly-written queries. Who writes "X=X"
in the first place? (Other than ORM authors who need to spend some
time in a re-education camp.) And it would not be a trivial number
of extra cycles, either. As pointed out in the commit message
David mentioned, it's basically free to make this improvement
when we're looking at a potential EquivalenceClass clause.
We've already paid the cost of checking that the operator is a btree
equality operator, and we know that the clause is at top level of
WHERE (else we couldn't fuzz over the difference between false and
null results), and besides we have to check whether it's "X=X" because
not doing so causes some semantic problems for the EquivalenceClass
machinery. In the case of a random clause-underneath-OR, we would
have to make a brand new check whether it's btree equality, and we
would have to somehow track whether we had descended to an expression
level where "false and null are known equivalent" is no longer true.
So I really doubt that a case can be made that that is worth doing.

			regards, tom lane

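For what it's worth, a sketch of a query-side workaround, under the
assumption that the redundant branch is really meant as a NULL test:
spelling it as IS NOT NULL explicitly sends both forms through the same
estimation path, and the OR estimate can then no longer come out below the
single-branch estimate (an OR's selectivity is computed as s1 + s2 - s1*s2,
which is never less than s1):

EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IS NOT NULL;
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IS NOT NULL) OR (t2.c0 > 4);
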
On 10/3/24 11:40, Tom Lane wrote:
> Andrei Lepikhov <lepihov@gmail.com> writes:
>> So we already walk through the OR clauses. Why not check for semi-equal
>> clauses there and remove duplicates even when not all of the clauses are
>> such duplicates? At the very least, the current behaviour continually
>> raises users' questions.
>
> It's difficult to justify spending extra planner cycles to optimize
> what are fundamentally stupidly-written queries. Who writes "X=X"
> in the first place? (Other than ORM authors who need to spend some
> time in a re-education camp.)

I don't dispute those arguments as such. My point is that ORMs and other
query generators (I sometimes see data analysts using an AI to query a
database) are already essential to development, at least in startups. It
may not be appropriate to handle such cases in the core, but it is too
costly to do it in an extension right now. In that sense 'silicon' users
aren't on the same footing as the smarter carbon-based ones. Here I see
two options:
1. A canonicalize_qual_hook - the bluntest approach. Of course, with this
approach we would need more hooks in the near future.
2. An 'ORM' GUC that enables multiple optimisations in the core,
'smoothing' over users' mistakes.
It would be great to discuss other options.

> And it would not be a trivial number
> of extra cycles, either. As pointed out in the commit message
> David mentioned, it's basically free to make this improvement
> when we're looking at a potential EquivalenceClass clause.
> We've already paid the cost of checking that the operator is a btree
> equality operator, and we know that the clause is at top level of
> WHERE (else we couldn't fuzz over the difference between false and
> null results), and besides we have to check whether it's "X=X" because
> not doing so causes some semantic problems for the EquivalenceClass
> machinery. In the case of a random clause-underneath-OR, we would
> have to make a brand new check whether it's btree equality, and we
> would have to somehow track whether we had descended to an expression
> level where "false and null are known equivalent" is no longer true.
> So I really doubt that a case can be made that that is worth doing.

Thanks, this explanation is quite valuable to me.

-- 
regards, Andrei Lepikhov