Thread: Suspicious Estimated Number of Returned Rows

Suspicious Estimated Number of Returned Rows

From
Ba Jinsheng
Date:

Hi everyone,

 

Consider the two queries below:

 

```sql

CREATE TABLE t0(c0 INT);

EXPLAIN SELECT * FROM t0 WHERE t0.c0 IN (t0.c0); --2537

                      QUERY PLAN                     

------------------------------------------------------

Seq Scan on t0  (cost=0.00..35.50 rows=2537 width=4)

   Filter: (c0 IS NOT NULL)

(2 rows)

 

EXPLAIN SELECT * FROM t0 WHERE (t0.c0 IN (t0.c0))OR(t0.c0 > 0); --858

                     QUERY PLAN                     

-----------------------------------------------------

Seq Scan on t0  (cost=0.00..48.25 rows=858 width=4)

   Filter: ((c0 = c0) OR (c0 > 0))

(2 rows)

```

 

Based on my understanding, the second query should logically return a superset of the results obtained from the first query. However, when examining the query plans, I noticed that the estimated number of rows returned in the second query is significantly lower than that in the first query (2537 vs. 858, respectively). This issue can also be reproduced on a table with data after executing `ANALYZE`, but with a smaller estimate. I think this is because in the first query, `t0.c0 IN (t0.c0)` is optimized to a `NULL` check, while the optimization is not performed in the subexpression of the subquery. Is this expected, or is this a missed optimization that could be addressed?

 

I used this version of PostgreSQL:

Commit: 503b0556d96f2c8df6ed91c5a8cf11b23f37ce6d

 

 

 

 

 

Best regards,

Jinsheng Ba