The following bug has been logged on the website:
Bug reference: 17545
Logged by: Roope Salmi
Email address: rpsalmi@gmail.com
PostgreSQL version: 14.4
Operating system: Ubuntu 22.04
Description:
Hi,
It appears that selectivity for "IS NOT DISTINCT FROM" clauses regard the
proportion of NULLs in the same way as "=", leading to polar opposite row
count
estimates.
In practice this causes nested loop joins to be used when that is not
appropriate. Not sure if this is a known issue, but I was unable to find
any
previous mentions of it.
Here I create a table with one column and 1000 rows of NULL. A cartesian
product
with "WHERE x.a = y.a" correctly estimates that there are around zero
matching
rows. "x.a IS NOT DISTINCT FROM y.a" incorrectly gives the same estimate,
whereas "x.a = y.a OR (x.a IS NULL AND y.a IS NULL)", which should be
equivalent, gives the correct 1000000.
postgres=# CREATE TABLE test(a INTEGER);
CREATE TABLE
postgres=# INSERT INTO test(a) SELECT NULL FROM generate_series(1, 1000);
INSERT 0 1000
postgres=# ANALYZE test;
ANALYZE
postgres=# EXPLAIN SELECT FROM test x, test y WHERE x.a = y.a;
QUERY PLAN
----------------------------------------------------------------------
Merge Join (cost=127.66..137.67 rows=1 width=0)
Merge Cond: (x.a = y.a)
-> Sort (cost=63.83..66.33 rows=1000 width=4)
Sort Key: x.a
-> Seq Scan on test x (cost=0.00..14.00 rows=1000 width=4)
-> Sort (cost=63.83..66.33 rows=1000 width=4)
Sort Key: y.a
-> Seq Scan on test y (cost=0.00..14.00 rows=1000 width=4)
(8 rows)
postgres=# EXPLAIN SELECT FROM test x, test y
postgres=# WHERE x.a IS NOT DISTINCT FROM y.a;
QUERY PLAN
----------------------------------------------------------------------
Nested Loop (cost=0.00..15030.50 rows=1 width=0)
Join Filter: (NOT (x.a IS DISTINCT FROM y.a))
-> Seq Scan on test x (cost=0.00..14.00 rows=1000 width=4)
-> Materialize (cost=0.00..19.00 rows=1000 width=4)
-> Seq Scan on test y (cost=0.00..14.00 rows=1000 width=4)
(5 rows)
postgres=# EXPLAIN SELECT FROM test x, test y
postgres-# WHERE x.a = y.a OR (x.a IS NULL AND y.a IS NULL);
QUERY PLAN
----------------------------------------------------------------------
Nested Loop (cost=0.00..15030.50 rows=1000000 width=0)
Join Filter: ((x.a = y.a) OR ((x.a IS NULL) AND (y.a IS NULL)))
-> Seq Scan on test x (cost=0.00..14.00 rows=1000 width=4)
-> Materialize (cost=0.00..19.00 rows=1000 width=4)
-> Seq Scan on test y (cost=0.00..14.00 rows=1000 width=4)
(5 rows)