Thread: BUG #17545: Incorrect selectivity for IS NOT DISTINCT FROM and NULLs
BUG #17545: Incorrect selectivity for IS NOT DISTINCT FROM and NULLs
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 17545 Logged by: Roope Salmi Email address: rpsalmi@gmail.com PostgreSQL version: 14.4 Operating system: Ubuntu 22.04 Description: Hi, It appears that selectivity for "IS NOT DISTINCT FROM" clauses regard the proportion of NULLs in the same way as "=", leading to polar opposite row count estimates. In practice this causes nested loop joins to be used when that is not appropriate. Not sure if this is a known issue, but I was unable to find any previous mentions of it. Here I create a table with one column and 1000 rows of NULL. A cartesian product with "WHERE x.a = y.a" correctly estimates that there are around zero matching rows. "x.a IS NOT DISTINCT FROM y.a" incorrectly gives the same estimate, whereas "x.a = y.a OR (x.a IS NULL AND y.a IS NULL)", which should be equivalent, gives the correct 1000000. postgres=# CREATE TABLE test(a INTEGER); CREATE TABLE postgres=# INSERT INTO test(a) SELECT NULL FROM generate_series(1, 1000); INSERT 0 1000 postgres=# ANALYZE test; ANALYZE postgres=# EXPLAIN SELECT FROM test x, test y WHERE x.a = y.a; QUERY PLAN ---------------------------------------------------------------------- Merge Join (cost=127.66..137.67 rows=1 width=0) Merge Cond: (x.a = y.a) -> Sort (cost=63.83..66.33 rows=1000 width=4) Sort Key: x.a -> Seq Scan on test x (cost=0.00..14.00 rows=1000 width=4) -> Sort (cost=63.83..66.33 rows=1000 width=4) Sort Key: y.a -> Seq Scan on test y (cost=0.00..14.00 rows=1000 width=4) (8 rows) postgres=# EXPLAIN SELECT FROM test x, test y postgres=# WHERE x.a IS NOT DISTINCT FROM y.a; QUERY PLAN ---------------------------------------------------------------------- Nested Loop (cost=0.00..15030.50 rows=1 width=0) Join Filter: (NOT (x.a IS DISTINCT FROM y.a)) -> Seq Scan on test x (cost=0.00..14.00 rows=1000 width=4) -> Materialize (cost=0.00..19.00 rows=1000 width=4) -> Seq Scan on test y (cost=0.00..14.00 rows=1000 width=4) (5 rows) postgres=# EXPLAIN SELECT FROM test x, test y postgres-# WHERE x.a = y.a OR (x.a IS NULL AND y.a IS NULL); QUERY PLAN ---------------------------------------------------------------------- Nested Loop (cost=0.00..15030.50 rows=1000000 width=0) Join Filter: ((x.a = y.a) OR ((x.a IS NULL) AND (y.a IS NULL))) -> Seq Scan on test x (cost=0.00..14.00 rows=1000 width=4) -> Materialize (cost=0.00..19.00 rows=1000 width=4) -> Seq Scan on test y (cost=0.00..14.00 rows=1000 width=4) (5 rows)
On Mon, Jul 11, 2022 at 10:36 PM PG Bug reporting form <noreply@postgresql.org> wrote:
Hi,
It appears that selectivity for "IS NOT DISTINCT FROM" clauses regard the
proportion of NULLs in the same way as "=", leading to polar opposite row
count
estimates.
In practice this causes nested loop joins to be used when that is not
appropriate. Not sure if this is a known issue, but I was unable to find
any
previous mentions of it.
# EXPLAIN SELECT FROM test x, test y WHERE x.a IS DISTINCT FROM y.a;
QUERY PLAN
----------------------------------------------------------------------
Nested Loop (cost=0.00..15030.50 rows=1000000 width=0)
Join Filter: (x.a IS DISTINCT FROM y.a)
-> Seq Scan on test x (cost=0.00..14.00 rows=1000 width=4)
-> Materialize (cost=0.00..19.00 rows=1000 width=4)
-> Seq Scan on test y (cost=0.00..14.00 rows=1000 width=4)
(5 rows)
DistinctExpr has the same representation as OpExpr with operator "=". So
we estimate its selectivity just as if it is 'x.a = y.a', and then
invert the result.
For OpExpr 'x.a = y.a', because no MCV lists are available, we estimate
its selectivity as MIN(1/nd1,1/nd2)*(1-nullfrac1)*(1-nullfrac2). Note
that both nullfrac1 and nullfrac2 are 100% in the case, so the
selectivity is calculated as zero, and the inverted result is 1, which
is for DistinctExpr.
The comment in the codes has explained this behavior, as it says:
/*
* DistinctExpr has the same representation as OpExpr, but the
* contained operator is "=" not "<>", so we must negate the result.
* This estimation method doesn't give the right behavior for nulls,
* but it's better than doing nothing.
*/
if (IsA(clause, DistinctExpr))
s1 = 1.0 - s1;
So maybe this is a known issue?
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes: > The comment in the codes has explained this behavior, as it says: > * This estimation method doesn't give the right behavior for nulls, > * but it's better than doing nothing. > So maybe this is a known issue? It's a nobodys-bothered-to-make-it-better issue. In general, PG's treatment of IS [NOT] DISTINCT FROM is pretty lame --- for example, you might wish that IS NOT DISTINCT FROM could use an index in the way that "=" can, but that's never been made to happen. regards, tom lane