I noticed an issue in a simple query with WHERE NOT IN (SELECT ...). I am
aware that anti-joins with NOT IN are currently not optimized and should be
rewritten as WHERE NOT EXISTS (SELECT ...), so if this is irrelevant please
just ignore it.
Here is a setup that works:
CREATE TABLE a
(
a_id serial NOT NULL,
PRIMARY KEY (a_id)
);
CREATE TABLE b
(
b_id serial NOT NULL,
a_id int NOT NULL,
PRIMARY KEY (b_id)
);
INSERT INTO a(a_id) SELECT generate_series(1, 20000);
INSERT INTO b(b_id, a_id) SELECT generate_series(1, 500000), floor(random()
* 22000 + 1)::int;
ANALYZE a;
ANALYZE b;
EXPLAIN SELECT count(*) FROM b WHERE a_id NOT IN (SELECT a_id FROM a);
Finalize Aggregate (cost=7596.23..7596.24 rows=1 width=8)
-> Gather (cost=7596.12..7596.23 rows=1 width=8)
Workers Planned: 1
-> Partial Aggregate (cost=6596.12..6596.13 rows=1 width=8)
-> Parallel Seq Scan on b (cost=339.00..6228.47 rows=147059
width=0)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on a (cost=0.00..289.00 rows=20000
width=4)
Fiddle:
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=497ab1d5eec6e02d4d1c0f6630b6f1
f1
Now if you change
INSERT INTO a(a_id) SELECT generate_series(1, 20000);
to
INSERT INTO a(a_id) SELECT generate_series(1, 200000);
i.e. add a zero, the plan becomes this:
Finalize Aggregate (cost=759860198.41..759860198.42 rows=1 width=8)
-> Gather (cost=759860198.29..759860198.40 rows=1 width=8)
Workers Planned: 1
-> Partial Aggregate (cost=759859198.29..759859198.30 rows=1
width=8)
-> Parallel Seq Scan on b (cost=0.00..759858830.65
rows=147059 width=0)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..4667.00 rows=200000
width=4)
-> Seq Scan on a (cost=0.00..2885.00
rows=200000 width=4)
Fiddle:
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=bec018196195635cb6ec05ccae3213
7c