Re: Convert NOT IN sublinks to anti-joins when safe - Mailing list pgsql-hackers
| From | David Geier |
|---|---|
| Subject | Re: Convert NOT IN sublinks to anti-joins when safe |
| Date | |
| Msg-id | 2df5912f-8888-4f27-9384-4c69ba22105c@gmail.com Whole thread Raw |
| In response to | Re: Convert NOT IN sublinks to anti-joins when safe (Richard Guo <guofenglinux@gmail.com>) |
| Responses |
Re: Convert NOT IN sublinks to anti-joins when safe
|
| List | pgsql-hackers |
Hi! On 04.02.2026 10:47, Richard Guo wrote: > On Tue, Feb 3, 2026 at 4:12 PM Richard Guo <guofenglinux@gmail.com> wrote: >> This topic has been discussed several times in the past. Due to the >> semantic mismatch regarding NULL handling, NOT IN is not ordinarily >> equivalent to an anti-join. However, if we can prove that neither the >> outer expressions nor the subquery outputs can yield NULL values, it >> should be safe to convert NOT IN to an anti-join. I used to play around with query rewrites of the same form, turning NOT IN into NOT EXISTS. As you rightfully pointed out, the straight forward rewrite only works if neither the outer expression nor the sub-query cannot yield NULLs, limiting the optimization considerably. Both cases can be fixed by amending the basic form of the rewrite. Basic form: SELECT t1.c1 FROM t1 WHERE t1.c1 NOT IN (SELECT t2.c1 FROM t2) => SELECT t1.c1 FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1) If the sub-select can yield NULLs, the rewrite can be fixed by adding an OR t2.c1 IS NULL clause, such as: SELECT t1.c1 FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1 OR t2.c1 IS NULL) which is equivalent to the following SQL which avoids the OR: SELECT t1.c1 FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1) AND NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c1 IS NULL) If the outer expression can yield NULLs, the rewrite can be fixed by adding a t1.c1 IS NOT NULL clause, such as: SELECT t1.c1 FROM T1 WHERE t1.c1 IS NOT NULL AND NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1) Both fixes can of course be combined yielding: SELECT t1.c1 FROM T1 WHERE t1.c1 IS NOT NULL AND NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1) AND NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c1 IS NULL) What's our today's take on doing more involved transformations inside the planner to support such cases? It would greatly open up the scope of the optimization. > I've noticed a loose end in the v1 patch. > > The semantic gap between NOT IN and anti-join actually exists whenever > the operator returns NULL. For NOT IN, if (A op B) returns NULL, then > NOT (NULL) evaluates to NULL (effectively false), and the row is > discarded. In contrast, for an anti-join, if (A op B) returns NULL, > it implies no match was found, and the anti-join logic dictates that > the row should be kept. > > To guarantee that (A op B) never returns NULL, the current patch > verifies that both A and B are non-nullable. However, this is not > sufficient. The "op" might be an operator that returns NULL on > non-null inputs. > > On the other hand, if "op" does not return NULL on NULL inputs, like > IS DISTINCT FROM, we technically would not even need to require that A > and B are non-nullable. > > Is there a convenient way to verify that an operator never returns > NULL on non-null inputs? Would it be sufficient to insist that the > operator belongs to btree opclass (assuming that the strict ordering > requirements of btree imply this safety)? There's lots of code that e.g. calls FunctionCall2[Coll]() on operators, where that function raises an ERROR in case the returned value is NULL. Hence, I would say it's safe to make that assumption. Otherwise, you could restrict the check to only builtin operators. For those we know they have the expected semantics. -- David Geier
pgsql-hackers by date: