Re: IN vs EXISTS equivalence - Mailing list pgsql-hackers

From Tom Lane
Subject Re: IN vs EXISTS equivalence
Date
Msg-id 17519.1218227037@sss.pgh.pa.us
Whole thread Raw
In response to Re: IN vs EXISTS equivalence  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses Re: IN vs EXISTS equivalence  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Re: IN vs EXISTS equivalence  (Decibel! <decibel@decibel.org>)
Re: IN vs EXISTS equivalence  (Simon Riggs <simon@2ndquadrant.com>)
List pgsql-hackers
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>>> I'm adding some NOT EXISTS examples to the thread for completeness
>>> of what someone might want to address while working on it.  For two
>>> queries which can easily be shown (to a human viewer, anyway) to
>>> return identical results, I see performance differences of over
>>> five orders of magnitude.

I've been looking at this a bit and have an action plan worked out.

I believe that the optimizable cases for EXISTS are those where the
EXISTS() is either at the top level of WHERE, or just underneath a NOT,
and the contained subselect:

* has no set operations (UNION etc), grouping, set-returning functions
in the SELECT list, LIMIT, or a few other funny cases

* references outer-level variables only in its WHERE clause.

Given these conditions, an EXISTS is equivalent to a standard semijoin
between the outer relations used in its WHERE clause and the sub-select
with the WHERE removed, where the join condition is just the WHERE
clause.  (A semijoin returns one copy of each LHS row for which there
exists at least one RHS row satisfying the join condition.)

Similarly, a NOT EXISTS is equivalent to a standard anti-semijoin.
(An anti-semijoin returns one copy of each LHS row for which there
exists no RHS row satisfying the join condition.)

So what I think we should do is implement JOIN_SEMI and JOIN_ANTI
as variant outer-join types in the planner and executor, and convert
EXISTS into those.  JOIN_SEMI would replace the existing JOIN_IN support.
(It's effectively the same thing so far as the executor is concerned,
but there are some places in the planner that assume an IN join condition
consists of one or more btree equality operators.  I'm envisioning folding
the current planner support for IN into the more general outer-join
planning infrastructure, and fixing it so that it doesn't assume the
join clause must be of that form.)

Explicit support for JOIN_ANTI is probably not strictly necessary:
we could represent it using the "LEFT JOIN WHERE rhs-variable IS NULL"
hack.  However this only works if the join's ON-condition can be proved
strict for at least one RHS variable, which is an assumption I'd rather
not require for optimizing EXISTS.  Also, the planner should be able to
make much better estimates of the size of an antijoin result if it has an
explicit concept that that's what's happening.  (The existing kluge in
nulltestsel() is not only ugly but has little hope of ever giving an
accurate estimate.)  So I'd prefer to go the other way: make the planner
recognize the IS NULL trick and remove the IS NULL clause in favor of
marking the LEFT JOIN as being really an antijoin.

As far as IN goes: IN at the top level of WHERE can still be optimized
to a semijoin under the same conditions as now.  NOT IN is a lot trickier,
for the same reason that typically trips up novices who try to use it:
if any row of the subselect produces a NULL comparison result, then it is
impossible for the NOT IN to result in TRUE, which means that it does not
function as a standard antijoin.  I thought about optimizing it only in
the case where we can prove that the subselect outputs and the compared-to
values are known NOT NULL (which in typical cases we could prove by
looking for NOT NULL constraints on those table columns).  The trouble
with this is that that's not a sufficient condition: you must also assume
that the comparison operator involved never yields NULL for non-null
inputs.  That might be okay for btree comparison functions but it's not a
very comfy assumption in general; we certainly haven't got any explicit
knowledge that any functions are guaranteed to act that way.  So this
case might be worth doing later but I'm not feeling excited about it.
We generally tell people to avoid NOT IN and I'm happy to keep on
saying that.

Comments?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Replay attack of query cancel
Next
From: Tom Lane
Date:
Subject: Re: Replay attack of query cancel