Re: IN vs EXISTS equivalence - Mailing list pgsql-hackers

From Tom Lane
Subject Re: IN vs EXISTS equivalence
Date
Msg-id 6531.1218473967@sss.pgh.pa.us
Whole thread Raw
In response to Re: IN vs EXISTS equivalence  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List pgsql-hackers
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
>> I believe that the optimizable cases for EXISTS are those where the
>> EXISTS() is either at the top level of WHERE, or just underneath a
>> NOT,
> The rest of the plan makes sense to me, but this part seems narrow. 
> There's probably a good reason for that which is beyond my depth, but
> attached is a view that is used for calculating statistics for a
> database which is primarily used for "case management" purposes.  If
> EXISTS could also be optimized in the contexts used there, it would be
> great.
I chewed on that for awhile.  We can certainly optimize EXISTS that's
appearing in the ON-condition of a regular JOIN, because that's not
really semantically different from a WHERE-condition.  But I don't think
it's going to be reasonable to improve EXISTS in outer-JOIN ON
conditions.  There are a couple of problems.  Consider

t1 LEFT JOIN t2 ON (t1.f1 = t2.f2 AND     EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx AND t3.f4 = t2.fy))

To implement this with the correct semantics, we'd have to have the
t1/t2 join and the t1/t2/t3 join going on in the same execution node,
with two different join behaviors (LEFT and SEMI).  There isn't any
way AFAICS to factor it into two separate steps.  That's unreasonably
complicated, and it's not clear that you'd get any better performance
anyway than the current implementation (which'd treat the EXISTS as
a subplan).

Even worse, let the EXISTS be a degenerate case:

t1 LEFT JOIN t2 ON (t1.f1 = t2.f2 AND     EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx));

We can't actually treat this EXISTS as a semijoin between t1 and t3,
either before or after the LEFT JOIN; because then the behavior would be
to drop t1 rows that have no t3 match, which is not what this query
specifies.

(Note: the other degenerate case, where the EXISTS depends only on
t2, *could* be optimized since we could just let the semijoin be
performed within the RHS of the LEFT JOIN.)

So this is not something I'm going to tackle; at least not this
devel cycle.

One small step we can take in this direction, though, is to improve the
planner's internal handling of the qual conditions for IN and EXISTS.
Right now the process is just to throw the sub-select into the main
range table and put the IN join conditions into the same place in WHERE
that the IN-clause was to start with.  The trouble with this is that the
distribute_quals_to_rels processing has no idea that there's anything
special about the IN join conditions.  We got away with that for the
limited case of IN clauses at the top level of WHERE, but it's become
clear to me over the weekend that this has no hope of working for NOT
EXISTS --- since that's effectively an outer join, it absolutely has to
have the same kinds of qual-scheduling constraints as ordinary outer
joins do.  So we need a data structure that distribute_quals_to_rels can
work with.  What I think needs to happen is that the initial pass that
pulls up optimizable IN/EXISTS sub-selects should not merge the
SubLink's replacement qual clauses seamlessly, but put them underneath a
new node type, say "FlattenedSubLink", that retains knowledge of the
join it's representing.  The FlattenedSubLink would survive only as far
as distribute_quals_to_rels, which would distribute out the contained
qual conditions instead of the FlattenedSubLink itself --- but only
after marking them properly for the outer-join restrictions.  This
representation would make it feasible to work with IN/EXISTS that are
inside JOIN ON conditions, which the present representation using a
single in_info_list really can't do.  The semantic issues are still
there but at least the representation isn't getting in the way ...
        regards, tom lane


pgsql-hackers by date:

Previous
From: Zdenek Kotala
Date:
Subject: WIP: New Page API
Next
From: "David E. Wheeler"
Date:
Subject: Re: Type Categories for User-Defined Types