Re: NOT IN subquery optimization - Mailing list pgsql-hackers

From David Rowley
Subject Re: NOT IN subquery optimization
Date
Msg-id CAKJS1f9Ybn6BH0H7s9GsqRnfauKe7nmeLgMz5=YbqZ1LWnChtA@mail.gmail.com
Whole thread Raw
In response to Re: NOT IN subquery optimization  (Jim Finnerty <jfinnert@amazon.com>)
List pgsql-hackers
On Sun, 3 Mar 2019 at 01:34, Jim Finnerty <jfinnert@amazon.com> wrote:
> I agree with Tom's comment above - when the cost of the NOT IN is dominated
> by the cost of the outer scan (i.e. when the cardinality of the outer
> relation is large relative to the cardinality of the subquery), and if the
> inner cardinality is small enough to fit in memory, then the current
> implementation does an efficient hash lookup into an in-memory structure,
> and that's a very fast way to do the NOT IN.  It almost achieves the
> lower-bound cost of scanning the outer relation.  It can also parallelizes
> easily, whether or not we currently can do that.  In these cases, the
> current plan is the preferred plan, and we should keep it.

If you do the conversion to anti-join (without hacking at the join
quals and assuming no nulls are possible), then the planner can decide
what's best.  The planner may choose a hash join which is not hugely
different from a hashed subplan, however from the testing I've done
the Hash Join is a bit faster. I imagine there's been more motivation
over the years to optimise that over hashed subplans.  As far as I
know, there's no parallel query support for hashed subplans, but I
know there is for hash joins.  In short, I don't think it makes any
sense to not translate into an anti-join (when possible). I think the
best anti-join plan will always be a win over the subquery. The
planner could make a mistake of course, but that's a different issue.
We certainly don't consider keeping the subquery around for NOT
EXISTS.

> This is a case that we would want to avoid the transform, because when both
> the inner and outer are nullable and the outer is large and the inner is
> small, the transformed plan would Scan and Materialize the inner for each
> row of the outer row, which is very slow compared to the untransformed plan:
>
> slow case for the transformation: https://explain.depesz.com/s/0CBB

Well, that's because you're forcing the planner into a corner in
regards to the join condition. It has no choice but to nested loop
that join.

> However, if the inner is too large to fit into memory, then the transformed
> plan is faster on all of our other test cases, although our test cases are
> far from complete.  If the current solution supports parallel scan of the
> outer, for example, then PQ could have lower elapsed time than the non-PQ
> nested loop solution.

I'm having a little trouble understanding this.  From what I
understand the code adds an OR .. IS NULL clause to the join
condition. Is this still the case with what you've been testing here?
If so, I'm surprised to hear all your test cases are faster. If
there's an OR clause in the join condition then the planner has no
choice but to use a nested loop join, so it's very surprising that you
would find that faster with larger data sets.

Or does the code your testing implement this a different way? Perhaps
with some execution level support?

> Also, remember that the issue with the empty inner was just a bug that was
> the result of trying to do an additional optimization in the case where
> there is no WHERE clause in the subquery.  That bug has been fixed.  The
> general case transformation described in the base note produces the correct
> result in all cases, including the empty subquery case.

I'm not sure why lack of WHERE clause in the subquery counts for
anything here.  The results set from the subquery can be empty or not
empty with or without a WHERE clause.  The only way you'll know it's
empty during planning is if some gating qual says so, but that's yet
to be determined by the time the transformation should be done.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pgsql-hackers by date:

Previous
From: Etsuro Fujita
Date:
Subject: Re: Problems with plan estimates in postgres_fdw
Next
From: Etsuro Fujita
Date:
Subject: Re: Problems with plan estimates in postgres_fdw