Planning for improved versions of IN/NOT IN - Mailing list pgsql-hackers

From Tom Lane
Subject Planning for improved versions of IN/NOT IN
Date
Msg-id 29128.1038441391@sss.pgh.pa.us
Whole thread Raw
Responses Re: Planning for improved versions of IN/NOT IN
List pgsql-hackers
I've been thinking about how to improve the performance of queries using
"WHERE x IN (subselect)" and "WHERE x NOT IN (subselect)".

In the existing implementation, the subquery result is rescanned to look
for a match to x each time the WHERE clause is executed; this essentially
makes it work like a nestloop join of the stupidest variety.  (We do
stick a Materialize node atop the subselect if it looks complicated, but
that's not a big help in typical cases.)

I've thought of three alternative implementations that would perform
better in various scenarios.  Each would be relatively simple to
implement; the problem I'm having is figuring out how to get the planner
to choose the best one.  The alternatives are basically:

1. Add DISTINCT to the subquery, pull it up into the rangetable (as a subquery
RT entry), and change the = SUBLINK WHERE clause to a simple = against the
subquery output var(s).  Essentially this transformsSELECT ... FROM foo WHERE foo.x IN (SELECT y FROM ...)
toSELECT ... FROM foo, (SELECT DISTINCT y FROM ...) ss       WHERE foo.x = ss.y

This is not useful for NOT IN, and there are a few restrictions even for
IN (no correlation variables in subquery, no LIMIT, maybe others)?
Also I think it would only work correctly for IN appearing at the top
level of WHERE, though that might be too conservative.
The main case where it could be a win is where the subquery is expected to
produce relatively few output rows, so we could run it as the outer side of
some join plan.  In particular, when x is an indexed column in a large
table, fetching x as the inner side of an indexscan nestloop would be much
better than scanning the whole of the outer table.

2. Hash-based implementations: read the subquery once, load its values
into an in-memory hashtable (discarding duplicates), and then probe the
hashtable for each execution of the WHERE clause.  This works for both
IN and NOT IN, though we still need an uncorrelated subquery (else the
hashtable can't be reused from row to row).  This probably wins for a
moderate number of rows in the subquery result (not too many to hash in
memory) and a fairly large number of outer rows (else building the
hashtable is not repaid).

3. Inner indexscan: essentially, automatically do the IN-to-EXISTS
transform that's presently recommended by the FAQ.  This wins if the
subquery result is large but pushing down an equality condition makes
it cheap, and there aren't too many outer rows.

There may also be cases where the existing implementation is still the
best, or perhaps is the only usable one.


The difficulty is that it's not clear how to choose one of these four
ways, short of planning the *entire* query from scratch all four ways :-(.
This seems pretty grim.  Approaches #2 and #3 could be handled as local
transformations of the WHERE clause, but we couldn't choose which to use
very well if we don't yet know how many outer rows the WHERE clause will
be executed for.  Approach #1 is really planning a completely different
query --- one with an extra FROM-item --- and there's not going to be
all that much commonality in the computations, unless we restrict where
the added FROM-item can be joined to the others, which'd more or less
defeat the purpose.

Anyone see a way around this difficulty?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Boolean casting in 7.3 -> changed?
Next
From: David Walker
Date:
Subject: Re: Boolean casting in 7.3 -> changed?