Thread: Planning for improved versions of IN/NOT IN

Planning for improved versions of IN/NOT IN

From

Tom Lane

Date:

27 November 2002, 18:56:32

I've been thinking about how to improve the performance of queries using
"WHERE x IN (subselect)" and "WHERE x NOT IN (subselect)".

In the existing implementation, the subquery result is rescanned to look
for a match to x each time the WHERE clause is executed; this essentially
makes it work like a nestloop join of the stupidest variety. (We do
stick a Materialize node atop the subselect if it looks complicated, but
that's not a big help in typical cases.)

I've thought of three alternative implementations that would perform
better in various scenarios. Each would be relatively simple to
implement; the problem I'm having is figuring out how to get the planner
to choose the best one. The alternatives are basically:

1. Add DISTINCT to the subquery, pull it up into the rangetable (as a subquery
RT entry), and change the = SUBLINK WHERE clause to a simple = against the
subquery output var(s). Essentially this transformsSELECT ... FROM foo WHERE foo.x IN (SELECT y FROM ...)
toSELECT ... FROM foo, (SELECT DISTINCT y FROM ...) ss WHERE foo.x = ss.y

This is not useful for NOT IN, and there are a few restrictions even for
IN (no correlation variables in subquery, no LIMIT, maybe others)?
Also I think it would only work correctly for IN appearing at the top
level of WHERE, though that might be too conservative.
The main case where it could be a win is where the subquery is expected to
produce relatively few output rows, so we could run it as the outer side of
some join plan. In particular, when x is an indexed column in a large
table, fetching x as the inner side of an indexscan nestloop would be much
better than scanning the whole of the outer table.

2. Hash-based implementations: read the subquery once, load its values
into an in-memory hashtable (discarding duplicates), and then probe the
hashtable for each execution of the WHERE clause. This works for both
IN and NOT IN, though we still need an uncorrelated subquery (else the
hashtable can't be reused from row to row). This probably wins for a
moderate number of rows in the subquery result (not too many to hash in
memory) and a fairly large number of outer rows (else building the
hashtable is not repaid).

3. Inner indexscan: essentially, automatically do the IN-to-EXISTS
transform that's presently recommended by the FAQ. This wins if the
subquery result is large but pushing down an equality condition makes
it cheap, and there aren't too many outer rows.

There may also be cases where the existing implementation is still the
best, or perhaps is the only usable one.

The difficulty is that it's not clear how to choose one of these four
ways, short of planning the *entire* query from scratch all four ways :-(.
This seems pretty grim. Approaches #2 and #3 could be handled as local
transformations of the WHERE clause, but we couldn't choose which to use
very well if we don't yet know how many outer rows the WHERE clause will
be executed for. Approach #1 is really planning a completely different
query --- one with an extra FROM-item --- and there's not going to be
all that much commonality in the computations, unless we restrict where
the added FROM-item can be joined to the others, which'd more or less
defeat the purpose.

Anyone see a way around this difficulty?
regards, tom lane

Re: Planning for improved versions of IN/NOT IN

From

"Christopher Kings-Lynne"

Date:

27 November 2002, 21:37:02

> The difficulty is that it's not clear how to choose one of these four
> ways, short of planning the *entire* query from scratch all four ways :-(.
> This seems pretty grim.  Approaches #2 and #3 could be handled as local
> transformations of the WHERE clause, but we couldn't choose which to use
> very well if we don't yet know how many outer rows the WHERE clause will
> be executed for.  Approach #1 is really planning a completely different
> query --- one with an extra FROM-item --- and there's not going to be
> all that much commonality in the computations, unless we restrict where
> the added FROM-item can be joined to the others, which'd more or less
> defeat the purpose.

What about in the case of a scalar subquery eg. SELECT x IN
(1,2,3,4,54,56,6), when there maybe hundreds of scalars?

Chris

Re: Planning for improved versions of IN/NOT IN

From

Tom Lane

Date:

27 November 2002, 22:59:51

"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
> What about in the case of a scalar subquery eg. SELECT x IN
> (1,2,3,4,54,56,6), when there maybe hundreds of scalars?

Unrelated to my present problem.
        regards, tom lane

Re: Planning for improved versions of IN/NOT IN

From

Joe Conway

Date:

29 November 2002, 23:34:19

Tom Lane wrote:
> I've been thinking about how to improve the performance of queries using
> "WHERE x IN (subselect)" and "WHERE x NOT IN (subselect)".
> 
> In the existing implementation, the subquery result is rescanned to look
> for a match to x each time the WHERE clause is executed; this essentially
> makes it work like a nestloop join of the stupidest variety.  (We do
> stick a Materialize node atop the subselect if it looks complicated, but
> that's not a big help in typical cases.)
> 
> I've thought of three alternative implementations that would perform
> better in various scenarios.  Each would be relatively simple to
> implement; the problem I'm having is figuring out how to get the planner
> to choose the best one.  The alternatives are basically:
> 

[abbreviated]
> 1. Add FROM item
> 2. Hash-based
> 3. Inner indexscan
> 4. the existing implementation

> The difficulty is that it's not clear how to choose one of these four
> ways, short of planning the *entire* query from scratch all four ways :-(.
> This seems pretty grim.  Approaches #2 and #3 could be handled as local
> transformations of the WHERE clause, but we couldn't choose which to use
> very well if we don't yet know how many outer rows the WHERE clause will
> be executed for.  Approach #1 is really planning a completely different
> query --- one with an extra FROM-item --- and there's not going to be
> all that much commonality in the computations, unless we restrict where
> the added FROM-item can be joined to the others, which'd more or less
> defeat the purpose.
> 
> Anyone see a way around this difficulty?

How about starting with a rule-based method to make the choice?

1. If uncorrelated: use hash-based approach - ISTM this might address a large   percentage of the problem cases -- it
couldeven handle the   "IN (list-of-scalars)" case. Could it fall back to a   tuplesort/binary-search for the too many
tohash in memory case?
 
2. If correlated: use an inner indexscan
3. If you come up with a pattern where none of the approaches produce a   correct answer, use the existing
implementation

You could always get fancier later if needed, but something along these lines 
would be a great start.

Joe

Re: Planning for improved versions of IN/NOT IN

From

Mike Mascari

Date:

29 November 2002, 23:44:15

Joe Conway wrote:
> Tom Lane wrote:
> 
>> I've been thinking about how to improve the performance of queries using
>> "WHERE x IN (subselect)" and "WHERE x NOT IN (subselect)".
>
> How about starting with a rule-based method to make the choice?
> 
> 1. If uncorrelated: use hash-based approach - ISTM this might address a 
> large
>    percentage of the problem cases -- it could even handle the
>    "IN (list-of-scalars)" case. Could it fall back to a
>    tuplesort/binary-search for the too many to hash in memory case?
> 2. If correlated: use an inner indexscan
> 3. If you come up with a pattern where none of the approaches produce a
>    correct answer, use the existing implementation
> 
> You could always get fancier later if needed, but something along these 
> lines would be a great start.

I curious if any of the rewriting of EXISTS and NOT EXISTS would 
address the problem described by Date:

http://www.firstsql.com/iexist.htm

Mike Mascari
mascarm@mascari.com

Re: Planning for improved versions of IN/NOT IN

From

Tom Lane

Date:

30 November 2002, 00:45:03

Mike Mascari <mascarm@mascari.com> writes:
> I curious if any of the rewriting of EXISTS and NOT EXISTS would 
> address the problem described by Date:

> http://www.firstsql.com/iexist.htm

We are not here to redefine the SQL spec ... and especially not to
eliminate its concept of NULL, which is what Date would really like ;-)

The above-quoted screed is based on a claimed logical equivalence
between NOT EXISTS() and NOT IN() that is just plain wrong when you
consider the possibility of NULLs.  Rather than "FirstSQL correctly
processes this query", you should read "FirstSQL deliberately violates
the SQL spec".  (There may be grounds to argue that the spec behavior
could be improved, but that's an argument to be making to the standards
committee, not here.)
        regards, tom lane

Re: Planning for improved versions of IN/NOT IN

From

Mike Mascari

Date:

30 November 2002, 06:38:11

Tom Lane wrote:
> Mike Mascari <mascarm@mascari.com> writes:
> 
>>I curious if any of the rewriting of EXISTS and NOT EXISTS would 
>>address the problem described by Date:

That should read "I'm curious"...

> 
> 
>>http://www.firstsql.com/iexist.htm
> 
> 
> We are not here to redefine the SQL spec ... and especially not to
> eliminate its concept of NULL, which is what Date would really like ;-)
From what I've read of Date's so far, I think he'd like to junk 
SQL altogether.

> The above-quoted screed is based on a claimed logical equivalence
> between NOT EXISTS() and NOT IN() that is just plain wrong when you
> consider the possibility of NULLs.  Rather than "FirstSQL correctly
> processes this query", you should read "FirstSQL deliberately violates
> the SQL spec".  (There may be grounds to argue that the spec behavior
> could be improved, but that's an argument to be making to the standards
> committee, not here.)

Okay. I knew there was talk in the past that IN be rewritten as 
EXISTS, which is not what you propose doing, but would have 
exposed the odd behavior NOT EXISTS exhibits according to the 
SQL spec. I was also curious to know which path PostgreSQL 
development prefers to take when the SQL spec and the Relational 
Model part ways, as they often do. Maybe someday RedHat will 
have a voting member on the ANSI X3H2/NCITS committee. ;-)

Mike Mascari
mascarm@mascari.com