Planning for improved versions of IN/NOT IN - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Planning for improved versions of IN/NOT IN |
Date | |
Msg-id | 29128.1038441391@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: Planning for improved versions of IN/NOT IN
|
List | pgsql-hackers |
I've been thinking about how to improve the performance of queries using "WHERE x IN (subselect)" and "WHERE x NOT IN (subselect)". In the existing implementation, the subquery result is rescanned to look for a match to x each time the WHERE clause is executed; this essentially makes it work like a nestloop join of the stupidest variety. (We do stick a Materialize node atop the subselect if it looks complicated, but that's not a big help in typical cases.) I've thought of three alternative implementations that would perform better in various scenarios. Each would be relatively simple to implement; the problem I'm having is figuring out how to get the planner to choose the best one. The alternatives are basically: 1. Add DISTINCT to the subquery, pull it up into the rangetable (as a subquery RT entry), and change the = SUBLINK WHERE clause to a simple = against the subquery output var(s). Essentially this transformsSELECT ... FROM foo WHERE foo.x IN (SELECT y FROM ...) toSELECT ... FROM foo, (SELECT DISTINCT y FROM ...) ss WHERE foo.x = ss.y This is not useful for NOT IN, and there are a few restrictions even for IN (no correlation variables in subquery, no LIMIT, maybe others)? Also I think it would only work correctly for IN appearing at the top level of WHERE, though that might be too conservative. The main case where it could be a win is where the subquery is expected to produce relatively few output rows, so we could run it as the outer side of some join plan. In particular, when x is an indexed column in a large table, fetching x as the inner side of an indexscan nestloop would be much better than scanning the whole of the outer table. 2. Hash-based implementations: read the subquery once, load its values into an in-memory hashtable (discarding duplicates), and then probe the hashtable for each execution of the WHERE clause. This works for both IN and NOT IN, though we still need an uncorrelated subquery (else the hashtable can't be reused from row to row). This probably wins for a moderate number of rows in the subquery result (not too many to hash in memory) and a fairly large number of outer rows (else building the hashtable is not repaid). 3. Inner indexscan: essentially, automatically do the IN-to-EXISTS transform that's presently recommended by the FAQ. This wins if the subquery result is large but pushing down an equality condition makes it cheap, and there aren't too many outer rows. There may also be cases where the existing implementation is still the best, or perhaps is the only usable one. The difficulty is that it's not clear how to choose one of these four ways, short of planning the *entire* query from scratch all four ways :-(. This seems pretty grim. Approaches #2 and #3 could be handled as local transformations of the WHERE clause, but we couldn't choose which to use very well if we don't yet know how many outer rows the WHERE clause will be executed for. Approach #1 is really planning a completely different query --- one with an extra FROM-item --- and there's not going to be all that much commonality in the computations, unless we restrict where the added FROM-item can be joined to the others, which'd more or less defeat the purpose. Anyone see a way around this difficulty? regards, tom lane
pgsql-hackers by date: