Thread: Planning for improved versions of IN/NOT IN
I've been thinking about how to improve the performance of queries using "WHERE x IN (subselect)" and "WHERE x NOT IN (subselect)". In the existing implementation, the subquery result is rescanned to look for a match to x each time the WHERE clause is executed; this essentially makes it work like a nestloop join of the stupidest variety. (We do stick a Materialize node atop the subselect if it looks complicated, but that's not a big help in typical cases.) I've thought of three alternative implementations that would perform better in various scenarios. Each would be relatively simple to implement; the problem I'm having is figuring out how to get the planner to choose the best one. The alternatives are basically: 1. Add DISTINCT to the subquery, pull it up into the rangetable (as a subquery RT entry), and change the = SUBLINK WHERE clause to a simple = against the subquery output var(s). Essentially this transformsSELECT ... FROM foo WHERE foo.x IN (SELECT y FROM ...) toSELECT ... FROM foo, (SELECT DISTINCT y FROM ...) ss WHERE foo.x = ss.y This is not useful for NOT IN, and there are a few restrictions even for IN (no correlation variables in subquery, no LIMIT, maybe others)? Also I think it would only work correctly for IN appearing at the top level of WHERE, though that might be too conservative. The main case where it could be a win is where the subquery is expected to produce relatively few output rows, so we could run it as the outer side of some join plan. In particular, when x is an indexed column in a large table, fetching x as the inner side of an indexscan nestloop would be much better than scanning the whole of the outer table. 2. Hash-based implementations: read the subquery once, load its values into an in-memory hashtable (discarding duplicates), and then probe the hashtable for each execution of the WHERE clause. This works for both IN and NOT IN, though we still need an uncorrelated subquery (else the hashtable can't be reused from row to row). This probably wins for a moderate number of rows in the subquery result (not too many to hash in memory) and a fairly large number of outer rows (else building the hashtable is not repaid). 3. Inner indexscan: essentially, automatically do the IN-to-EXISTS transform that's presently recommended by the FAQ. This wins if the subquery result is large but pushing down an equality condition makes it cheap, and there aren't too many outer rows. There may also be cases where the existing implementation is still the best, or perhaps is the only usable one. The difficulty is that it's not clear how to choose one of these four ways, short of planning the *entire* query from scratch all four ways :-(. This seems pretty grim. Approaches #2 and #3 could be handled as local transformations of the WHERE clause, but we couldn't choose which to use very well if we don't yet know how many outer rows the WHERE clause will be executed for. Approach #1 is really planning a completely different query --- one with an extra FROM-item --- and there's not going to be all that much commonality in the computations, unless we restrict where the added FROM-item can be joined to the others, which'd more or less defeat the purpose. Anyone see a way around this difficulty? regards, tom lane
> The difficulty is that it's not clear how to choose one of these four > ways, short of planning the *entire* query from scratch all four ways :-(. > This seems pretty grim. Approaches #2 and #3 could be handled as local > transformations of the WHERE clause, but we couldn't choose which to use > very well if we don't yet know how many outer rows the WHERE clause will > be executed for. Approach #1 is really planning a completely different > query --- one with an extra FROM-item --- and there's not going to be > all that much commonality in the computations, unless we restrict where > the added FROM-item can be joined to the others, which'd more or less > defeat the purpose. What about in the case of a scalar subquery eg. SELECT x IN (1,2,3,4,54,56,6), when there maybe hundreds of scalars? Chris
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes: > What about in the case of a scalar subquery eg. SELECT x IN > (1,2,3,4,54,56,6), when there maybe hundreds of scalars? Unrelated to my present problem. regards, tom lane
Tom Lane wrote: > I've been thinking about how to improve the performance of queries using > "WHERE x IN (subselect)" and "WHERE x NOT IN (subselect)". > > In the existing implementation, the subquery result is rescanned to look > for a match to x each time the WHERE clause is executed; this essentially > makes it work like a nestloop join of the stupidest variety. (We do > stick a Materialize node atop the subselect if it looks complicated, but > that's not a big help in typical cases.) > > I've thought of three alternative implementations that would perform > better in various scenarios. Each would be relatively simple to > implement; the problem I'm having is figuring out how to get the planner > to choose the best one. The alternatives are basically: > [abbreviated] > 1. Add FROM item > 2. Hash-based > 3. Inner indexscan > 4. the existing implementation > The difficulty is that it's not clear how to choose one of these four > ways, short of planning the *entire* query from scratch all four ways :-(. > This seems pretty grim. Approaches #2 and #3 could be handled as local > transformations of the WHERE clause, but we couldn't choose which to use > very well if we don't yet know how many outer rows the WHERE clause will > be executed for. Approach #1 is really planning a completely different > query --- one with an extra FROM-item --- and there's not going to be > all that much commonality in the computations, unless we restrict where > the added FROM-item can be joined to the others, which'd more or less > defeat the purpose. > > Anyone see a way around this difficulty? How about starting with a rule-based method to make the choice? 1. If uncorrelated: use hash-based approach - ISTM this might address a large percentage of the problem cases -- it couldeven handle the "IN (list-of-scalars)" case. Could it fall back to a tuplesort/binary-search for the too many tohash in memory case? 2. If correlated: use an inner indexscan 3. If you come up with a pattern where none of the approaches produce a correct answer, use the existing implementation You could always get fancier later if needed, but something along these lines would be a great start. Joe
Joe Conway wrote: > Tom Lane wrote: > >> I've been thinking about how to improve the performance of queries using >> "WHERE x IN (subselect)" and "WHERE x NOT IN (subselect)". > > How about starting with a rule-based method to make the choice? > > 1. If uncorrelated: use hash-based approach - ISTM this might address a > large > percentage of the problem cases -- it could even handle the > "IN (list-of-scalars)" case. Could it fall back to a > tuplesort/binary-search for the too many to hash in memory case? > 2. If correlated: use an inner indexscan > 3. If you come up with a pattern where none of the approaches produce a > correct answer, use the existing implementation > > You could always get fancier later if needed, but something along these > lines would be a great start. I curious if any of the rewriting of EXISTS and NOT EXISTS would address the problem described by Date: http://www.firstsql.com/iexist.htm Mike Mascari mascarm@mascari.com
Mike Mascari <mascarm@mascari.com> writes: > I curious if any of the rewriting of EXISTS and NOT EXISTS would > address the problem described by Date: > http://www.firstsql.com/iexist.htm We are not here to redefine the SQL spec ... and especially not to eliminate its concept of NULL, which is what Date would really like ;-) The above-quoted screed is based on a claimed logical equivalence between NOT EXISTS() and NOT IN() that is just plain wrong when you consider the possibility of NULLs. Rather than "FirstSQL correctly processes this query", you should read "FirstSQL deliberately violates the SQL spec". (There may be grounds to argue that the spec behavior could be improved, but that's an argument to be making to the standards committee, not here.) regards, tom lane
Tom Lane wrote: > Mike Mascari <mascarm@mascari.com> writes: > >>I curious if any of the rewriting of EXISTS and NOT EXISTS would >>address the problem described by Date: That should read "I'm curious"... > > >>http://www.firstsql.com/iexist.htm > > > We are not here to redefine the SQL spec ... and especially not to > eliminate its concept of NULL, which is what Date would really like ;-) From what I've read of Date's so far, I think he'd like to junk SQL altogether. > The above-quoted screed is based on a claimed logical equivalence > between NOT EXISTS() and NOT IN() that is just plain wrong when you > consider the possibility of NULLs. Rather than "FirstSQL correctly > processes this query", you should read "FirstSQL deliberately violates > the SQL spec". (There may be grounds to argue that the spec behavior > could be improved, but that's an argument to be making to the standards > committee, not here.) Okay. I knew there was talk in the past that IN be rewritten as EXISTS, which is not what you propose doing, but would have exposed the odd behavior NOT EXISTS exhibits according to the SQL spec. I was also curious to know which path PostgreSQL development prefers to take when the SQL spec and the Relational Model part ways, as they often do. Maybe someday RedHat will have a voting member on the ANSI X3H2/NCITS committee. ;-) Mike Mascari mascarm@mascari.com