Thread: Okay to tighten definition of oprcanhash?
I have been looking into the possibility of using a hashtable to speed up "x IN (SELECT y FROM ...)" operations. Basically the idea is to run the subselect once, loading its "y" outputs into an in-memory hashtable (any duplicates can be discarded); then for each outer row, probe into the hashtable to see if there's a match for the outer "x" value. This relatively simple idea gets a lot messier as soon as you start to consider the fine points of SQL92 "IN" semantics, however. In particular, if x is null or any particular y is null, the result of the IN should be null ("unknown"), not false. What the spec really says is: 1. If "x = y" yields TRUE for any row of the sub-select, the IN yields TRUE. 2. If "x = y" yields FALSE for all rows of the sub-select, or the sub-select contains no rows, the IN yields FALSE. 3. Otherwise (which by elimination means "x = y" yielded UNKNOWN for at least one row, and TRUE nowhere), the IN yieldsUNKNOWN. What we need to know about, therefore, is not so much whether x or y is null as whether the "=" operator yields null. Also, IN supports multi-column comparisons --- eg,(x1, x2) IN (SELECT y1, y2 FROM ...) So the x or y value could be partially null as well as completely null. I think I see how to do this in what will usually be a fairly efficient fashion, but it requires assuming a good deal about the behavior of the "=" operator(s) involved. The actual implementation has to be: 1. Store all-not-null subquery rows in hashtable (combining equal rows into a single entry). Store partially or completely null rows in a separate linear list, combining non-distinct rows into a single entry. 2a. For LHS row all null: result is FALSE if table is empty, else UNKNOWN. 2b. For LHS row partially-null: must compare against every element of both hashtable and linear list, using same logic as for scanning original subquery output (three-valued combination of OR of ANDs). We know that we will not get a TRUE result, so we can stop as soon as we find an UNKNOWN result for any row; this says it's worthwhile to scan the linear list first, as that's most likely to give us an UNKNOWN. 2c. For LHS all-non-null: probe into hashtable for equality; result is TRUE if match found. Otherwise, must scan list of partially-null rows to see if any non-distinct rows are found; if so return UNKNOWN, else return FALSE. This should be reasonably efficient because the slow cases only occur when there are partially-null LHS or RHS rows, which should be relatively uncommon. (In particular, when there is only one column to compare, case 2b doesn't arise at all, and there can never be more than one entry in the linear list, namely RHS = null.) Also, the number of distinct partially null rows should usually be much less than the number of distinct no-nulls rows, so the linear list should be much smaller than the hashtable. (I tried to think of ways to hash this list, but it doesn't seem to work...) Now, the assumptions this makes about the behavior of the "=" operator(s) are: 1. The operators are hashable (ie, only values producing the same hashkey could possibly be equal). Else we can't use a hashtable at all. 2. The operators are strict (must yield NULL out for NULL in). This is what lets us avoid a table scan in case 2a. 3. The operators do not yield NULL for non-null inputs. Otherwise, after case 2c fails to find a TRUE result, we'd have to scan the whole hashtable, not only the partially-null list, to see if there are rows that must cause us to return UNKNOWN. We already have markers in pg_operator and pg_proc to indicate hashability and strictness, but there's no marker to indicate "reverse strictness" as per assumption #3 (what's the proper term for that property, anyway?) I am leaning towards documenting that the oprcanhash marker implies "reverse strictness" as well. All the currently hashable operators meet that requirement. Although you could imagine cases where you might like "=" to return null for non-null inputs (for example, a rigorous interpretation of IEEE float math would probably say that a NaN compared to anything else yields UNKNOWN), it's not really a very practical thing to do in SQL --- one counterexample is that such a datatype could not be btree-indexable, since btree assumes a total ordering is possible. The other thing we could do is put a separate "reverse strict" marker column into pg_proc, but this seems like much more work than is justified. Comments? (Oh BTW: I'm aware that most of the problem goes away for IN appearing at the top level of WHERE, since there we don't actually need to distinguish FALSE from UNKNOWN results. But I'm trying to understand how to optimize the general case, too.) regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > I have been looking into the possibility of using a hashtable to speed > up "x IN (SELECT y FROM ...)" operations. That's certainly one of the join types that Oracle can perform, and it's frequently by far the fastest. I'm not sure but I think the way Oracle optimizes subselects is by transforming them into the equivalent join. Or rather, probably by transforming both joins and subselects into an equivalent internal representation. This applies equally to things likeWHERE x=(select..) as well as things likeWHERE x IN (select...) The former is exactly equivalent to a join with an assertion of uniqueness. The latter is a little different but still equivalent to a join with special behaviour in case of duplicates. In many cases the database will have a constraint telling it that no duplicates will appear in which cases it should be able to optimize out any extra work anyways. I know when I guided less experienced SQL programmers I urged them not to worry about which form would perform better, only which form more clearly expressed the result set they wanted. I promised them the database should produce the same query plan for equivalent queries regardless of the form chosen to express that in. That was almost always true for Oracle. I generally found it impossible to optimize a query merely by changing it from one equivalent form to another. Oracle nearly always produced exactly the same plan. I always had to either find a non-equivalent form that I knew would produce the same results only because of extra information I had about the data, or add optimizer hints. So is there some more general internal representation that can represent all three of these cases in a consistent manner? It seems more powerful to implement hash joins in a way that helps normal join queries as well as particular subselect forms of queries. For what it's worth my limited experience so far with postgres is that what you're talking about is sorely needed. I'm having trouble getting queries that I would write without thinking twice about on Oracle to perform reasonably on postgres even after trying all kinds of contortions. And they're precisely the types of queries that would be helped by hash joins. -- greg
Greg Stark <gsstark@mit.edu> writes: > I'm not sure but I think the way Oracle optimizes subselects is by > transforming them into the equivalent join. The point here is that there is no exactly equivalent join operation. (Of course, given Oracle's known lack of standards-compliance on NULL semantics, I wouldn't be overly surprised if they've misimplemented IN in a way that doesn't preserve the spec's semantics ...) It does get a lot simpler when the IN appears as a top-level WHERE clause, because *in that context* you can ignore the difference between FALSE and UNKNOWN results from IN. I have some other plans for implementing IN in a join-like fashion in that special case. But what I'm looking at right now is the general case ... regards, tom lane
I like the idea, and see need to keep a separate list of NULL values. --------------------------------------------------------------------------- Tom Lane wrote: > I have been looking into the possibility of using a hashtable to speed > up "x IN (SELECT y FROM ...)" operations. Basically the idea is to run > the subselect once, loading its "y" outputs into an in-memory hashtable > (any duplicates can be discarded); then for each outer row, probe into > the hashtable to see if there's a match for the outer "x" value. > > This relatively simple idea gets a lot messier as soon as you start to > consider the fine points of SQL92 "IN" semantics, however. In > particular, if x is null or any particular y is null, the result of the > IN should be null ("unknown"), not false. What the spec really says is: > > 1. If "x = y" yields TRUE for any row of the sub-select, > the IN yields TRUE. > > 2. If "x = y" yields FALSE for all rows of the sub-select, > or the sub-select contains no rows, the IN yields FALSE. > > 3. Otherwise (which by elimination means "x = y" yielded UNKNOWN > for at least one row, and TRUE nowhere), the IN yields UNKNOWN. > > What we need to know about, therefore, is not so much whether x or y is > null as whether the "=" operator yields null. > > Also, IN supports multi-column comparisons --- eg, > (x1, x2) IN (SELECT y1, y2 FROM ...) > So the x or y value could be partially null as well as completely null. > > I think I see how to do this in what will usually be a fairly efficient > fashion, but it requires assuming a good deal about the behavior of the > "=" operator(s) involved. The actual implementation has to be: > > 1. Store all-not-null subquery rows in hashtable (combining equal rows > into a single entry). Store partially or completely null rows in a > separate linear list, combining non-distinct rows into a single entry. > > 2a. For LHS row all null: result is FALSE if table is empty, else UNKNOWN. > > 2b. For LHS row partially-null: must compare against every element of > both hashtable and linear list, using same logic as for scanning > original subquery output (three-valued combination of OR of ANDs). > We know that we will not get a TRUE result, so we can stop as soon as > we find an UNKNOWN result for any row; this says it's worthwhile to scan > the linear list first, as that's most likely to give us an UNKNOWN. > > 2c. For LHS all-non-null: probe into hashtable for equality; result is > TRUE if match found. Otherwise, must scan list of partially-null rows > to see if any non-distinct rows are found; if so return UNKNOWN, else > return FALSE. > > This should be reasonably efficient because the slow cases only occur > when there are partially-null LHS or RHS rows, which should be > relatively uncommon. (In particular, when there is only one column to > compare, case 2b doesn't arise at all, and there can never be more than > one entry in the linear list, namely RHS = null.) Also, the number of > distinct partially null rows should usually be much less than the number > of distinct no-nulls rows, so the linear list should be much smaller > than the hashtable. (I tried to think of ways to hash this list, but it > doesn't seem to work...) > > Now, the assumptions this makes about the behavior of the "=" > operator(s) are: > > 1. The operators are hashable (ie, only values producing the same > hashkey could possibly be equal). Else we can't use a hashtable at all. > > 2. The operators are strict (must yield NULL out for NULL in). This is > what lets us avoid a table scan in case 2a. > > 3. The operators do not yield NULL for non-null inputs. Otherwise, > after case 2c fails to find a TRUE result, we'd have to scan the whole > hashtable, not only the partially-null list, to see if there are rows > that must cause us to return UNKNOWN. > > We already have markers in pg_operator and pg_proc to indicate > hashability and strictness, but there's no marker to indicate "reverse > strictness" as per assumption #3 (what's the proper term for that > property, anyway?) > > I am leaning towards documenting that the oprcanhash marker implies > "reverse strictness" as well. All the currently hashable operators meet > that requirement. Although you could imagine cases where you might like > "=" to return null for non-null inputs (for example, a rigorous > interpretation of IEEE float math would probably say that a NaN compared > to anything else yields UNKNOWN), it's not really a very practical thing > to do in SQL --- one counterexample is that such a datatype could not be > btree-indexable, since btree assumes a total ordering is possible. > > The other thing we could do is put a separate "reverse strict" marker > column into pg_proc, but this seems like much more work than is > justified. > > Comments? > > (Oh BTW: I'm aware that most of the problem goes away for IN appearing > at the top level of WHERE, since there we don't actually need to > distinguish FALSE from UNKNOWN results. But I'm trying to understand > how to optimize the general case, too.) > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073