Thread: apply outer->inner join optimisation to OR clauses
The attached patch applies the optimisation translating outer joins to inner joins (where safe) to the cases where the WHERE clause has OR bits in it too, if the column is present (and not null) in all of the OR bits. This allows, for example: bbaetz=# explain analyze select bugs.bug_id from bugs left join longdescs using (bug_id) where who IN (1,2); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------- Merge Left Join (cost=149629.84..172021.09 rows=100000 width=12) (actual time=20102.71..23694.30 rows=105 loops=1) Merge Cond: ("outer".bug_id = "inner".bug_id) Filter: (("inner".who = 1) OR ("inner".who = 2)) -> Index Scan using bugs_pkey on bugs (cost=0.00..2142.00 rows=100000 width=4) (actual time=6.17..310.94 rows=100000 loops=1) -> Sort (cost=149629.84..152129.84 rows=1000000 width=8) (actual time=19969.66..21317.62 rows=1000000 loops=1) Sort Key: longdescs.bug_id -> Seq Scan on longdescs (cost=0.00..14902.00 rows=1000000 width=8) (actual time=0.03..4225.04 rows=1000000 loops=1) Total runtime: 23739.90 msec (8 rows) to become: bbaetz=# explain analyze select bugs.bug_id from bugs left join longdescs using (bug_id) where who IN (1,2); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=0.00..691.64 rows=99 width=8) (actual time=42.54..1289.34 rows=105 loops=1) -> Index Scan using longdescs_who_idx, longdescs_who_idx on longdescs (cost=0.00..395.09 rows=98 width=4) (actual time=7.31..547.09 rows=105 loops=1) Index Cond: ((who = 1) OR (who = 2)) -> Index Scan using bugs_pkey on bugs (cost=0.00..3.01 rows=1 width=4) (actual time=7.06..7.06 rows=1 loops=105) Index Cond: (bugs.bug_id = "outer".bug_id) Total runtime: 1289.60 msec (6 rows) I wanted to add a regression test, but it doesn't look like theres infrastructure to test that an optimisation is being applied. Thanks, Bradley
Attachment
Bradley Baetz <bbaetz@acm.org> writes: > The attached patch applies the optimisation translating outer joins to > inner joins (where safe) to the cases where the WHERE clause has OR bits > in it too, if the column is present (and not null) in all of the OR > bits. Your change for AND is obviously incorrect, and I don't think I believe the OR case either. Why is it safe to pass down a TRUE top_level flag? regards, tom lane
On Sat, May 03, 2003 at 01:22:11PM -0400, Tom Lane wrote: > Bradley Baetz <bbaetz@acm.org> writes: > > The attached patch applies the optimisation translating outer joins to > > inner joins (where safe) to the cases where the WHERE clause has OR bits > > in it too, if the column is present (and not null) in all of the OR > > bits. > > Your change for AND is obviously incorrect, and I don't think I believe > the OR case either. Why is it safe to pass down a TRUE top_level flag? Its safe because WHERE (foo=1) OR (foo=2) OR (foo IS NOT NULL) still means that foo is non_nullable. Without my patch, this is done for AND anyway, via the List node. I don't see the difference in this context between (List (Op = foo 1) (Op = foo 2) (NullTest isNotNull foo)) and (BoolExpr And (Op = foo 1) (Op = foo 2) (NullTest isNotNull foo)) since the second is simplified into the first earlier on. As I understand it, the top_level flag is intended to prevent against |NOT (foo IS NOT NULL)|, which does not make foo non_nullable, and |NOT (foo IS True/False/etc)|, which also doesn't prove that foo isn't nullable. For the opExpr, we want to stop ((foo IS NOT NULL)=False) from making foo nonnullable. For the BooleanExpr's OR and AND I don't see a combination which can take place that way - none of the operands are combined with the other so that they can 'interfere' with each other like that. I don't think that the fact that |SELECT foo>2 OR NULL| gives NULL, not FALSE, if foo <=2 is an issue either, since they're both not a true value. > > regards, tom lane Bradley
Bradley Baetz <bbaetz@acm.org> writes: > I don't think that the fact that |SELECT foo>2 OR NULL| gives NULL, not > FALSE, if foo <=2 is an issue either, since they're both not a true > value. But that is exactly the distinction that we have to worry about when not at top level. The error in the AND part of the proposed patch is exhibited by WHERE NOT ((a.a1 > 1) AND (b.b1 > 1)) When a.a1 is NULL, the AND can't yield TRUE --- but it can yield FALSE, which will become TRUE at the top level. So neither a nor b can be considered non-nullable in this expression. You do have an insight here, which is that if the same rel can be shown to null *all* the arms of an OR, it nulls the OR. The same holds true of an AND, I think. You might be right that passing top_level through the AND or OR would be safe in the present state of the code, but it strikes me as fragile and confusing to leave it that way, because it no longer means what its name implies and the comments say. ISTM what we need to do if we want to extend this code is to come up with a stronger definition of the context in which each recursive level gets invoked. I think it might work to define a three-way context value instead of a boolean: SAFE_IF_NULL a rel is nonnullable if nulling it forces this expression to return NULL SAFE_IF_NULL_OR_FALSE a rel is nonnullable if nulling it forces this expression to return NULL or FALSE SAFE_IF_NULL_OR_TRUE a rel is nonnullable if nulling it forces this expression to return NULL or TRUE The present notion of top-level would be replaced by context SAFE_IF_NULL_OR_FALSE. Descent through a strict operator or function always narrows the context to SAFE_IF_NULL (since we don't know what the op/func might do with other values, and we can't even be sure we're talking about booleans anymore anyway). Descent through NOT exchanges SAFE_IF_NULL_OR_FALSE and SAFE_IF_NULL_OR_TRUE. You could make appropriate extensions of the code for NullTest and BooleanTest, and I think this would allow sane handling of the AND and OR cases: for example, AND with SAFE_IF_NULL_OR_FALSE context really can union the inputs' nonnullable sets (since forcing any one input to null or false guarantees the AND result is null or false) but AND in the other contexts has to intersect the inputs' sets. I think also that AND and OR might be able to pass down more liberal contexts than they get, but am not sure about that part. (BTW, I don't quite like the adjective SAFE here; can you think of a better term?) The SAFE_IF_NULL_OR_TRUE context might prove not to be worth the trouble of implementing --- I'm not sure if either AND or OR can really exploit it any better than they can exploit SAFE_IS_NULL. If not, we'd be back to just a boolean context value, but we'd have a clearer understanding of what it really means. I don't have time to slog through the details, but if you want to run with that, go for it ... regards, tom lane
On Sat, May 03, 2003 at 08:53:14PM -0400, Tom Lane wrote: > Bradley Baetz <bbaetz@acm.org> writes: > > I don't think that the fact that |SELECT foo>2 OR NULL| gives NULL, not > > FALSE, if foo <=2 is an issue either, since they're both not a true > > value. > > But that is exactly the distinction that we have to worry about when not > at top level. The error in the AND part of the proposed patch is > exhibited by > WHERE NOT ((a.a1 > 1) AND (b.b1 > 1)) > When a.a1 is NULL, the AND can't yield TRUE --- but it can yield FALSE, > which will become TRUE at the top level. So neither a nor b can be > considered non-nullable in this expression. I'm about to run out for a bit, so I'll reply to the rest later. I was considering WHERE NOT (a.a1>1), which was already acceptable, but you're right in that your example can have either of them being null. Its really just NOT which is the problem. I wonder if its safe to do this for stuff inside a NULL iff there is only one nonnullable_rel from the expression? I'll think about that a bit. I'll read through your comments in more detail in a few hours. > (BTW, I don't quite like the adjective SAFE here; can you think of a > better term?) NULL_STRICT, perhaps? > > The SAFE_IF_NULL_OR_TRUE context might prove not to be worth the trouble > of implementing --- I'm not sure if either AND or OR can really exploit > it any better than they can exploit SAFE_IS_NULL. If not, we'd be back > to just a boolean context value, but we'd have a clearer understanding > of what it really means. The main use (which got me to look into this) is IN. Bradley
On Sat, May 03, 2003 at 08:53:14PM -0400, Tom Lane wrote: > Bradley Baetz <bbaetz@acm.org> writes: > > I don't think that the fact that |SELECT foo>2 OR NULL| gives NULL, not > > FALSE, if foo <=2 is an issue either, since they're both not a true > > value. > > But that is exactly the distinction that we have to worry about when not > at top level. The error in the AND part of the proposed patch is > exhibited by > WHERE NOT ((a.a1 > 1) AND (b.b1 > 1)) > When a.a1 is NULL, the AND can't yield TRUE --- but it can yield FALSE, > which will become TRUE at the top level. So neither a nor b can be > considered non-nullable in this expression. I tried testing this, but the problem is that this is rewritten as WHERE (a.a1<=1 OR b.b1<=1) so it doesn't trigger. (I was sure I'd tested that before submiting the patch) Is there a way to turn that transformation off for testing purposes? Or are we guaranteed that there wont' be OR/AND expressions inside a NOT, because they will have always been expanded? > > You do have an insight here, which is that if the same rel can be shown > to null *all* the arms of an OR, it nulls the OR. The same holds true > of an AND, I think. Yes, but with an AND, you can use any of them (modulo the NOT issue, of course) > SAFE_IF_NULL a rel is nonnullable if nulling it forces this > expression to return NULL > SAFE_IF_NULL_OR_FALSE a rel is nonnullable if nulling it forces this > expression to return NULL or FALSE > SAFE_IF_NULL_OR_TRUE a rel is nonnullable if nulling it forces this > expression to return NULL or TRUE I'll have to think abotu this a bit more. Do we agree that if the thing inside the NOT is a single expression, then what I've doing is safe? IS it safe even if theres and/or involved as the argument to a (strict) operator? > regards, tom lane Bradley
Bradley Baetz <bbaetz@acm.org> writes: > Or are we guaranteed that there wont' be OR/AND expressions > inside a NOT, because they will have always been expanded? No, we're not, because the CNF-ifying code is optional. If you study the heuristics in prepqual.c you could doubtless devise a query that has such an expression somewhere in it. Even if that was a safe assumption today, writing code that would fail in as subtle a way as this when presented with such a tree won't do. The optimizer gets revamped constantly, and so I don't want to see any assumptions that fragile embedded into pieces of it. > Do we agree that if the thing inside the NOT is a single expression, > then what I've doing is safe? IS it safe even if theres and/or involved > as the argument to a (strict) operator? I'm going to take the position "fix it right or don't touch it at all". The routine as it stands does what it was intended to do. If you want to invest the effort to take it to the next level of intelligence, great --- but let's not put in a half-baked attempt. regards, tom lane