Canonicalization of WHERE clauses considered harmful - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Canonicalization of WHERE clauses considered harmful |
Date | |
Msg-id | 8702.1071093294@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: Canonicalization of WHERE clauses considered harmful
Re: Canonicalization of WHERE clauses considered harmful |
List | pgsql-hackers |
I've been thinking about Josh's recent complaint about poor planning of queries like SELECT t1.a, t2.b FROM t1, t2 WHERE t1.a = t2.a AND (( t1.c = x AND t1.f IN (m, n, o) AND t2.d = v AND t2.e BETWEEN j AND k)OR( t1.c = y AND t1.f IN (n, o, p) AND t2.d= v AND t2.e BETWEEN k AND h)OR ( t1.c = z AND t1.f IN (p, q) AND t2.d = w AND t2.e BETWEEN k AND h)) (which is an abstraction of Q19 in the TPC-R benchmark). I believe that most of the blame for this problem can be laid at the feet of optimizer/prep/prepqual.c, which attempts to transform the WHERE clause into a canonical form (either CNF AND-of-ORs, or DNF OR-of-ANDs, according to some heuristics that are not very compelling). I am not certain whether prepqual will actually fire on this query (still waiting for those EXPLAIN results, Josh), but if it doesn't then it'd really be disastrous, and if it does then it's still not going to improve matters much, because forcing this WHERE into pure AND-of-ORs form isn't very helpful. I've worked on prepqual.c some over the years, but it never occurred to me to question the fundamental premise of creating a normal-form WHERE clause before. Yet as I look at this example, I'm wondering why we should try to. Josh lied a little bit about the TPC-R query: the form actually given in the TPC spec is WHERE( t1.a = t2.a AND t1.c = x AND t1.f IN (m, n, o) AND t2.d = v AND t2.e BETWEEN j AND k)OR( t1.a = t2.a AND t1.c= y AND t1.f IN (n, o, p) AND t2.d = v AND t2.e BETWEEN k AND h)etc It is absolutely critical that the optimizer recognize that the "t1.a = t2.a" clause is common to all the OR branches and can be factored out to give the form Josh showed. Without that, you don't have any equijoin clause and you're going to end up with a horrid nested-loop join. In the TPC query per spec, some of the other single-table restrictions are also common to all the OR branches (ie, x, y, etc are not all different constants) and should be pulled out so that they can be applied as restriction clauses to the input tables before the join. In other words, we'd like the optimizer to transform(a AND b) OR (a AND c) toa AND (b OR c) Currently, this is accomplished by the roundabout method of converting the WHERE clause to CNF (AND-of-ORs) and then simplifying duplicate sub-clauses within an OR:(a AND b) OR (a AND c) expands by repeated application of the distributive law to(a OR a) AND (a OR c) AND (b OR a) AND (b OR c) and then qual_cleanup notices that (a OR a) is redundant, leavinga AND (a OR c) AND (b OR a) AND (b OR c) So we manage to pull out "a" all right, but we've left the query cluttered with additional, redundant clauses --- there is no logic that will notice that this could be simplified toa AND (b OR c) The extra clauses make for useless work during planning and during execution; they also screw up selectivity estimates (since the selectivity estimator doesn't realize they are redundant). This is bad. So as a mechanism for pulling out duplicate clauses, canonicalization sucks. Is there anything else that it does for us? After some thought the only possible benefit I can see is that it can sometimes manufacture restriction clauses from what otherwise would be join clauses. In the TPC-R example, once we've pulled out duplicate clauses we'd still be left with an OR of ANDed conditions, some on t1 and some on t2. If we leave this as-is, there's not much to be done with it except apply it as a join filter after the t1/t2 join is made. However, if we force to CNF form then some of the resulting sub-clauses will refer only to t1, some only to t2, and some to both tables. The ones that mention only t1 or only t2 would be usable as restriction clauses during the table scans, which is a big help in cutting the size of the required join. Also they could possibly be used as indexscan qualifications (I'm not sure whether TPC-R expects relevant indexes to be in place). So it seems to me that a reasonable heuristic for deciding whether to CNF-ify an OR clause is to check whether any of the resulting subclauses will refer to fewer relations than the original OR clause does. If not, we may as well leave it as an OR clause and avoid generating redundant clauses. There is no such test at present. Pulling out duplicate clauses from the OR arms would probably be better done directly than by relying on CNF-ification. Another consideration is that the planner already has logic for generating multi-indexscans by pulling relevant conditions out of OR-of-AND trees. So, at least for the case where there's a relevant index, we do not need any CNF conversion to be able to make a useful indexscan restriction clause. This fails to happen in the TPC-R example at the moment, but only because the OR indexscan logic neglects to consider whether OR trees that mention multiple relations might contain subclauses that reference only the indexed relation. It'd be pretty easy to fix that. Is it worth doing CNF-ification at all, if the only value is to detect possible restriction clauses that aren't related to indexes? Comments? regards, tom lane
pgsql-hackers by date: