Canonicalization of WHERE clauses considered harmful - Mailing list pgsql-hackers

From Tom Lane
Subject Canonicalization of WHERE clauses considered harmful
Date
Msg-id 8702.1071093294@sss.pgh.pa.us
Whole thread Raw
Responses Re: Canonicalization of WHERE clauses considered harmful
Re: Canonicalization of WHERE clauses considered harmful
List pgsql-hackers
I've been thinking about Josh's recent complaint about poor planning
of queries like

SELECT t1.a, t2.b
FROM t1, t2
WHERE t1.a = t2.a
AND (( t1.c = x  AND t1.f IN (m, n, o)  AND t2.d = v  AND t2.e BETWEEN j AND k)OR( t1.c = y  AND t1.f IN (n, o, p)  AND
t2.d= v  AND t2.e BETWEEN k AND h)OR ( t1.c = z  AND t1.f IN (p, q)  AND t2.d = w  AND t2.e BETWEEN k AND h))
 

(which is an abstraction of Q19 in the TPC-R benchmark).  I believe that
most of the blame for this problem can be laid at the feet of 
optimizer/prep/prepqual.c, which attempts to transform the WHERE clause
into a canonical form (either CNF AND-of-ORs, or DNF OR-of-ANDs, according
to some heuristics that are not very compelling).  I am not certain
whether prepqual will actually fire on this query (still waiting for those
EXPLAIN results, Josh), but if it doesn't then it'd really be disastrous,
and if it does then it's still not going to improve matters much, because
forcing this WHERE into pure AND-of-ORs form isn't very helpful.

I've worked on prepqual.c some over the years, but it never occurred to me
to question the fundamental premise of creating a normal-form WHERE clause
before.  Yet as I look at this example, I'm wondering why we should try
to.

Josh lied a little bit about the TPC-R query: the form actually given in
the TPC spec is

WHERE(  t1.a = t2.a  AND t1.c = x  AND t1.f IN (m, n, o)  AND t2.d = v  AND t2.e BETWEEN j AND k)OR(  t1.a = t2.a  AND
t1.c= y  AND t1.f IN (n, o, p)  AND t2.d = v  AND t2.e BETWEEN k AND h)etc
 

It is absolutely critical that the optimizer recognize that the "t1.a =
t2.a" clause is common to all the OR branches and can be factored out
to give the form Josh showed.  Without that, you don't have any equijoin
clause and you're going to end up with a horrid nested-loop join.
In the TPC query per spec, some of the other single-table restrictions are
also common to all the OR branches (ie, x, y, etc are not all different
constants) and should be pulled out so that they can be applied as
restriction clauses to the input tables before the join.

In other words, we'd like the optimizer to transform(a AND b) OR (a AND c)
toa AND (b OR c)

Currently, this is accomplished by the roundabout method of converting
the WHERE clause to CNF (AND-of-ORs) and then simplifying duplicate
sub-clauses within an OR:(a AND b) OR (a AND c)
expands by repeated application of the distributive law to(a OR a) AND (a OR c) AND (b OR a) AND (b OR c)
and then qual_cleanup notices that (a OR a) is redundant, leavinga AND (a OR c) AND (b OR a) AND (b OR c)
So we manage to pull out "a" all right, but we've left the query cluttered
with additional, redundant clauses --- there is no logic that will notice
that this could be simplified toa AND (b OR c)
The extra clauses make for useless work during planning and during
execution; they also screw up selectivity estimates (since the selectivity
estimator doesn't realize they are redundant).  This is bad.

So as a mechanism for pulling out duplicate clauses, canonicalization
sucks.  Is there anything else that it does for us?  After some thought
the only possible benefit I can see is that it can sometimes manufacture
restriction clauses from what otherwise would be join clauses.  In the
TPC-R example, once we've pulled out duplicate clauses we'd still be left
with an OR of ANDed conditions, some on t1 and some on t2.  If we leave
this as-is, there's not much to be done with it except apply it as a join
filter after the t1/t2 join is made.  However, if we force to CNF form
then some of the resulting sub-clauses will refer only to t1, some only
to t2, and some to both tables.  The ones that mention only t1 or only
t2 would be usable as restriction clauses during the table scans, which
is a big help in cutting the size of the required join.  Also they could
possibly be used as indexscan qualifications (I'm not sure whether TPC-R
expects relevant indexes to be in place).

So it seems to me that a reasonable heuristic for deciding whether to
CNF-ify an OR clause is to check whether any of the resulting subclauses
will refer to fewer relations than the original OR clause does.  If not,
we may as well leave it as an OR clause and avoid generating redundant
clauses.  There is no such test at present.

Pulling out duplicate clauses from the OR arms would probably be better
done directly than by relying on CNF-ification.

Another consideration is that the planner already has logic for generating
multi-indexscans by pulling relevant conditions out of OR-of-AND trees.
So, at least for the case where there's a relevant index, we do not need
any CNF conversion to be able to make a useful indexscan restriction
clause.  This fails to happen in the TPC-R example at the moment, but
only because the OR indexscan logic neglects to consider whether OR
trees that mention multiple relations might contain subclauses that
reference only the indexed relation.  It'd be pretty easy to fix that.
Is it worth doing CNF-ification at all, if the only value is to detect
possible restriction clauses that aren't related to indexes?

Comments?
        regards, tom lane


pgsql-hackers by date:

Previous
From: "Thomas Hallgren"
Date:
Subject: Re: pljava revisited
Next
From: Bruno Wolff III
Date:
Subject: Re: Canonicalization of WHERE clauses considered harmful