Thread: Reducing expression evaluation overhead
I've been looking at an example provided by Arjen van der Meijden in which practically all the runtime goes into evaluating very trivial comparison expressions (it's basically a CASE statement with a huge number of arms). Whether or not you think that a CASE with a huge number of arms is a particularly sane thing to do, it does seem that the overhead is uncomfortably large. I get profiles like this:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
38.15     41.92    41.92     229646     0.00     0.00  nconc
21.76     65.84    23.92  199054454     0.00     0.00  ExecEvalExpr
11.38     78.34    12.50      10000     0.00     0.00  ExecEvalCase
 8.43     87.61     9.27   66348151     0.00     0.00  ExecEvalFuncArgs
 8.12     96.54     8.93   66348151     0.00     0.00  ExecMakeFunctionResult
 2.96     99.78     3.25   66348151     0.00     0.00  ExecEvalVar
 1.23    101.14     1.36      10058     0.00     0.00  AllocSetCheck
 1.23    102.49     1.35   66348151     0.00     0.00  ExecEvalOper
 1.12    103.72     1.24      76537     0.00     0.00  OpernameGetCandidates
 0.85    104.66     0.94   66424693     0.00     0.00  int4eq

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
25.24    161.92   161.92      51488     3.14     3.14  nconc
20.99    296.57   134.65  172163502     0.00     0.00  ExecEvalExpr
14.90    392.18    95.61                               _mcount
 9.66    454.16    61.98   57381167     0.00     0.00  ExecMakeFunctionResult
 7.32    501.12    46.96      10000     4.70     4.70  ExecEvalCase
 7.24    547.60    46.48   57381167     0.00     0.00  ExecEvalFuncArgs
 6.47    589.09    41.49   57381167     0.00     0.00  ExecEvalVar
 2.90    607.69    18.60   57381167     0.00     0.00  ExecEvalOper
 1.29    615.94     8.25                               int4eq

The above are from different-size test cases on different hardware, just to make sure that the data is reasonably trustworthy. The nconc entry may be ignored; it's lappend() overhead from the parser, and will drop into the noise once Neil gets his list rewrite done. The thing that jumps out at me is that ExecEvalExpr is taking way more cycles than it has any right to.
The test conditions in the CASE are of the form "var = constant", and so for each tested condition, ExecEvalExpr is invoked three times: one call is passed off to ExecEvalVar, one to ExecEvalOper, and the constant is handled inline by fairly trivial code, which is surely a good bit cheaper than ExecEvalVar. So it seems that a very large fraction of the runtime is being spent just in ExecEvalExpr's switch statement, plus the overhead of an extra level of function call to reach the routines that actually do the work.

I am thinking about attacking this by expanding ExprState nodes to include a function pointer to the appropriate evaluation subroutine (ExecEvalVar, ExecEvalOper, etc). ExecEvalExpr would be replaced by a macro like this:

#define ExecEvalExpr(expr, econtext, isNull, isDone) \
	((*(expr)->evalfunc) (expr, econtext, isNull, isDone))

This would eliminate the overhead of the switch() in ExecEvalExpr, and reduce two levels of function call to one for most node types, at the cost of whatever the delta is between a direct and an indirect function call. In my experience that delta is pretty small, and we should come out ahead.

The main downside I can see is adding another four or eight bytes to each ExprState node, which might be a noticeable memory burden with complex expressions. A lesser objection is that unless we wanted to complicate and slow down the ExecEvalExpr macro, we'd have to assume that the macro is never passed a null expression pointer. Right now ExecEvalExpr treats a NULL input as if it were a null constant node. However, I'm fairly sure that that test is useless, because no one ever tries to evaluate a null expression tree.

We could make some other marginal hacks too once we did this. For instance, ExecEvalOper just hands off control to ExecMakeFunctionResult once it's made sure the fcache has been initialized.
We could have it do that initialization and then change the evalfunc pointer to point directly to ExecMakeFunctionResult, thereby eliminating another level of function-call overhead on all subsequent calls. There are other tests that could be reduced from every-time to one-time overhead with similar tricks. I'm not sure that this would let us catch up to what Arjen reports as MySQL's expression evaluation speed, but it should at least speed things up a bit with only fairly localized changes. Comments? regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

  Tom> I'm not sure that this would let us catch up to what Arjen
  Tom> reports as MySQL's expression evaluation speed, but it should
  Tom> at least speed things up a bit with only fairly localized
  Tom> changes.

I like the idea of memoizing the switch with function pointers as I don't think branch prediction helps much with varying switch arms selected with different exprs. Also I agree that the delta of indirect function invocation is probably small.

I've forgotten the syntax of case, but for the simple form isn't expr=const going to be the same expr for each case arm? If that's the case, couldn't we actually save the value of expr in a Datum and then reuse that (through a Const) in each of the other arms to evaluate the actual exprs? That should reduce the number of times ExecEvalVar (and through it heapgetattr) are called.

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
Sailesh Krishnamurthy <sailesh@cs.berkeley.edu> writes:
> I like the idea of memoizing the switch with function pointers as I
> don't think branch prediction helps much with varying switch arms
> selected with different exprs.

Check, it's hard to see how any CPU could get much traction on the behavior of the switch jump in ExecEvalExpr.

> I've forgotten the syntax of case, but for the simple form isn't
> expr=const going to be the same expr for each case arm ? If that's the
> case, couldn't we actually save the value of expr in a Datum and then
> reuse that (through a Const) in each of the other arms to evaluate the
> actual exprs ? That should reduce the number of times ExecEvalVar (and
> through it heapgetattr) are called.

Right now we expand the simple form of CASE into the general form:

	CASE x WHEN y THEN ... WHEN z THEN ...

becomes

	CASE WHEN x=y THEN ... WHEN x=z THEN ...

This does involve multiple evaluation of x, which'd be particularly nasty if it's more than just a variable reference. It's probably a good idea to try to improve that. But inlining ExecEvalExpr will help all expressions, not just CASE, so I'll work on that first ...

			regards, tom lane
Sailesh Krishnamurthy <sailesh@cs.berkeley.edu> writes:
> >>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>   Tom> I'm not sure that this would let us catch up to what Arjen
>   Tom> reports as MySQL's expression evaluation speed, but it should
>   Tom> at least speed things up a bit with only fairly localized
>   Tom> changes.
>
> I like the idea of memoizing the switch with function pointers as I
> don't think branch prediction helps much with varying switch arms
> selected with different exprs. Also I agree that the delta of indirect
> function invocation is probably small.

a) I don't see why you would assume branch prediction would be ineffective here. There are probably a few arms of the switch that are more common than all the others, especially when a large query is evaluating the same expression over and over again. However, worrying about branch prediction is probably premature when the amount of time being spent here is so large.

b) Instead of storing one of a small set of function pointers in every node of every expression, wouldn't it make more sense to have a table lookup from node type to function pointer? You only need one pointer for each node type, not one for every node. NodeTags start at 0 and the highest is 900, so we're talking about a 4k lookup table on a 32-bit machine.

> I've forgotten the syntax of case, but for the simple form isn't
> expr=const going to be the same expr for each case arm ? If that's the
> case, couldn't we actually save the value of expr in a Datum and then
> reuse that (through a Const) in each of the other arms to evaluate the
> actual exprs ? That should reduce the number of times ExecEvalVar (and
> through it heapgetattr) are called.

That's not true all the time, but I know 90% of my case statements are of this form. In some ideal world postgres would recognize this form and handle it specially using some kind of quick hash table lookup. I don't see how to reconcile that with postgres's extensible types, though.
I guess if postgres can see that every arm of the CASE is a '=' expression one side of which is a constant expression and all of the same type then it could use the same operators that the hash index code uses? That seems like it would be a lot of work though. -- greg
Greg Stark <gsstark@mit.edu> writes: > a) I don't see why you would assume branch prediction would be ineffective > here. There are probably a few arms of the switch that are more common than > all the others, especially when a large query is evaluating the same > expression over and over again. Sure, but the problem from the CPU's point of view is that the pattern isn't 11111111...2222222...3333333 but 123123123123123... For instance when evaluating "var op const" over and over, as in this example, the branches will go to ExecEvalVar, ExecEvalConst, ExecEvalOper, ExecEvalVar, ExecEvalConst, etc etc. I know of no modern CPU that will guess right in that scenario. Inlining ExecEvalExpr will push back this uncertainty to the call points --- which won't make things better, but can't make 'em any worse either. > However worrying about branch prediction is probably being premature True, I'm not sure that it's really an issue. On the other hand, I have measurements on two different architectures that show ExecEvalExpr being an unreasonably large fraction of the runtime. Given that there's hardly anything in there except a switch(), Sailesh's idea that branch misprediction is part of the problem seems like a reasonable thought. > b) Instead of storing one of a small set of function pointers in every > node of every expression, wouldn't it make more sense to have a table > lookup from node type to function pointer? That's pretty much what the ExecEvalExpr switch() does already, on most modern architectures. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > > b) Instead of storing one of a small set of function pointers in every > > node of every expression, wouldn't it make more sense to have a table > > lookup from node type to function pointer? > > That's pretty much what the ExecEvalExpr switch() does already, on most > modern architectures. Huh. <looks at the assembly on i386>. Indeed that's exactly what it's doing. Amazing that a failed branch prediction in the wrong place can cause that huge a penalty. -- greg
Greg Stark <gsstark@mit.edu> writes: > That's not true all the time, but I know 90% of my case statements are of this > form. In some ideal world postgres would recognize this form and handle it > specially using some kind of quick hash table lookup. > I don't see how to reconcile that with postgres's extensible types though. I > guess if postgres can see that every arm of the CASE is a '=' expression one > side of which is a constant expression and all of the same type then it could > use the same operators that the hash index code uses? That seems like it would > be a lot of work though. Yeah, trying to hash it is more work than I feel like doing. However I think it would be a good idea to avoid multiple evaluations of the CASE test expression, both for speed and to respect the principle of least surprise. If the test expression is volatile, our present behavior is unlikely to make the user happy. The idea I was toying with is to generate, not "x = y" with repeated copies of x, but "placeholder = y" where placeholder is a dummy expression tree node. Then at runtime, the CASE code would evaluate the test expression once and save it into the econtext so the dummy node(s) could regurgitate that value. We already have exactly such a mechanism in place to handle the VALUE keyword for domain check constraints; it'd be easy to duplicate it for CASE. regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

  Tom> The idea I was toying with is to generate, not "x = y" with
  Tom> repeated copies of x, but "placeholder = y" where placeholder
  Tom> is a dummy expression tree node. Then at runtime, the CASE
  Tom> code would evaluate the test expression once and save it into
  Tom> the econtext so the dummy node(s) could regurgitate that
  Tom> value. We already have exactly such a mechanism in place to
  Tom> handle the VALUE keyword for domain check constraints; it'd
  Tom> be easy to duplicate it for CASE.

That is exactly what I was proposing. I implemented something like this in TCQ and used a Const expression tree node. This was for something we call "grouped filters" where we build an index on predicates from multiple queries. So if you have a bunch of queries (say ~1000) each with a predicate "R.a ??? xxx" where ??? is one of <, >, <=, >=, = then we evaluate using the predicate index which queries fail for each incoming tuple.

In a separate experiment we found that ExecEvalVar is particularly expensive for us .. this is because we have an IntermediateHeapTuple data structure to represent join tuples (in our framework, join orders are not fixed) .. the IHT has a set of pointers to the constituent tuples. This means that we have to do more work in ExecEvalVar .. essentially one more lookup into the IHT.

All this was only possible because you guys kept around the varnoold and the attnoold !!

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh