
Reducing expression evaluation overhead

From
Tom Lane
Date:
I've been looking at an example provided by Arjen van der Meijden in
which practically all the runtime goes into evaluating very trivial
comparison expressions (it's basically a CASE statement with a huge
number of arms).  Whether or not you think that a CASE with a huge
number of arms is a particularly sane thing to do, it does seem that
the overhead is uncomfortably large.  I get profiles like this:
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 38.15     41.92    41.92   229646     0.00     0.00  nconc
 21.76     65.84    23.92 199054454    0.00     0.00  ExecEvalExpr
 11.38     78.34    12.50    10000     0.00     0.00  ExecEvalCase
  8.43     87.61     9.27 66348151     0.00     0.00  ExecEvalFuncArgs
  8.12     96.54     8.93 66348151     0.00     0.00  ExecMakeFunctionResult
  2.96     99.78     3.25 66348151     0.00     0.00  ExecEvalVar
  1.23    101.14     1.36    10058     0.00     0.00  AllocSetCheck
  1.23    102.49     1.35 66348151     0.00     0.00  ExecEvalOper
  1.12    103.72     1.24    76537     0.00     0.00  OpernameGetCandidates
  0.85    104.66     0.94 66424693     0.00     0.00  int4eq
 
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 25.24    161.92   161.92    51488     3.14     3.14  nconc
 20.99    296.57   134.65 172163502    0.00     0.00  ExecEvalExpr
 14.90    392.18    95.61                             _mcount
  9.66    454.16    61.98 57381167     0.00     0.00  ExecMakeFunctionResult
  7.32    501.12    46.96    10000     4.70     4.70  ExecEvalCase
  7.24    547.60    46.48 57381167     0.00     0.00  ExecEvalFuncArgs
  6.47    589.09    41.49 57381167     0.00     0.00  ExecEvalVar
  2.90    607.69    18.60 57381167     0.00     0.00  ExecEvalOper
  1.29    615.94     8.25                             int4eq
 

The above are from different-size test cases on different hardware,
just to make sure that the data is reasonably trustworthy.

The nconc entry may be ignored; it's lappend() overhead from the parser,
and will drop into the noise once Neil gets his list rewrite done.  The
thing that jumps out at me is that ExecEvalExpr is taking way more
cycles than it has any right to do.

The test conditions in the CASE are of the form "var = constant", and so
for each tested condition, ExecEvalExpr is invoked three times: one call
is passed off to ExecEvalVar, one to ExecEvalOper, and the constant is
handled inline by fairly trivial code, which is surely a good bit
cheaper than ExecEvalVar.  So it seems that a very large fraction of the
runtime is being spent just in ExecEvalExpr's switching function, plus
the overhead of an extra level of function call to reach the routines
that actually do the work.

I am thinking about attacking this by expanding ExprState nodes to
include a function pointer to the appropriate evaluation subroutine
(ExecEvalVar, ExecEvalOper, etc).  ExecEvalExpr would be replaced by
a macro like this:

#define ExecEvalExpr(expr, econtext, isNull, isDone) \
    ((*(expr)->evalfunc) (expr, econtext, isNull, isDone))

This would eliminate the overhead of the switch() in ExecEvalExpr, and
reduce two levels of function call to one for most node types, at the
cost of whatever the delta is between a direct and an indirect function
call.  In my experience that delta is pretty small and we should come
out ahead.

The main downside I can see is adding another four or eight bytes to
each ExprState node, which might be a noticeable memory burden with
complex expressions.  A lesser objection is that unless we wanted to
complicate and slow down the ExecEvalExpr macro, we'd have to assume
that the macro is never passed a null expression pointer.  Right now
ExecEvalExpr treats a NULL input as if it were a null constant node.
However, I'm fairly sure that that test is useless because no one ever
tries to evaluate a null expression tree.

We could make some other marginal hacks too once we did this.  For
instance, ExecEvalOper just hands off control to ExecMakeFunctionResult
once it's made sure the fcache has been initialized.  We could have it
do that initialization and then change the evalfunc pointer to point
directly to ExecMakeFunctionResult, thereby eliminating another level of
function-call overhead on all subsequent calls.  There are other tests
that could be reduced from every-time to one-time overhead with similar
tricks.

I'm not sure that this would let us catch up to what Arjen reports as
MySQL's expression evaluation speed, but it should at least speed things
up a bit with only fairly localized changes.

Comments?
        regards, tom lane


Re: Reducing expression evaluation overhead

From
Sailesh Krishnamurthy
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
   Tom> I'm not sure that this would let us catch up to what Arjen
   Tom> reports as MySQL's expression evaluation speed, but it should
   Tom> at least speed things up a bit with only fairly localized
   Tom> changes.

I like the idea of memoizing the switch with function pointers as I
don't think branch prediction helps much with varying switch arms
selected with different exprs. Also I agree that the delta of indirect
function invocation is probably small.

I've forgotten the syntax of case, but for the simple form isn't
expr=const going to be the same expr for each case arm ? If that's the
case, couldn't we actually save the value of expr in a Datum and then
reuse that (through a Const) in each of the other arms to evaluate the
actual exprs ? That should reduce the number of times ExecEvalVar (and
through it heapgetattr) are called. 

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh




Re: Reducing expression evaluation overhead

From
Tom Lane
Date:
Sailesh Krishnamurthy <sailesh@cs.berkeley.edu> writes:
> I like the idea of memoizing the switch with function pointers as I
> don't think branch prediction helps much with varying switch arms
> selected with different exprs.

Check, it's hard to see how any CPU could get much traction on the
behavior of the switch jump in ExecEvalExpr.

> I've forgotten the syntax of case, but for the simple form isn't
> expr=const going to be the same expr for each case arm ? If that's the
> case, couldn't we actually save the value of expr in a Datum and then
> reuse that (through a Const) in each of the other arms to evaluate the
> actual exprs ? That should reduce the number of times ExecEvalVar (and
> through it heapgetattr) are called. 

Right now we expand the simple form of CASE into the general form:

    CASE x WHEN y THEN ... WHEN z THEN ...

becomes

    CASE WHEN x=y THEN ... WHEN x=z THEN ...
This does involve multiple evaluation of x, which'd be particularly
nasty if it's more than just a variable reference.  It's probably a good
idea to try to improve that.  But inlining ExecEvalExpr will help all
expressions not just CASE, so I'll work on that first ...
        regards, tom lane


Re: Reducing expression evaluation overhead

From
Greg Stark
Date:
Sailesh Krishnamurthy <sailesh@cs.berkeley.edu> writes:

> >>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> 
>     Tom> I'm not sure that this would let us catch up to what Arjen
>     Tom> reports as MySQL's expression evaluation speed, but it should
>     Tom> at least speed things up a bit with only fairly localized
>     Tom> changes.
> 
> I like the idea of memoizing the switch with function pointers as I
> don't think branch prediction helps much with varying switch arms
> selected with different exprs. Also I agree that the delta of indirect
> function invocation is probably small.

a) I don't see why you would assume branch prediction would be ineffective
here. There are probably a few arms of the switch that are more common than
all the others, especially when a large query is evaluating the same
expression over and over again. However worrying about branch prediction is
probably being premature when the amount of time being spent here is so large.

b) Instead of storing one of a small set of function pointers in every node of
every expression, wouldn't it make more sense to have a table lookup from node
type to function pointer? You only need one pointer for each node type, not
one for every node. NodeTags start at 0 and the highest is 900, so we're
talking about a 4k lookup table on a 32 bit machine.

> I've forgotten the syntax of case, but for the simple form isn't
> expr=const going to be the same expr for each case arm ? If that's the
> case, couldn't we actually save the value of expr in a Datum and then
> reuse that (through a Const) in each of the other arms to evaluate the
> actual exprs ? That should reduce the number of times ExecEvalVar (and
> through it heapgetattr) are called. 

That's not true all the time, but I know 90% of my case statements are of this
form. In some ideal world postgres would recognize this form and handle it
specially using some kind of quick hash table lookup. 

I don't see how to reconcile that with postgres's extensible types though. I
guess if postgres can see that every arm of the CASE is a '=' expression one
side of which is a constant expression and all of the same type then it could
use the same operators that the hash index code uses? That seems like it would
be a lot of work though.



-- 
greg



Re: Reducing expression evaluation overhead

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> a) I don't see why you would assume branch prediction would be ineffective
> here. There are probably a few arms of the switch that are more common than
> all the others, especially when a large query is evaluating the same
> expression over and over again.

Sure, but the problem from the CPU's point of view is that the pattern
isn't 11111111...2222222...3333333 but 123123123123123...  For instance
when evaluating "var op const" over and over, as in this example, the
branches will go to ExecEvalVar, ExecEvalConst, ExecEvalOper,
ExecEvalVar, ExecEvalConst, etc etc.  I know of no modern CPU that will
guess right in that scenario.  Inlining ExecEvalExpr will push back this
uncertainty to the call points --- which won't make things better, but
can't make 'em any worse either.

> However worrying about branch prediction is probably being premature

True, I'm not sure that it's really an issue.  On the other hand, I have
measurements on two different architectures that show ExecEvalExpr being
an unreasonably large fraction of the runtime.  Given that there's
hardly anything in there except a switch(), Sailesh's idea that branch
misprediction is part of the problem seems like a reasonable thought.

> b) Instead of storing one of a small set of function pointers in every
> node of every expression, wouldn't it make more sense to have a table
> lookup from node type to function pointer?

That's pretty much what the ExecEvalExpr switch() does already, on most
modern architectures.
        regards, tom lane


Re: Reducing expression evaluation overhead

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> > b) Instead of storing one of a small set of function pointers in every
> > node of every expression, wouldn't it make more sense to have a table
> > lookup from node type to function pointer?
> 
> That's pretty much what the ExecEvalExpr switch() does already, on most
> modern architectures.

Huh. <looks at the assembly on i386>. Indeed that's exactly what it's doing.
Amazing that a failed branch prediction in the wrong place can cause that huge
a penalty.

-- 
greg



Re: Reducing expression evaluation overhead

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> That's not true all the time, but I know 90% of my case statements are of this
> form. In some ideal world postgres would recognize this form and handle it
> specially using some kind of quick hash table lookup. 

> I don't see how to reconcile that with postgres's extensible types though. I
> guess if postgres can see that every arm of the CASE is a '=' expression one
> side of which is a constant expression and all of the same type then it could
> use the same operators that the hash index code uses? That seems like it would
> be a lot of work though.

Yeah, trying to hash it is more work than I feel like doing.  However I
think it would be a good idea to avoid multiple evaluations of the CASE
test expression, both for speed and to respect the principle of least
surprise.  If the test expression is volatile, our present behavior is
unlikely to make the user happy.

The idea I was toying with is to generate, not "x = y" with repeated
copies of x, but "placeholder = y" where placeholder is a dummy
expression tree node.  Then at runtime, the CASE code would evaluate the
test expression once and save it into the econtext so the dummy node(s)
could regurgitate that value.  We already have exactly such a mechanism
in place to handle the VALUE keyword for domain check constraints;
it'd be easy to duplicate it for CASE.
        regards, tom lane


Re: Reducing expression evaluation overhead

From
Sailesh Krishnamurthy
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
   Tom> The idea I was toying with is to generate, not "x = y" with
   Tom> repeated copies of x, but "placeholder = y" where placeholder
   Tom> is a dummy expression tree node.  Then at runtime, the CASE
   Tom> code would evaluate the test expression once and save it into
   Tom> the econtext so the dummy node(s) could regurgitate that
   Tom> value.  We already have exactly such a mechanism in place to
   Tom> handle the VALUE keyword for domain check constraints; it'd
   Tom> be easy to duplicate it for CASE.
 

That is exactly what I was proposing. I implemented something like
this in TCQ and used a Const expression tree node. This was for
something we call "grouped filters" where we build an index on
predicates from multiple queries. So if you have a bunch of queries
(say ~1000) each with a predicate "R.a ??? xxx" where ??? is one of
<, >, <=, >=, or =, then we evaluate, using the predicate index, which
queries fail for each incoming tuple.

In a separate experiment we found that ExecEvalVar is particularly
expensive for us .. this is because we have an IntermediateHeapTuple
data structure to represent join tuples (in our framework, join orders
are not fixed) .. the IHT has a set of pointers to the constituent
tuples. This means that we have to do more work in ExecEvalVar
.. essentially one more lookup into the IHT. All this was only
possible because you guys kept around the varnoold and the attnoold !!

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh