Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT) - Mailing list pgsql-hackers

From Andres Freund
Subject Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)
Date
Msg-id 20161206232231.ajn6r5bww63v4ntu@alap3.anarazel.de
Whole thread Raw
In response to Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)  (Peter Geoghegan <pg@heroku.com>)
Responses Re: [HACKERS] WIP: Faster Expression Processing and Tuple Deforming (including JIT)
List pgsql-hackers
On 2016-12-06 13:27:14 -0800, Peter Geoghegan wrote:
> On Mon, Dec 5, 2016 at 7:49 PM, Andres Freund <andres@anarazel.de> wrote:
> > I tried to address 2) by changing the C implementation. That brings some
> > measurable speedups, but it's not huge. A bigger speedup is making
> > slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial wrappers;
> > but it's still not huge.  Finally I turned to just-in-time (JIT)
> > compiling the code for tuple deforming. That doesn't save the cost of
> > 1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01).  The
> > first part is done in 0008, the JITing in 0012.
>
> A more complete motivating example would be nice. For example, it
> would be nice to see the overall speedup for some particular TPC-H
> query.

Well, it's a bit WIP-y for that - not all TPCH queries run JITed yet, as
I've not done that for enough expression types... And you run quickly
into other bottlenecks.

But here we go for TPCH (scale 10) Q01:
master:
Time: 33885.381 ms

profile:
(note that expression evaluation work is distributed among many
functions)

16.29%  postgres  postgres  [.] slot_getattr
12.85%  postgres  postgres  [.] ExecMakeFunctionResultNoSets
10.85%  postgres  postgres  [.] advance_aggregates
 6.91%  postgres  postgres  [.] slot_deform_tuple
 6.70%  postgres  postgres  [.] advance_transition_function
 4.59%  postgres  postgres  [.] ExecProject
 4.25%  postgres  postgres  [.] float8_accum
 3.69%  postgres  postgres  [.] tuplehash_insert
 2.39%  postgres  postgres  [.] float8pl
 2.20%  postgres  postgres  [.] bpchareq
 2.03%  postgres  postgres  [.] check_stack_depth

dev (no jiting):
Time: 30343.532 ms

profile:
16.57%  postgres  postgres  [.] slot_deform_tuple
13.39%  postgres  postgres  [.] ExecEvalExpr
 8.64%  postgres  postgres  [.] advance_aggregates
 8.58%  postgres  postgres  [.] advance_transition_function
 5.83%  postgres  postgres  [.] float8_accum
 5.14%  postgres  postgres  [.] tuplehash_insert
 3.89%  postgres  postgres  [.] float8pl
 3.60%  postgres  postgres  [.] slot_getattr
 2.66%  postgres  postgres  [.] bpchareq
 2.56%  postgres  postgres  [.] heap_getnext

dev (jiting):
SET jit_tuple_deforming = on;
SET jit_expressions = true;

Time: 24439.803 ms

profile:
11.11%  postgres  postgres        [.] slot_deform_tuple
10.87%  postgres  postgres        [.] advance_aggregates
 9.74%  postgres  postgres        [.] advance_transition_function
 6.53%  postgres  postgres        [.] float8_accum
 5.25%  postgres  postgres        [.] tuplehash_insert
 4.31%  postgres  perf-10698.map  [.] deform0
 3.68%  postgres  perf-10698.map  [.] evalexpr6
 3.53%  postgres  postgres        [.] slot_getattr
 3.41%  postgres  postgres        [.] float8pl
 2.84%  postgres  postgres        [.] bpchareq

(note how expression eval went from 13.39% to roughly 4%)

The slot_deform_tuple cost here is primarily cache misses. If you do the
"memory order" iteration, it drops significantly.

The JIT generated code still leaves a lot on the table, i.e. this is
definitely not the best we can do.  We also deform half the tuple twice,
because I've not yet added support for starting to deform in the middle
of a tuple.

Independent of new expression evaluation and/or JITing, if you make
advance_aggregates and advance_transition_function inline functions (or
you do profiling accounting for children), you'll notice that ExecAgg()
+ advance_aggregates + advance_transition_function themselves take up
about 20% cpu-time.  That's *not* including the hashtable management,
the actual transition functions, and such themselves.


If you have queries where tuple deforming is a bigger proportion of the
load, or where expression evaluation (including projection) is a larger
part (any NULLs, e.g.), you can get a lot bigger wins, even without
actually optimizing the generated code (which I've not yet done).

Just btw: float8_accum really should use an internal aggregation type
instead of a postgres array...


Andres


