Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT) - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT) |
Date | |
Msg-id | 20161206232231.ajn6r5bww63v4ntu@alap3.anarazel.de |
In response to | Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT) (Peter Geoghegan <pg@heroku.com>) |
Responses | Re: [HACKERS] WIP: Faster Expression Processing and Tuple Deforming(including JIT) |
List | pgsql-hackers |
On 2016-12-06 13:27:14 -0800, Peter Geoghegan wrote:
> On Mon, Dec 5, 2016 at 7:49 PM, Andres Freund <andres@anarazel.de> wrote:
> > I tried to address 2) by changing the C implementation. That brings some
> > measurable speedups, but it's not huge. A bigger speedup is making
> > slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial wrappers;
> > but it's still not huge. Finally I turned to just-in-time (JIT)
> > compiling the code for tuple deforming. That doesn't save the cost of
> > 1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01). The
> > first part is done in 0008, the JITing in 0012.
>
> A more complete motivating example would be nice. For example, it
> would be nice to see the overall speedup for some particular TPC-H
> query.

Well, it's a bit WIP-y for that - not all TPCH queries run JITed yet, as
I've not done that for enough expression types... And you run quickly
into other bottlenecks.

But here we go for TPCH (scale 10) Q01:

master:
Time: 33885.381 ms

profile:
  16.29%  postgres  postgres        [.] slot_getattr
  12.85%  postgres  postgres        [.] ExecMakeFunctionResultNoSets
  10.85%  postgres  postgres        [.] advance_aggregates
   6.91%  postgres  postgres        [.] slot_deform_tuple
   6.70%  postgres  postgres        [.] advance_transition_function
   4.59%  postgres  postgres        [.] ExecProject
   4.25%  postgres  postgres        [.] float8_accum
   3.69%  postgres  postgres        [.] tuplehash_insert
   2.39%  postgres  postgres        [.] float8pl
   2.20%  postgres  postgres        [.] bpchareq
   2.03%  postgres  postgres        [.] check_stack_depth

(note that all expression evaluated things are distributed among many
functions)

dev (no jiting):
Time: 30343.532 ms

profile:
  16.57%  postgres  postgres        [.] slot_deform_tuple
  13.39%  postgres  postgres        [.] ExecEvalExpr
   8.64%  postgres  postgres        [.] advance_aggregates
   8.58%  postgres  postgres        [.] advance_transition_function
   5.83%  postgres  postgres        [.] float8_accum
   5.14%  postgres  postgres        [.] tuplehash_insert
   3.89%  postgres  postgres        [.] float8pl
   3.60%  postgres  postgres        [.] slot_getattr
   2.66%  postgres  postgres        [.] bpchareq
   2.56%  postgres  postgres        [.] heap_getnext

dev (jiting):
SET jit_tuple_deforming = on;
SET jit_expressions = true;

Time: 24439.803 ms

profile:
  11.11%  postgres  postgres        [.] slot_deform_tuple
  10.87%  postgres  postgres        [.] advance_aggregates
   9.74%  postgres  postgres        [.] advance_transition_function
   6.53%  postgres  postgres        [.] float8_accum
   5.25%  postgres  postgres        [.] tuplehash_insert
   4.31%  postgres  perf-10698.map  [.] deform0
   3.68%  postgres  perf-10698.map  [.] evalexpr6
   3.53%  postgres  postgres        [.] slot_getattr
   3.41%  postgres  postgres        [.] float8pl
   2.84%  postgres  postgres        [.] bpchareq

(note how expression eval went from 13.39% to roughly 4%)

The slot_deform_cost here is primarily cache misses. If you do the
"memory order" iteration, it drops significantly.

The JIT generated code still leaves a lot on the table, i.e. this is
definitely not the best we can do. We also deform half the tuple twice,
because I've not yet added support for starting to deform in the middle
of a tuple.

Independent of new expression evaluation and/or JITing, if you make
advance_aggregates and advance_transition_function inline functions (or
you do profiling accounting for children), you'll notice that ExecAgg()
+ advance_aggregates + advance_transition_function themselves take up
about 20% cpu-time. That's *not* including the hashtable management,
the actual transition functions, and such themselves.

If you have queries where tuple deforming is a bigger proportion of the
load, or where expression evaluation (including projection) is a larger
part (any NULLs e.g.) you can get a lot bigger wins, even without
actually optimizing the generated code (which I've not yet done).

Just btw: float8_accum really should use an internal aggregation type
instead of using a postgres array...

Andres
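
A rough illustration of why specializing tuple deforming for a known row layout can cut per-column overhead, as discussed in the message above. This is only a sketch under simplifying assumptions: ColDesc, deform_generic and deform_int4_int8_int4 are hypothetical names, not PostgreSQL's actual structures or the patch's generated code; only fixed-width, non-NULL columns are handled; and the "specialized" function is hand-written here as a stand-in for what runtime-generated code might look like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical column descriptor, not PostgreSQL's attribute metadata. */
typedef struct ColDesc {
    size_t len;   /* fixed width in bytes: 4 or 8 in this sketch */
    size_t align; /* required alignment, a power of two */
} ColDesc;

/* Round off up to the next multiple of align (align is a power of two). */
static size_t align_up(size_t off, size_t align)
{
    return (off + align - 1) & ~(align - 1);
}

/*
 * Generic deforming: the descriptor of every column is interpreted on
 * every call, so each column costs an alignment computation and a
 * branch on the column width.
 */
static void deform_generic(const unsigned char *tuple, const ColDesc *cols,
                           int ncols, int64_t *values)
{
    size_t off = 0;

    for (int i = 0; i < ncols; i++) {
        assert(cols[i].len == 4 || cols[i].len == 8);
        off = align_up(off, cols[i].align);
        if (cols[i].len == 4) {
            int32_t v;
            memcpy(&v, tuple + off, sizeof v);
            values[i] = v;
        } else {
            int64_t v;
            memcpy(&v, tuple + off, sizeof v);
            values[i] = v;
        }
        off += cols[i].len;
    }
}

/*
 * Specialized deforming for the fixed layout (int32, int64, int32):
 * the offsets 0, 8 and 16 are compile-time constants, so there is no
 * loop, no descriptor lookup and no width branch.
 */
static void deform_int4_int8_int4(const unsigned char *tuple, int64_t *values)
{
    int32_t a, c;
    int64_t b;

    memcpy(&a, tuple + 0, sizeof a);
    memcpy(&b, tuple + 8, sizeof b);
    memcpy(&c, tuple + 16, sizeof c);
    values[0] = a;
    values[1] = b;
    values[2] = c;
}
```

Both functions produce the same values for a buffer laid out as (int32 at offset 0, int64 at offset 8, int32 at offset 16). The specialized variant simply has less work to do per row, but as noted in the message, it does nothing about cache misses when the tuple data is first touched.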