Re: Why JIT speed improvement is so modest? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Why JIT speed improvement is so modest?
Msg-id 20191204194332.eurzkwkqhlsbbd73@alap3.anarazel.de
In response to Why JIT speed improvement is so modest?  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Responses Re: Why JIT speed improvement is so modest?
List pgsql-hackers
Hi,

On 2019-11-25 18:09:29 +0300, Konstantin Knizhnik wrote:
> I wonder why even at this query, which seems to be ideal use case for JIT,
> we get such modest improvement?

I think there are a number of causes:

1) There are bottlenecks elsewhere:
   - The order of sequential scan memory accesses is bad
     https://www.postgresql.org/message-id/20161030073655.rfa6nvbyk4w2kkpk%40alap3.anarazel.de

     In my experiments, fixing that yields larger JIT improvements,
     because less time is spent stalling due to cache misses during
     tuple deforming (needing the tuple's natts at the start prevents
     out-of-order execution from hiding the relevant latency).


   - The transition function for floating point aggregates is pretty
     expensive. In particular, we compute the full Youngs-Cramer stuff
     for sum/avg, even though it isn't actually needed there. This
     has become measurably worse with
     https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e954a727f0c8872bf5203186ad0f5312f6183746
     In this case the transition functions are apparently complicated
     enough that they are too expensive to inline.

   - float4/8_accum use arrays to store the transition state. That's
     noticeably more expensive than just accessing a struct, partially
     because more checks need to be done. We really should move most,
     if not all, aggregates that use array transition states to
     "internal" type transition states. Probably with some reusable
     helpers to make it easier to write serialization / deserialization
     functions so we can continue to allow parallelism.

   - The per-row overhead on lower levels of the query is
     significant. E.g. in your profile the
     HeapTupleSatisfiesVisibility() calls (you'd largely get rid of
     these by freezing) and the hashtable overhead are quite
     noticeable. JITing expression eval doesn't fix that.

   ...


2) The code generated for JIT isn't that good. In particular, the
   external memory references included in the generated code limit the
   optimization potential quite substantially. There's also quite some
   (not just JIT) improvement potential related to the aggregation code,
   simplifying the generated expressions.

   See https://www.postgresql.org/message-id/20191023163849.sosqbfs5yenocez3%40alap3.anarazel.de
   for my attempt at improving the situation. It does measurably
   improve the situation for Q1, while still leaving a lot of further
   improvements to be done.  You'd be more than welcome to review some
   of that!


3) Plenty of crucial code is not JITed, even when expression
   related. Most crucial for Q1 is the fact that the hash computation
   for aggregates isn't JITed as a whole - when looking at hierarchical
   profiles, we spend about 1/3 of the whole query time within
   TupleHashTable*.

4) The currently required forming / deforming of tuples into minimal
   tuples when storing them in the hashagg table is *expensive*.

   We can address that partially by computing NOT NULL information for
   the tupledesc used for the hashtable (which will make JITed tuple
   deforming considerably faster, because it'll just be a reference to
   a hardcoded offset).

   We can also simplify the minimal tuple representation - historically
   it looks the way it does now because we needed minimal tuples to be
   largely compatible with heap tuples - but we don't anymore. Even just
   removing the weird offset math we do for minimal tuples would be
   beneficial, but I think we can do more than that.
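
To make the point in 1) about the float aggregate transition functions
more concrete, here is a minimal C sketch. It is not PostgreSQL's actual
code: the struct FloatAggState and the function names are hypothetical,
and the struct merely stands in for an "internal" transition state as
opposed to an array. It contrasts a Youngs-Cramer style update, which
maintains N, Sx and Sxx, with the cheaper update that would be
sufficient for sum/avg.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct-based transition state (illustration only). */
typedef struct FloatAggState
{
    double N;    /* number of inputs seen so far */
    double Sx;   /* sum of the inputs */
    double Sxx;  /* sum of squared deviations from the mean */
} FloatAggState;

/* Youngs-Cramer style update: maintains N, Sx and Sxx for every row. */
static void
accum_youngs_cramer(FloatAggState *st, double newval)
{
    double oldN = st->N;

    st->N += 1.0;
    st->Sx += newval;
    if (oldN > 0.0)
    {
        double tmp = newval * st->N - st->Sx;

        /* extra multiply and divide per row, only useful for variance */
        st->Sxx += tmp * tmp / (st->N * oldN);
    }
}

/* Cheaper update that is enough for sum() and avg(): Sxx is never touched. */
static void
accum_sum_only(FloatAggState *st, double newval)
{
    st->N += 1.0;
    st->Sx += newval;
}

/* Final function for avg(); requires at least one input. */
static double
agg_avg(const FloatAggState *st)
{
    assert(st->N > 0.0);
    return st->Sx / st->N;
}
```

Both update functions give the same avg() for the same inputs; the
difference is only the per-row work spent on Sxx, which sum/avg never
read.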



> Vitesse DB reports 8x speedup on Q1,
> ISP-RAS JIT version  provides 3x speedup of Q1:

I think those measurements were done before a lot of generic
improvements to aggregation speed were done. E.g. Q1 performance
improved significantly due to the new expression evaluation engine, even
without JIT. Because the previous tree-walking expression evaluation was
so slow for many things, JITing that away obviously yielded bigger
improvements than it does now.


> VOPS provides 10x improvement of Q1.

My understanding of VOPS is that it ferries around more than one tuple
at a time. And avoids a lot of generic code paths. So that just doesn't
seem a meaningful comparison.


> In theory by elimination of interpretation overhead JIT should provide
> performance comparable with vectorized executor.

I don't think that's true at all. Vectorized execution, which I assume
to mean dealing with more than one tuple at a time, is largely
orthogonal to the way expressions are evaluated. The reason that
vectorized execution is good is that it drastically increases cache
locality (by performing work that accesses related data, e.g. a buffer
page, in a tight loop, without a lot of other work happening in
between), that it increases the benefits of out-of-order execution (by
removing dependencies, as e.g. predicates for multiple tuples can be
computed without a separate dependency on the result of each predicate
evaluation), etc.

JIT compiled expression evaluation cannot get you these benefits.
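
As a rough illustration of the difference, here is a small C sketch
under stated assumptions: the Row type, the column arrays, BATCH_SIZE
and all function names are hypothetical and have nothing to do with the
PostgreSQL executor or with VOPS. The first function handles one tuple
at a time, with the predicate result gating the work for that tuple.
The batched variant first evaluates the predicate for the whole batch
in a tight, branch-free loop, producing a selection vector, and only
then aggregates the selected entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_SIZE 1024     /* hypothetical batch size */

/* Hypothetical row layout for the tuple-at-a-time variant. */
typedef struct Row
{
    int32_t quantity;
    double  price;
} Row;

/* Tuple-at-a-time: each row is filtered and aggregated before the next. */
static double
sum_tuple_at_a_time(const Row *rows, size_t nrows, int32_t max_quantity)
{
    double sum = 0.0;

    for (size_t i = 0; i < nrows; i++)
    {
        if (rows[i].quantity <= max_quantity)
            sum += rows[i].price;
    }
    return sum;
}

/*
 * Batched step 1: evaluate the predicate for a whole batch. Iterations do
 * not branch on the predicate result. sel must have room for n entries.
 */
static size_t
filter_batch(const int32_t *quantity, size_t n, int32_t max_quantity,
             uint16_t *sel)
{
    size_t nsel = 0;

    assert(n <= BATCH_SIZE);
    for (size_t i = 0; i < n; i++)
    {
        sel[nsel] = (uint16_t) i;
        nsel += (size_t) (quantity[i] <= max_quantity);
    }
    return nsel;
}

/* Batched step 2: aggregate only the entries named by the selection vector. */
static double
sum_selected(const double *price, const uint16_t *sel, size_t nsel)
{
    double sum = 0.0;

    for (size_t i = 0; i < nsel; i++)
        sum += price[sel[i]];
    return sum;
}
```

A plain C loop cannot reproduce the per-tuple overhead of a real
executor, so this only shows the shape of the two approaches, not their
relative cost.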


> In most programming languages using JIT compiler instead of byte-code
> interpreter provides about 10x speed improvement.

But that's with low level bytecode execution, whereas expression
evaluation uses relatively coarse ops (sometimes called "super"
opcodes).
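
A toy interpreter can illustrate the difference in granularity. This is
only a sketch under stated assumptions: the opcodes, the Instr struct
and run() are hypothetical and are not PostgreSQL's ExecInterpExpr. The
same expression, cols[0] * cols[1] + cols[2], is evaluated once with
fine-grained stack ops and once with a single coarse op, and the number
of dispatches is counted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes: four fine-grained ones and one coarse one. */
typedef enum OpCode
{
    OP_LOAD,    /* push cols[a] */
    OP_ADD,     /* pop two values, push their sum */
    OP_MUL,     /* pop two values, push their product */
    OP_MULADD,  /* coarse op: push cols[a] * cols[b] + cols[c] */
    OP_DONE     /* return the single value left on the stack */
} OpCode;

typedef struct Instr
{
    OpCode op;
    int    a, b, c;     /* column indexes; unused ones are 0 */
} Instr;

/* Runs prog against cols and counts how many instructions were dispatched. */
static double
run(const Instr *prog, const double *cols, size_t *ndispatch)
{
    double stack[8];
    size_t sp = 0;

    *ndispatch = 0;
    for (const Instr *pc = prog;; pc++)
    {
        (*ndispatch)++;
        switch (pc->op)
        {
            case OP_LOAD:
                assert(sp < 8);
                stack[sp++] = cols[pc->a];
                break;
            case OP_ADD:
                assert(sp >= 2);
                sp--;
                stack[sp - 1] += stack[sp];
                break;
            case OP_MUL:
                assert(sp >= 2);
                sp--;
                stack[sp - 1] *= stack[sp];
                break;
            case OP_MULADD:
                assert(sp < 8);
                stack[sp++] = cols[pc->a] * cols[pc->b] + cols[pc->c];
                break;
            case OP_DONE:
                assert(sp == 1);
                return stack[0];
        }
    }
}
```

With fine-grained ops the program is LOAD, LOAD, MUL, LOAD, ADD, DONE,
i.e. six dispatches around very little real work each; with the coarse
op it is MULADD, DONE. The coarser the ops, the smaller the share of
time spent on dispatch, and so the less there is for a JIT to remove.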



> Below are tops of profiles (functions with more than 1% of time):
>
> JIT:

Note that just looking at a plain profile, without injecting information
about the JITed code, will yield misleading results. Without the
additional information perf will not be able to attribute the sampled
instructions of the JITed code to a function, leading to them each
being listed separately.

If you enable jit_profiling_support, and measure with

    perf record -k 1 -o /tmp/perf.data -p 22950

(optionally with --call-graph lbr), you can then inject the information
about JITed code:

    perf inject -v --jit -i /tmp/perf.data -o /tmp/perf.jit.data

and look at the result of that with

    perf report -i /tmp/perf.jit.data


>   10.98%  postgres  postgres            [.] float4_accum
>    8.40%  postgres  postgres            [.] float8_accum
>    7.51%  postgres  postgres            [.] HeapTupleSatisfiesVisibility
>    5.92%  postgres  postgres            [.] ExecInterpExpr
>    5.63%  postgres  postgres            [.] tts_minimal_getsomeattrs

The fact that ExecInterpExpr, tts_minimal_getsomeattrs show up
significantly suggests that you're running a slightly older build,
without a few bugfixes. Could that be true?

Greetings,

Andres Freund


