Re: Why JIT speed improvement is so modest? - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: Why JIT speed improvement is so modest?
Date
Msg-id 7cad5bf3-6a00-7434-d9df-d7f68078aebd@postgrespro.ru
In response to Re: Why JIT speed improvement is so modest?  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

Thank you for your reply and explanations.
My comments are inline.

On 04.12.2019 22:43, Andres Freund wrote:
> Hi,
>
> On 2019-11-25 18:09:29 +0300, Konstantin Knizhnik wrote:
>> I wonder why even at this query, which seems to be ideal use case for JIT,
>> we get such modest improvement?
> I think there's a number of causes:
>
> 1) There's bottlenecks elsewhere:
>     - The order of sequential scan memory accesses is bad
>       https://www.postgresql.org/message-id/20161030073655.rfa6nvbyk4w2kkpk%40alap3.anarazel.de
>
>       In my experiments, fixing that yields larger JIT improvements,
>       because less time is spent stalling due to cache misses during
>       tuple deforming (needing the tuple's natts at the start prevents
>       out-of-order from hiding the relevant latency).

This is why I have implemented my own in-memory table access method.
It stores tuples in unpacked format, so there should be no tuple
deforming overhead.
By the way, if somebody is interested (mostly for experiments; I do not
think that in its current state it has much practical value),
my in-memory storage implementation is here:

https://github.com/postgrespro/postgresql.builtin_pool/tree/inmem_am
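The layout idea can be sketched roughly like this (a toy illustration, not the actual inmem_am code; all types and names here are simplified stand-ins for their PostgreSQL counterparts):

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for PostgreSQL's Datum. */
typedef unsigned long Datum;

/* A row is kept pre-deformed: one Datum and one null flag per
 * attribute, so a scan never has to decode a packed heap tuple. */
typedef struct InMemRow
{
    Datum  *values;
    bool   *isnull;
} InMemRow;

typedef struct InMemTable
{
    int       natts;
    int       nrows;
    int       capacity;
    InMemRow *rows;
} InMemTable;

static InMemTable *
inmem_create(int natts)
{
    InMemTable *t = malloc(sizeof(InMemTable));

    t->natts = natts;
    t->nrows = 0;
    t->capacity = 16;
    t->rows = malloc(t->capacity * sizeof(InMemRow));
    return t;
}

static void
inmem_insert(InMemTable *t, const Datum *values, const bool *isnull)
{
    InMemRow *row;

    if (t->nrows == t->capacity)
    {
        t->capacity *= 2;
        t->rows = realloc(t->rows, t->capacity * sizeof(InMemRow));
    }
    row = &t->rows[t->nrows++];
    row->values = malloc(t->natts * sizeof(Datum));
    row->isnull = malloc(t->natts * sizeof(bool));
    memcpy(row->values, values, t->natts * sizeof(Datum));
    memcpy(row->isnull, isnull, t->natts * sizeof(bool));
}

/* "Deforming" a stored row is just returning pointers: no varlena
 * decoding, no alignment math, no loop over natts. */
static const InMemRow *
inmem_getrow(const InMemTable *t, int i)
{
    return &t->rows[i];
}
```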

>
>
>     - The transition function for floating point aggregates is pretty
>       expensive. In particular, we compute the full youngs-cramer stuff
>       for sum/avg, even though they aren't actually needed there. This
>       has become measurably worse with
>       https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e954a727f0c8872bf5203186ad0f5312f6183746
>       In this case it's complicated enough apparently that the transition
>       functions are too expensive to inline.
>
>     - float4/8_accum use arrays to store the transition state. That's
>       noticably more expensive than just accessing a struct, partially
>       because more checks needs to be done. We really should move most,
>       if not all, aggregates that use array transition states to
>       "internal" type transition states. Probably with some reusable
>       helpers to make it easier to write serialization / deserialization
>       functions so we can continue to allow parallelism.

Yes, it is true.
The profile shows that packing/unpacking the transition state takes a
substantial amount of time.
But the Youngs-Cramer stuff is not used for the SUM aggregate! It uses
float4pl for accumulation and float4 as the transition type.
When I replace AVG with SUM, query execution time with the in-mem
storage decreases from 6 seconds to 4 seconds.
But in VOPS the improvement was even larger: 700 msec vs. 1865 msec. So
the performance gap grows even wider.

And the profile shows that the aggregate infrastructure overhead disappears:

    8.82%  postgres  postgres           [.] ExecScan
    4.60%  postgres  postgres           [.] fetch_input_tuple
    4.51%  postgres  postgres           [.] inmem_getnextslot
    3.48%  postgres  postgres           [.] SeqNext
    3.13%  postgres  postgres           [.] ExecAgg
    2.78%  postgres  postgres           [.] MemoryContextReset
    1.77%  postgres  perf-27660.map     [.] 0x00007fcb81ae9032
    1.52%  postgres  perf-27660.map     [.] 0x00007fcb81ae9a4a
    1.50%  postgres  perf-27660.map     [.] 0x00007fcb81ae9fa2
    1.49%  postgres  perf-27660.map     [.] 0x00007fcb81ae9dcd
    1.44%  postgres  perf-27660.map     [.] 0x00007fcb81aea205
    1.42%  postgres  perf-27660.map     [.] 0x00007fcb81ae9072
    1.31%  postgres  perf-27660.map     [.] 0x00007fcb81ae9a56
    1.22%  postgres  perf-27660.map     [.] 0x00007fcb81ae9df1
    1.22%  postgres  perf-27660.map     [.] 0x00007fcb81aea225
    1.21%  postgres  perf-27660.map     [.] 0x00007fcb81ae93e6
    1.21%  postgres  perf-27660.map     [.] 0x00007fcb81ae9fae
    1.19%  postgres  perf-27660.map     [.] 0x00007fcb81ae9c83
    1.12%  postgres  perf-27660.map     [.] 0x00007fcb81ae9e5b
    1.12%  postgres  perf-27660.map     [.] 0x00007fcb81ae9c5f
    1.05%  postgres  perf-27660.map     [.] 0x00007fcb81ae9010
    1.05%  postgres  perf-27660.map     [.] 0x00007fcb81ae987b

As far as I understand, the entries in the profile starting from the
7th correspond to the JITed code.


>     - The per-row overhead on lower levels of the query is
>       significant. E.g. in your profile the
>       HeapTupleSatisfiesVisibility() calls (you'd get largely rid of this
>       by freezing), and the hashtable overhead is quite noticable. JITing
>       expression eval doesn't fix that.

Once again: my in-memory storage doesn't perform visibility checks.
This was the primary idea of my experiment: try to minimize per-row
storage overhead and check whether JIT can provide performance
comparable with a vectorized engine. Unfortunately the answer was
negative: the difference with VOPS is more than three times, while the
difference between a standard table and the in-memory table is less
than 1.5 times.

>
>     ...
>
>
> 2) The code generated for JIT isn't that good. In particular, the
>     external memory references included in the generated code limit the
>     optimization potential quite substantially. There's also quite some
>     (not just JIT) improvement potential related to the aggregation code,
>     simplifying the generated expressions.
>
>     See https://www.postgresql.org/message-id/20191023163849.sosqbfs5yenocez3%40alap3.anarazel.de
>     for my attempt at improving the situation. It does measurably
>     improve the situation for Q1, while still leaving a lot of further
>     improvements to be done.  You'd be more than welcome to review some
>     of that!
>
>
> 3) Plenty of crucial code is not JITed, even when expression
>     related. Most crucial for Q1 is the fact that the hash computation
>     for aggregates isn't JITed as a whole - when looking at hierarchical
>     profiles, we spend about 1/3 of the whole query time within
>     TupleHashTable*.
> 4) The currently required forming / deforming of tuples into minimal
>     tuples when storing them in the hashagg table is *expensive*.
>
>     We can address that partially by computing NOT NULL information for
>     the tupledesc used for the hashtable (which will make JITed tuple
>     deforming considerably faster, because it'll just be a reference to
>     an hardcoded offset).
>
>     We can also simplify the minimal tuple representation - historically
>     it looks the way it does now because we needed minimal tuples to be
>     largely compatible with heap tuples - but we don't anymore. Even just
>     removing the weird offset math we do for minimal tuples would be
>     beneficial, but I think we can do more than that.
>

Yes, this is the first thing I noticed. VOPS calls the hash
function only once per 64 rows - 64 times less often than row storage.
This is why VOPS is 6 times faster on Q1 than vanilla Postgres and 5
times faster than my in-memory storage.
And this is why I removed the aggregation from Q1 and just calculated
grand aggregates.
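The per-row vs. per-tile hashing cost can be sketched like this (toy code; the 64-row tile and the hash function are illustrative, not VOPS internals). When all rows of a tile share one group key, the hash is computed once per 64 rows instead of once per row:

```c
#include <string.h>

#define TILE_SIZE 64
#define NBUCKETS  1024

static unsigned long hash_calls = 0;    /* counts hash invocations */

static unsigned
hash_key(int key)
{
    hash_calls++;
    return (unsigned) key * 2654435761u % NBUCKETS;
}

/* Row-at-a-time aggregation: one hash computation per row. */
static void
agg_per_row(const int *keys, const double *vals, int n, double *buckets)
{
    for (int i = 0; i < n; i++)
        buckets[hash_key(keys[i])] += vals[i];
}

/* Tile-at-a-time aggregation: rows of a tile share the same key,
 * so the bucket is located once per tile. */
static void
agg_per_tile(const int *keys, const double *vals, int n, double *buckets)
{
    for (int i = 0; i < n; i += TILE_SIZE)
    {
        unsigned b = hash_key(keys[i]);

        for (int j = i; j < i + TILE_SIZE && j < n; j++)
            buckets[b] += vals[j];
    }
}
```

Both paths produce identical buckets; only the number of hash invocations differs (n vs. n/64).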

>
>> VOPS provides 10x improvement of Q1.
> My understanding of VOPS is that it ferries around more than one tuple
> at a time. And avoids a lot of generic code paths. So that just doesn't
> seem a meaningful comparison.

VOPS is just an example of a vectorized executor.
It is possible to implement the things VOPS does with customized types
using custom nodes instead, as in Hubert Zhang's prototype:

https://www.postgresql.org/message-id/flat/CAB0yrenxJ3FcmnLs8JqpEG3tzSZ%3DOL1MZBUh3v6dgH%2Bo70GTFA%40mail.gmail.com#df50bbf3610dc2f42cb9b54423a22111


>> In theory by elimination of interpretation overhead JIT should provide
>> performance comparable with vecrtorized executor.
> I don't think that's true at all. Vectorized execution, which I assume
> to mean dealing with more than one tuple at a time, is largely
> orthogonal to the way expressions are evaluated. The reason that
> vectorized execution is good is that it drastically increases cache
> locality (by performing work that accesses related data, e.g. a buffer
> page, in a tight loop, without a lot of other work happening inbetween),
> that it increases the benefits of out of order execution (by removing
> dependencies, as e.g. predicates for multiple tuples can be computed,
> without a separate dependency on the result for each predicate
> evaluation), etc.
>
> JIT compiled expression evaluation cannot get you these benefits.

Yes, I know these arguments.
But please look: here in Q1 the lineitem table projection has just 4
float4 attributes, and we calculate 7 aggregates for them.
It seems to me that in this case the CPU cache will be used even more
efficiently with horizontal calculation.
At least, if you implement the corresponding query in C, the version
working with an array of structs is almost two times faster than the
version working with vertical arrays.


>
> Note that just looking at a plain porfile, without injecting information
> about the JITed code, will yield misleading results. Without the
> additional information perf will not be able to group the instructions
> of the JITed code sampled to a function, leading to them each being
> listed separately.
>
> If you enable jit_profiling_support, and measure with
>
> perf record -k 1 -o /tmp/perf.data -p 22950
> (optionally with --call-graph lbr)
> you then can inject the information about JITed code:
> perf inject -v --jit -i /tmp/perf.data -o /tmp/perf.jit.data
> and look at the result of that with
> perf report -i /tmp/perf.jit.data
>
Something is not working properly in my case:

root@knizhnik:~# perf record -k 1 -o /tmp/perf.data -p 7407
^C[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.452 MB /tmp/perf.data (11410 samples) ]

root@knizhnik:~# perf inject -v --jit -i /tmp/perf.data -o /tmp/perf.jit.data
build id event received for [kernel.kallsyms]: 
b1ef0f6204a7ec3f508b9e1536f73521c7b4b41a
build id event received for /home/knizhnik/postgresql/dist/bin/postgres: 
8ef1a41e80f043a56778e265f5badb67f1441b61
build id event received for [vdso]: b13824592e1e837368d92991b72a19437dc86a27
Looking at the vmlinux_path (8 entries long)
symsrc__init: cannot get elf header.
Using /proc/kcore for kernel object code
Using /proc/kallsyms for symbols
Using CPUID GenuineIntel-6-3C
root@knizhnik:~# perf report -i /tmp/perf.jit.data

    7.37%  postgres  postgres           [.] ExecScan
    7.23%  postgres  postgres           [.] inmem_getnextslot
    4.79%  postgres  postgres           [.] fetch_input_tuple
    4.07%  postgres  postgres           [.] SeqNext
    3.52%  postgres  postgres           [.] ExecAgg
    2.68%  postgres  postgres           [.] MemoryContextReset
    1.62%  postgres  perf-7407.map      [.] 0x00007f4591c95f02
    1.50%  postgres  perf-7407.map      [.] 0x00007f4591c95d2d
...
> The fact that ExecInterpExpr, tts_minimal_getsomeattrs show up
> significantly suggests that you're running a slightly older build,
> without a few bugfixes. Could that be true?
I forked my branch from your commit of 27 November
(ca266a069a20c32a8f0a1df982a5a67d9483bcb3).


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



