Thread: Why JIT speed improvement is so modest?

Why JIT speed improvement is so modest?

From
Konstantin Knizhnik
Date:
Right now JIT provides about a 30% improvement on the TPC-H Q1 query:

https://www.citusdata.com/blog/2018/09/11/postgresql-11-just-in-time/

I wonder why, even for this query, which seems to be an ideal use case for
JIT, we get such a modest improvement.
I raised this question several years ago, but at that time JIT was assumed
to be at an early development stage and performance aspects were less
critical than the required infrastructure changes. But now JIT seems to be
stable enough and is switched on by default.
Vitesse DB reports an 8x speedup on Q1,
and the ISP-RAS JIT version provides a 3x speedup on Q1:


https://www.pgcon.org/2017/schedule/attachments/467_PGCon%202017-05-26%2015-00%20ISPRAS%20Dynamic%20Compilation%20of%20SQL%20Queries%20in%20PostgreSQL%20Using%20LLVM%20JIT.pdf

According to this presentation, Q1 spends 6% of its time in ExecQual and
75% in ExecAgg.

VOPS provides a 10x improvement on Q1.

I had a hypothesis that the difference is caused by the way the aggregates
are calculated.
Postgres uses the Youngs-Cramer algorithm, while both the ISP-RAS JIT
version and my VOPS just accumulate results in a variable of type double.
I rewrote VOPS to use the same algorithm as Postgres, but VOPS is still
about 10 times faster.
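To illustrate the difference between the two accumulation schemes, here is a
rough sketch (simplified; the real Postgres code is float4_accum/float8_accum
in src/backend/utils/adt/float.c, which also handles NaN/Inf and works
through an array transition state):

    /* Plain accumulation, as in VOPS and the ISP-RAS JIT version: */
    typedef struct { double sum; long count; } plain_state;

    static void plain_accum(plain_state *st, double x)
    {
        st->sum += x;
        st->count++;
    }

    /* Youngs-Cramer accumulation, roughly what float4/8_accum do.
     * Sxx is maintained because the same transition state also serves
     * the variance/stddev aggregates. */
    typedef struct { double N; double Sx; double Sxx; } yc_state;

    static void yc_accum(yc_state *st, double x)
    {
        double oldN = st->N;

        st->N += 1.0;
        st->Sx += x;
        if (oldN > 0.0)
        {
            double tmp = x * st->N - st->Sx;
            st->Sxx += tmp * tmp / (st->N * oldN);
        }
    }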

Results of Q1 on scale factor 10 TPC-H data on my desktop with parallel
execution enabled:
no-JIT:                          5640 msec
JIT:                             4590 msec
VOPS:                             452 msec
VOPS + Youngs-Cramer algorithm:   610 msec

Below are the tops of the profiles (functions taking more than 1% of the time):

JIT:
   10.98%  postgres  postgres            [.] float4_accum
    8.40%  postgres  postgres            [.] float8_accum
    7.51%  postgres  postgres            [.] HeapTupleSatisfiesVisibility
    5.92%  postgres  postgres            [.] ExecInterpExpr
    5.63%  postgres  postgres            [.] tts_minimal_getsomeattrs
    4.35%  postgres  postgres            [.] lookup_hash_entries
    3.72%  postgres  postgres            [.] TupleHashTableHash.isra.8
    2.93%  postgres  postgres            [.] tuplehash_insert
    2.70%  postgres  postgres            [.] heapgettup_pagemode
    2.24%  postgres  postgres            [.] check_float8_array
    2.23%  postgres  postgres            [.] hash_search_with_hash_value
    2.10%  postgres  postgres            [.] ExecScan
    1.90%  postgres  postgres            [.] hash_uint32
    1.57%  postgres  postgres            [.] tts_minimal_clear
    1.53%  postgres  postgres            [.] FunctionCall1Coll
    1.47%  postgres  postgres            [.] pg_detoast_datum
    1.39%  postgres  postgres            [.] heapgetpage
    1.37%  postgres  postgres            [.] TupleHashTableMatch.isra.9
    1.35%  postgres  postgres            [.] ExecStoreBufferHeapTuple
    1.06%  postgres  postgres            [.] LookupTupleHashEntry
    1.06%  postgres  postgres            [.] AggCheckCallContext

no-JIT:
   26.82%  postgres  postgres            [.] ExecInterpExpr
   15.26%  postgres  postgres            [.] tts_buffer_heap_getsomeattrs
    8.27%  postgres  postgres            [.] float4_accum
    7.51%  postgres  postgres            [.] float8_accum
    5.26%  postgres  postgres            [.] HeapTupleSatisfiesVisibility
    2.78%  postgres  postgres            [.] TupleHashTableHash.isra.8
    2.63%  postgres  postgres            [.] tts_minimal_getsomeattrs
    2.54%  postgres  postgres            [.] lookup_hash_entries
    2.05%  postgres  postgres            [.] tuplehash_insert
    1.97%  postgres  postgres            [.] heapgettup_pagemode
    1.72%  postgres  postgres            [.] hash_search_with_hash_value
    1.57%  postgres  postgres            [.] float48mul
    1.55%  postgres  postgres            [.] check_float8_array
    1.48%  postgres  postgres            [.] ExecScan
    1.26%  postgres  postgres            [.] hash_uint32
    1.04%  postgres  postgres            [.] tts_minimal_clear
    1.00%  postgres  postgres            [.] FunctionCall1Coll

VOPS:
   44.25%  postgres  vops.so            [.] vops_avg_state_accumulate
   11.76%  postgres  vops.so            [.] vops_float4_avg_accumulate
    6.14%  postgres  postgres           [.] ExecInterpExpr
    5.89%  postgres  vops.so            [.] vops_float4_sub_lconst
    4.89%  postgres  vops.so            [.] vops_float4_mul
    4.30%  postgres  vops.so            [.] vops_int4_le_rconst
    2.57%  postgres  vops.so            [.] vops_float4_add_lconst
    2.31%  postgres  vops.so            [.] vops_count_accumulate
    2.24%  postgres  postgres           [.] tts_buffer_heap_getsomeattrs
    1.97%  postgres  postgres           [.] heap_page_prune_opt
    1.72%  postgres  postgres           [.] HeapTupleSatisfiesVisibility
    1.67%  postgres  postgres           [.] AllocSetAlloc
    1.47%  postgres  postgres           [.] hash_search_with_hash_value


In theory, by eliminating interpretation overhead, JIT should provide
performance comparable with a vectorized executor.
In most programming languages, using a JIT compiler instead of a byte-code
interpreter provides about a 10x speed improvement.
Certainly a DBMS engine is very different from a traditional interpreter,
and a lot of time is spent in tuple packing/unpacking (although JIT is also
used here), in heap traversal, and so on. But it is still unclear to me why,
if the ISP-RAS measurements were correct and we actually spend 75% of Q1
time in aggregation, JIT was not able to increase the speed of the Q1 query
several times.
The experiment with VOPS shows that the aggregation algorithm itself is
not the bottleneck.
The profile also gives no answer to this question.
Any ideas?



-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Why JIT speed improvement is so modest?

From
Merlin Moncure
Date:
On Mon, Nov 25, 2019 at 9:09 AM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> JIT was not able to increase the speed of the Q1 query several times?
> The experiment with VOPS shows that the aggregation algorithm itself is
> not the bottleneck.
> The profile also gives no answer to this question.
> Any ideas?

Well, in the VOPS variant around 2/3 of the time is spent in routines
that are obviously aggregation.  In the JIT version, it's around 20%.
So this suggests that the replacement execution engine is more
invasive.  I would also guess (!) that the VOPS engine optimizes fewer
classes of query plan.   ExecScan, for example, looks to be completely
optimized out in VOPS but is still utilized in the JIT engine.

I experimented with Vitesse DB a couple of years back and this was
consistent with my recollection.

merlin



Re: Why JIT speed improvement is so modest?

From
Konstantin Knizhnik
Date:

On 25.11.2019 18:24, Merlin Moncure wrote:
> On Mon, Nov 25, 2019 at 9:09 AM Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> JIT was not able to increase the speed of the Q1 query several times?
>> The experiment with VOPS shows that the aggregation algorithm itself is
>> not the bottleneck.
>> The profile also gives no answer to this question.
>> Any ideas?
> Well, in the VOPS variant around 2/3 of the time is spent in routines
> that are obviously aggregation.  In the JIT version, it's around 20%.
> So this suggests that the replacement execution engine is more
> invasive.  I would also guess (!) that the VOPS engine optimizes fewer
> classes of query plan.   ExecScan, for example, looks to be completely
> optimized out in VOPS but is still utilized in the JIT engine.

The difference in the fraction of time spent in aggregate calculation is not
so large (2 times, vs. the 10 times difference in speed).
I suspected that a lot of time is spent in the relation traversal code,
tuple unpacking and visibility checks.
To check this hypothesis I have implemented an in-memory table access method
which stores tuples in unpacked form and doesn't perform any visibility
checks at all.
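Roughly, the scan callback just hands the pre-deformed values back through a
virtual slot. A minimal sketch of what I mean is below (the InMemTuple and
InMemScanDesc structures here are made up for the example, not the exact ones
from my implementation):

    #include "postgres.h"

    #include "access/relscan.h"
    #include "access/tableam.h"
    #include "executor/tuptable.h"

    /* hypothetical structures, just for this sketch */
    typedef struct InMemTuple
    {
        Datum  *values;                 /* pre-deformed column values */
        bool   *isnull;
    } InMemTuple;

    typedef struct InMemScanDescData
    {
        TableScanDescData rs_base;      /* AM-independent part of the scan */
        InMemTuple *tuples;             /* tuples in backend-private memory */
        int         ntuples;
        int         natts;
        int         current;            /* next tuple to return */
    } InMemScanDescData;
    typedef InMemScanDescData *InMemScanDesc;

    /* scan_getnextslot callback of the table AM */
    static bool
    inmem_getnextslot(TableScanDesc scan, ScanDirection direction,
                      TupleTableSlot *slot)
    {
        InMemScanDesc iscan = (InMemScanDesc) scan;
        InMemTuple   *tup;

        if (iscan->current >= iscan->ntuples)
            return false;               /* end of table */

        tup = &iscan->tuples[iscan->current++];

        /*
         * Values are already stored as Datum/isnull arrays, so there is no
         * tuple deforming and no visibility check: just expose them through
         * a virtual slot.
         */
        ExecClearTuple(slot);
        memcpy(slot->tts_values, tup->values, sizeof(Datum) * iscan->natts);
        memcpy(slot->tts_isnull, tup->isnull, sizeof(bool) * iscan->natts);
        ExecStoreVirtualTuple(slot);

        return true;
    }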
The results were not as exciting as I had hoped. I had to disable parallel
execution (because it is not possible for tuples stored in backend-private
memory).
Results are the following:

lineitem:         13736 msec
inmem_lineitem:   10044 msec
vops_lineitem:     1945 msec

The profile of inmem_lineitem is the following:

   16.79%  postgres  postgres             [.] float4_accum
   12.86%  postgres  postgres             [.] float8_accum
    5.83%  postgres  postgres             [.] TupleHashTableHash.isra.8
    4.44%  postgres  postgres             [.] lookup_hash_entries
    3.37%  postgres  postgres             [.] check_float8_array
    3.11%  postgres  postgres             [.] tuplehash_insert
    2.91%  postgres  postgres             [.] hash_uint32
    2.83%  postgres  postgres             [.] ExecScan
    2.56%  postgres  postgres             [.] inmem_getnextslot
    2.22%  postgres  postgres             [.] FunctionCall1Coll
    2.14%  postgres  postgres             [.] LookupTupleHashEntry
    1.95%  postgres  postgres             [.] TupleHashTableMatch.isra.9
    1.76%  postgres  postgres             [.] pg_detoast_datum
    1.58%  postgres  postgres             [.] AggCheckCallContext
    1.57%  postgres  postgres             [.] tts_minimal_clear
    1.35%  postgres  perf-3054.map        [.] 0x00007f558db60010
    1.23%  postgres  postgres             [.] fetch_input_tuple
    1.15%  postgres  postgres             [.] SeqNext
    1.06%  postgres  postgres             [.] ExecAgg
    1.00%  postgres  postgres             [.] tts_minimal_store_tuple

So now the fraction of time spent in aggregation has increased to 30% (vs.
20% for lineitem and 42% for vops_lineitem).
It looks like the main bottleneck now is hashagg. VOPS accesses the hash
table about 10 times less often (because it accumulates values for the whole
tile at once), and that explains the still large difference between
vops_lineitem and inmem_lineitem.

If we remove the grouping and rewrite the Q1 query to compute only grand
aggregates:
select
     avg(l_quantity) as sum_qty,
     avg(l_extendedprice) as sum_base_price,
     avg(l_extendedprice*(1-l_discount)) as sum_disc_price,
     avg(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
     avg(l_quantity) as avg_qty,
     avg(l_extendedprice) as avg_price,
     avg(l_discount) as avg_disc,
     count(*) as count_order
from
     inmem_lineitem
where
     l_shipdate <= '1998-12-01';

then the results are the following:
lineitem:         9805 msec
inmem_lineitem:   6257 msec
vops_lineitem:    1865 msec

and now the profile of inmem_lineitem is:

   25.27%  postgres  postgres           [.] float4_accum
   21.86%  postgres  postgres           [.] float8_accum
    5.49%  postgres  postgres           [.] check_float8_array
    4.57%  postgres  postgres           [.] ExecScan
    2.61%  postgres  postgres           [.] AggCheckCallContext
    2.30%  postgres  postgres           [.] pg_detoast_datum
    2.10%  postgres  postgres           [.] inmem_getnextslot
    1.81%  postgres  postgres           [.] SeqNext
    1.73%  postgres  postgres           [.] fetch_input_tuple
    1.61%  postgres  postgres           [.] ExecAgg
    1.23%  postgres  postgres           [.] MemoryContextReset

But there is still a more than 3x difference with VOPS!
Something is wrong here...

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Why JIT speed improvement is so modest?

From
Tomas Vondra
Date:
On Wed, Nov 27, 2019 at 06:38:45PM +0300, Konstantin Knizhnik wrote:
>But still more than 3 times difference with VOPS!
>Something is wrong here...
>

I have no idea what VOPS does, but IIRC one of the bottlenecks compared
to various column stores is our iterative execution model, which makes
it difficult/impossible to vectorize operations. That's likely why the
accum functions are so high in the CPU profile.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Why JIT speed improvement is so modest?

From
Konstantin Knizhnik
Date:

On 27.11.2019 19:05, Tomas Vondra wrote:
>
> I have no idea what VOPS does, but IIRC one of the bottlenecks compared
> to various column stores is our iterative execution model, which makes
> it difficult/impossible to vectorize operations. That's likely why the
> accum functions are so high in the CPU profile.
>
> regards
>

VOPS does a very simple thing: it replaces scalar types with vector types
(tiles) and defines all standard operations for them.
It also provides Postgres aggregates for these types.
So while for a normal Postgres table the query

select sum(x) from T;

calls float4_accum for each row of T, the same query in VOPS calls
vops_float4_avg_accumulate once per tile of 64 elements.
So vops_float4_avg_accumulate is called 64 times less often than
float4_accum, and inside it is a straightforward loop:

             for (i = 0; i < TILE_SIZE; i++) {
                 sum += opd->payload[i];
             }

which can be optimized by the compiler (loop unrolling, use of SIMD
instructions, ...).
So it is no wonder that VOPS is faster than the Postgres executor.
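For reference, opd in the loop above points into a tile, which together with
the aggregate state looks roughly like this (an illustrative sketch, not the
exact VOPS definitions):

    #include <stdint.h>

    #define TILE_SIZE 64

    typedef struct
    {
        uint64_t null_mask;             /* bit i set => element i is NULL/absent */
        float    payload[TILE_SIZE];    /* 64 float4 values of one column */
    } vops_float4_tile;

    typedef struct
    {
        uint64_t count;
        double   sum;
    } vops_avg_state;

One accumulate call updates count and sum for up to 64 values at once, so the
per-call overhead (fmgr, aggregate context checks, etc.) is paid once per tile
instead of once per row.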
But Postgres now contains JIT and it is used in this case, so the
interpretation overhead of the executor should be mostly eliminated.
In theory, perfect JIT code should process rows of a horizontal data model
at the same speed as a vectorized executor processes columns of a vertical
data model.
A vertical model provides significant advantages when a query touches only a
small fraction of the columns.
But in the case of Q1 we are calculating 8 aggregates over just 4 columns,
and inmem_lineitem is actually a projection of the original lineitem table
containing only the columns needed for this query.
So the amount of fetched data in this case is almost the same for the
horizontal and vertical data models.
CPU cache effects should not play a significant role in this case either.
That is why it is not quite clear to me why there is still a big difference
(3 times) between VOPS and the in-memory table, and not so large a
difference between the normal and in-memory tables.

Concerning the large percentage of time spent in the accumulate functions -
I do not agree with you. What this query is actually doing is just
calculating aggregates.
The smaller the interpretation overhead, the larger the percentage of time
we should spend in the aggregate functions.
Maybe the whole infrastructure of Postgres aggregates adds too much
overhead (check_float8_array, function calls, ...), and in the case of VOPS
this overhead is divided by 64.
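To make that overhead concrete, this is roughly what every float8_accum call
has to do around the actual arithmetic (a simplified outline of the function
in src/backend/utils/adt/float.c; error handling is omitted):

    Datum
    float8_accum(PG_FUNCTION_ARGS)
    {
        /* fmgr call overhead; the transition array may need detoasting */
        ArrayType  *transarray = PG_GETARG_ARRAYTYPE_P(0);
        float8      newval = PG_GETARG_FLOAT8(1);
        float8     *transvalues;
        float8      N, Sx, Sxx;

        /* verify on every call that this really is a 3-element float8 array */
        transvalues = check_float8_array(transarray, "float8_accum", 3);
        N = transvalues[0];
        Sx = transvalues[1];
        Sxx = transvalues[2];

        /* ... Youngs-Cramer update of N, Sx, Sxx ... */

        if (AggCheckCallContext(fcinfo, NULL))
        {
            /* called as an aggregate transition function: update in place */
            transvalues[0] = N;
            transvalues[1] = Sx;
            transvalues[2] = Sxx;
            PG_RETURN_ARRAYTYPE_P(transarray);
        }
        else
        {
            /* otherwise a brand-new 3-element array has to be built */
            Datum   values[3];

            values[0] = Float8GetDatum(N);
            values[1] = Float8GetDatum(Sx);
            values[2] = Float8GetDatum(Sxx);
            PG_RETURN_ARRAYTYPE_P(construct_array(values, 3, FLOAT8OID,
                                                  sizeof(float8),
                                                  FLOAT8PASSBYVAL, 'd'));
        }
    }

In VOPS all of this happens once per 64 values rather than once per row.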


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Why JIT speed improvement is so modest?

From
guangnan he
Date:
Hi,
do you mean that if we don't add new compiler options, the compiler will do the loop unrolling and use SIMD automatically?
Besides the function calls, cache misses, etc., for VOPS I think the call stack is also squeezed, but the JIT-optimized code still processes rows one by one.



--
Guang-Nan He

Re: Why JIT speed improvement is so modest?

From
Konstantin Knizhnik
Date:
Hi,

On 28.11.2019 10:36, guangnan he wrote:
> Hi,
> do you mean that if we don't add new compiler options, the compiler will do the loop unrolling and use SIMD automatically?

Yes, most modern compilers do it.
GCC requires the -O3 option (-O2 is not enough), but clang does it even with -O2.

But Postgres uses the more sophisticated Youngs-Cramer algorithm for calculating SUM/AVG aggregates, and there SIMD instructions do not help much.
My original assumption was that the huge difference in speed between VOPS/ISP-RAS JIT and vanilla JIT could be explained by the difference in the accumulation algorithm.

This is why I implemented the calculation of AVG in VOPS using the Youngs-Cramer algorithm.
And it certainly affects performance: Q1 with SUM aggregates is executed by VOPS almost three times faster than with AVG aggregates (700 msec vs. 2000 msec).
But even with the Youngs-Cramer algorithm, VOPS is 6 times faster than standard Postgres with JIT and 5 times faster than my in-memory storage.

> Besides the function calls, cache misses, etc., for VOPS I think the call stack is also squeezed, but the JIT-optimized code still processes rows one by one.
If we do not take into account the overhead of heap traversal and tuple packing, then the amount of calculation doesn't depend on the data model, whether vertical or horizontal.
By implementing an in-memory storage which just keeps unpacked tuples in an L2 list in the backend's private memory, and so doesn't spend time on unpacking or visibility checks,
I wanted to exclude this overhead and reach almost the same speed as VOPS.
But that doesn't happen.




-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Why JIT speed improvement is so modest?

From
Andres Freund
Date:
Hi,

On 2019-11-25 18:09:29 +0300, Konstantin Knizhnik wrote:
> I wonder why, even for this query, which seems to be an ideal use case for JIT,
> we get such a modest improvement.

I think there's a number of causes:

1) There's bottlenecks elsewhere:
   - The order of sequential scan memory accesses is bad
     https://www.postgresql.org/message-id/20161030073655.rfa6nvbyk4w2kkpk%40alap3.anarazel.de

     In my experiments, fixing that yields larger JIT improvements,
     because less time is spent stalling due to cache misses during
     tuple deforming (needing the tuple's natts at the start prevents
     out-of-order from hiding the relevant latency).


   - The transition function for floating point aggregates is pretty
     expensive. In particular, we compute the full youngs-cramer stuff
     for sum/avg, even though they aren't actually needed there. This
     has become measurably worse with
     https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e954a727f0c8872bf5203186ad0f5312f6183746
     In this case it's complicated enough apparently that the transition
     functions are too expensive to inline.

   - float4/8_accum use arrays to store the transition state. That's
     noticeably more expensive than just accessing a struct, partially
     because more checks need to be done. We really should move most,
     if not all, aggregates that use array transition states to
     "internal" type transition states. Probably with some reusable
     helpers to make it easier to write serialization / deserialization
     functions so we can continue to allow parallelism.

   - The per-row overhead on lower levels of the query is
     significant. E.g. in your profile the
     HeapTupleSatisfiesVisibility() calls (you'd largely get rid of this
     by freezing), and the hashtable overhead is quite noticeable. JITing
     expression eval doesn't fix that.

   ...


2) The code generated for JIT isn't that good. In particular, the
   external memory references included in the generated code limit the
   optimization potential quite substantially. There's also quite some
   (not just JIT) improvement potential related to the aggregation code,
   simplifying the generated expressions.

   See https://www.postgresql.org/message-id/20191023163849.sosqbfs5yenocez3%40alap3.anarazel.de
   for my attempt at improving the situation. It does measurably
   improve the situation for Q1, while still leaving a lot of further
   improvements to be done.  You'd be more than welcome to review some
   of that!


3) Plenty of crucial code is not JITed, even when expression
   related. Most crucial for Q1 is the fact that the hash computation
   for aggregates isn't JITed as a whole - when looking at hierarchical
   profiles, we spend about 1/3 of the whole query time within
   TupleHashTable*.

4) The currently required forming / deforming of tuples into minimal
   tuples when storing them in the hashagg table is *expensive*.

   We can address that partially by computing NOT NULL information for
   the tupledesc used for the hashtable (which will make JITed tuple
   deforming considerably faster, because it'll just be a reference to
   a hardcoded offset).

   We can also simplify the minimal tuple representation - historically
   it looks the way it does now because we needed minimal tuples to be
   largely compatible with heap tuples - but we don't anymore. Even just
   removing the weird offset math we do for minimal tuples would be
   beneficial, but I think we can do more than that.



> Vitesse DB reports 8x speedup on Q1,
> ISP-RAS JIT version  provides 3x speedup of Q1:

I think those measurements were done before a lot of generic
improvements to aggregation speed were done. E.g. Q1 performance
improved significantly due to the new expression evaluation engine, even
without JIT. Because the previous tree-walking expression evaluation was
so slow for many things, JITing that away obviously yielded bigger
improvements than it does now.


> VOPS provides 10x improvement of Q1.

My understanding of VOPS is that it ferries around more than one tuple
at a time. And avoids a lot of generic code paths. So that just doesn't
seem a meaningful comparison.


> In theory, by eliminating interpretation overhead, JIT should provide
> performance comparable with a vectorized executor.

I don't think that's true at all. Vectorized execution, which I assume
to mean dealing with more than one tuple at a time, is largely
orthogonal to the way expressions are evaluated. The reason that
vectorized execution is good is that it drastically increases cache
locality (by performing work that accesses related data, e.g. a buffer
page, in a tight loop, without a lot of other work happening inbetween),
that it increases the benefits of out of order execution (by removing
dependencies, as e.g. predicates for multiple tuples can be computed,
without a separate dependency on the result for each predicate
evaluation), etc.

JIT compiled expression evaluation cannot get you these benefits.


> In most programming languages using JIT compiler instead of byte-code
> interpreter provides about 10x speed improvement.

But that's with low level bytecode execution, whereas expression
evaluation uses relatively coarse ops (sometimes called "super"
opcodes).



> Below are tops of profiles (functions with more than 1% of time):
>
> JIT:

Note that just looking at a plain profile, without injecting information
about the JITed code, will yield misleading results. Without the
additional information perf will not be able to group the instructions
of the JITed code sampled to a function, leading to them each being
listed separately.

If you enable jit_profiling_support, and measure with

perf record -k 1 -o /tmp/perf.data -p 22950
(optionally with --call-graph lbr)
you then can inject the information about JITed code:
perf inject -v --jit -i /tmp/perf.data -o /tmp/perf.jit.data
and look at the result of that with
perf report -i /tmp/perf.jit.data


>   10.98%  postgres  postgres            [.] float4_accum
>    8.40%  postgres  postgres            [.] float8_accum
>    7.51%  postgres  postgres            [.] HeapTupleSatisfiesVisibility
>    5.92%  postgres  postgres            [.] ExecInterpExpr
>    5.63%  postgres  postgres            [.] tts_minimal_getsomeattrs

The fact that ExecInterpExpr, tts_minimal_getsomeattrs show up
significantly suggests that you're running a slightly older build,
without a few bugfixes. Could that be true?

Greetings,

Andres Freund



Re: Why JIT speed improvement is so modest?

From
Konstantin Knizhnik
Date:
Hi,

Thank you for your reply and explanations.
My comments are inside.

On 04.12.2019 22:43, Andres Freund wrote:
> Hi,
>
> On 2019-11-25 18:09:29 +0300, Konstantin Knizhnik wrote:
>> I wonder why, even for this query, which seems to be an ideal use case for JIT,
>> we get such a modest improvement.
> I think there's a number of causes:
>
> 1) There's bottlenecks elsewhere:
>     - The order of sequential scan memory accesses is bad
>       https://www.postgresql.org/message-id/20161030073655.rfa6nvbyk4w2kkpk%40alap3.anarazel.de
>
>       In my experiments, fixing that yields larger JIT improvements,
>       because less time is spent stalling due to cache misses during
>       tuple deforming (needing the tuple's natts at the start prevents
>       out-of-order from hiding the relevant latency).

This is why I have implemented my own in-memory table access method.
It stores tuples in unpacked format, so there should be no tuple
deforming overhead.
By the way, if somebody is interested (mostly for experiments; I do not
think that in its current state it has much practical value),
my in-memory storage implementation is here:

https://github.com/postgrespro/postgresql.builtin_pool/tree/inmem_am

>
>
>     - The transition function for floating point aggregates is pretty
>       expensive. In particular, we compute the full youngs-cramer stuff
>       for sum/avg, even though they aren't actually needed there. This
>       has become measurably worse with
>       https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e954a727f0c8872bf5203186ad0f5312f6183746
>       In this case it's complicated enough apparently that the transition
>       functions are too expensive to inline.
>
>     - float4/8_accum use arrays to store the transition state. That's
>       noticably more expensive than just accessing a struct, partially
>       because more checks needs to be done. We really should move most,
>       if not all, aggregates that use array transition states to
>       "internal" type transition states. Probably with some reusable
>       helpers to make it easier to write serialization / deserialization
>       functions so we can continue to allow parallelism.

Yes, that is true.
The profile shows that packing/unpacking the transition state takes a
substantial amount of time.
But the Youngs-Cramer stuff is not used for the SUM aggregate! It uses
float4pl for accumulation and float4 as the transition type.
When I replace AVG with SUM, the query execution time with the in-memory
storage decreases from 6 seconds to 4 seconds.
But in VOPS the improvement was even larger: 700 msec vs. 1865 msec. So the
gap in performance becomes even larger.

And the profile shows that the aggregate infrastructure overhead disappears:

    8.82%  postgres  postgres           [.] ExecScan
    4.60%  postgres  postgres           [.] fetch_input_tuple
    4.51%  postgres  postgres           [.] inmem_getnextslot
    3.48%  postgres  postgres           [.] SeqNext
    3.13%  postgres  postgres           [.] ExecAgg
    2.78%  postgres  postgres           [.] MemoryContextReset
    1.77%  postgres  perf-27660.map     [.] 0x00007fcb81ae9032
    1.52%  postgres  perf-27660.map     [.] 0x00007fcb81ae9a4a
    1.50%  postgres  perf-27660.map     [.] 0x00007fcb81ae9fa2
    1.49%  postgres  perf-27660.map     [.] 0x00007fcb81ae9dcd
    1.44%  postgres  perf-27660.map     [.] 0x00007fcb81aea205
    1.42%  postgres  perf-27660.map     [.] 0x00007fcb81ae9072
    1.31%  postgres  perf-27660.map     [.] 0x00007fcb81ae9a56
    1.22%  postgres  perf-27660.map     [.] 0x00007fcb81ae9df1
    1.22%  postgres  perf-27660.map     [.] 0x00007fcb81aea225
    1.21%  postgres  perf-27660.map     [.] 0x00007fcb81ae93e6
    1.21%  postgres  perf-27660.map     [.] 0x00007fcb81ae9fae
    1.19%  postgres  perf-27660.map     [.] 0x00007fcb81ae9c83
    1.12%  postgres  perf-27660.map     [.] 0x00007fcb81ae9e5b
    1.12%  postgres  perf-27660.map     [.] 0x00007fcb81ae9c5f
    1.05%  postgres  perf-27660.map     [.] 0x00007fcb81ae9010
    1.05%  postgres  perf-27660.map     [.] 0x00007fcb81ae987b

As far as I understand, the profile entries starting from the 7th one
correspond to JITed code.


>     - The per-row overhead on lower levels of the query is
>       significant. E.g. in your profile the
>       HeapTupleSatisfiesVisibility() calls (you'd get largely rid of this
>       by freezing), and the hashtable overhead is quite noticable. JITing
>       expression eval doesn't fix that.

Once again: my in-memory storage doesn't perform visibility checks.
This was the primary idea of my experiment: minimize the per-row storage
overhead and check whether JIT can provide performance comparable with a
vectorized engine. Unfortunately the answer was negative: the difference
with VOPS is more than 3 times, while the difference between the standard
table and the in-memory table is less than 1.5 times.

>
>     ...
>
>
> 2) The code generated for JIT isn't that good. In particular, the
>     external memory references included in the generated code limit the
>     optimization potential quite substantially. There's also quite some
>     (not just JIT) improvement potential related to the aggregation code,
>     simplifying the generated expressions.
>
>     See https://www.postgresql.org/message-id/20191023163849.sosqbfs5yenocez3%40alap3.anarazel.de
>     for my attempt at improving the situation. It does measurably
>     improve the situation for Q1, while still leaving a lot of further
>     improvements to be done.  You'd be more than welcome to review some
>     of that!
>
>
> 3) Plenty of crucial code is not JITed, even when expression
>     related. Most crucial for Q1 is the fact that the hash computation
>     for aggregates isn't JITed as a whole - when looking at hierarchical
>     profiles, we spend about 1/3 of the whole query time within
>     TupleHashTable*.
> 4) The currently required forming / deforming of tuples into minimal
>     tuples when storing them in the hashagg table is *expensive*.
>
>     We can address that partially by computing NOT NULL information for
>     the tupledesc used for the hashtable (which will make JITed tuple
>     deforming considerably faster, because it'll just be a reference to
>     an hardcoded offset).
>
>     We can also simplify the minimal tuple representation - historically
>     it looks the way it does now because we needed minimal tuples to be
>     largely compatible with heap tuples - but we don't anymore. Even just
>     removing the weird offset math we do for minimal tuples would be
>     beneficial, but I think we can do more than that.
>

Yes, this is the first thing I noticed. VOPS calls the hash function only
once per 64 rows - 64 times less often than row storage does.
This is why VOPS is 6 times faster on Q1 than vanilla Postgres and 5 times
faster than my in-memory storage.
And this is why I removed the grouping from Q1 and just calculate grand
aggregates.

>
>> VOPS provides 10x improvement of Q1.
> My understanding of VOPS is that it ferries around more than one tuple
> at a time. And avoids a lot of generic code paths. So that just doesn't
> seem a meaningful comparison.

VOPS is just one example of a vectorized executor.
It is possible to implement the things VOPS does, using customized types
or custom nodes, as in Hubert Zhang's prototype:

https://www.postgresql.org/message-id/flat/CAB0yrenxJ3FcmnLs8JqpEG3tzSZ%3DOL1MZBUh3v6dgH%2Bo70GTFA%40mail.gmail.com#df50bbf3610dc2f42cb9b54423a22111


>> In theory, by eliminating interpretation overhead, JIT should provide
>> performance comparable with a vectorized executor.
> I don't think that's true at all. Vectorized execution, which I assume
> to mean dealing with more than one tuple at a time, is largely
> orthogonal to the way expressions are evaluated. The reason that
> vectorized execution is good is that it drastically increases cache
> locality (by performing work that accesses related data, e.g. a buffer
> page, in a tight loop, without a lot of other work happening inbetween),
> that it increases the benefits of out of order execution (by removing
> dependencies, as e.g. predicates for multiple tuples can be computed,
> without a separate dependency on the result for each predicate
> evaluation), etc.
>
> JIT compiled expression evaluation cannot get you these benefits.

Yes, I know these arguments.
But please look: in Q1, for the lineitem table projection, we have just 4
float4 attributes and calculate 7 aggregates over them.
It seems to me that in this case the CPU cache is used even more
efficiently by the horizontal calculation.
At least, if you implement the corresponding query in C, the version
working with an array of structs is almost two times faster than the
version working with vertical arrays.
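Here is the kind of toy comparison I mean (an illustrative sketch, not the
exact program I measured): four running sums over N rows, computed once over
an array of structs and once over separate column arrays.

    #include <stddef.h>

    typedef struct
    {
        float l_quantity;
        float l_extendedprice;
        float l_discount;
        float l_tax;
    } Row;

    /* horizontal ("array of structs"): one pass touches all four columns per row */
    void
    sum_rows(const Row *rows, size_t n, double sums[4])
    {
        for (size_t i = 0; i < n; i++)
        {
            sums[0] += rows[i].l_quantity;
            sums[1] += rows[i].l_extendedprice;
            sums[2] += rows[i].l_discount;
            sums[3] += rows[i].l_tax;
        }
    }

    /* vertical ("struct of arrays"): one pass per column array */
    void
    sum_columns(const float *qty, const float *price, const float *disc,
                const float *tax, size_t n, double sums[4])
    {
        for (size_t i = 0; i < n; i++)
            sums[0] += qty[i];
        for (size_t i = 0; i < n; i++)
            sums[1] += price[i];
        for (size_t i = 0; i < n; i++)
            sums[2] += disc[i];
        for (size_t i = 0; i < n; i++)
            sums[3] += tax[i];
    }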


>
> Note that just looking at a plain porfile, without injecting information
> about the JITed code, will yield misleading results. Without the
> additional information perf will not be able to group the instructions
> of the JITed code sampled to a function, leading to them each being
> listed separately.
>
> If you enable jit_profiling_support, and measure with
>
> perf record -k 1 -o /tmp/perf.data -p 22950
> (optionally with --call-graph lbr)
> you then can inject the information about JITed code:
> perf inject -v --jit -i /tmp/perf.data -o /tmp/perf.jit.data
> and look at the result of that with
> perf report -i /tmp/perf.jit.data
>
Something is not working properly in my case:

root@knizhnik:~# perf record -k 1 -o /tmp/perf.data -p 7407
^C[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.452 MB /tmp/perf.data (11410 samples) ]

root@knizhnik:~# perf inject -v --jit -i /tmp/perf.data -o 
/tmp/perf.jit.data
build id event received for [kernel.kallsyms]: 
b1ef0f6204a7ec3f508b9e1536f73521c7b4b41a
build id event received for /home/knizhnik/postgresql/dist/bin/postgres: 
8ef1a41e80f043a56778e265f5badb67f1441b61
build id event received for [vdso]: b13824592e1e837368d92991b72a19437dc86a27
Looking at the vmlinux_path (8 entries long)
symsrc__init: cannot get elf header.
Using /proc/kcore for kernel object code
Using /proc/kallsyms for symbols
Using CPUID GenuineIntel-6-3C
root@knizhnik:~# perf report -i /tmp/perf.jit.data

   7.37%  postgres  postgres           [.] ExecScan
    7.23%  postgres  postgres           [.] inmem_getnextslot
    4.79%  postgres  postgres           [.] fetch_input_tuple
    4.07%  postgres  postgres           [.] SeqNext
    3.52%  postgres  postgres           [.] ExecAgg
    2.68%  postgres  postgres           [.] MemoryContextReset
    1.62%  postgres  perf-7407.map      [.] 0x00007f4591c95f02
    1.50%  postgres  perf-7407.map      [.] 0x00007f4591c95d2d
...
> The fact that ExecInterpExpr, tts_minimal_getsomeattrs show up
> significantly suggests that you're running a slightly older build,
> without a few bugfixes. Could that be true?
My branch was forked from your commit of 27 November
(ca266a069a20c32a8f0a1df982a5a67d9483bcb3).


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Why JIT speed improvement is so modest?

From
Robert Haas
Date:
On Thu, Nov 28, 2019 at 2:08 AM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> calls float4_accum for each row of T, the same query in VOPS will call
> vops_float4_avg_accumulate for each tile which contains 64 elements.
> So vops_float4_avg_accumulate is called 64 times less than float4_accum.
> And inside it contains straightforward loop:
>
>              for (i = 0; i < TILE_SIZE; i++) {
>                  sum += opd->payload[i];
>              }
>
> which can be optimized by compiler (loop unrolling, use of SIMD
> instructions,...).

Part of the reason why the compiler can optimize that so well is
probably related to the fact that it includes no overflow checks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why JIT speed improvement is so modest?

From
Konstantin Knizhnik
Date:

On 06.12.2019 18:53, Robert Haas wrote:
> On Thu, Nov 28, 2019 at 2:08 AM Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> calls float4_accum for each row of T, the same query in VOPS will call
>> vops_float4_avg_accumulate for each tile which contains 64 elements.
>> So vops_float4_avg_accumulate is called 64 times less than float4_accum.
>> And inside it contains straightforward loop:
>>
>>               for (i = 0; i < TILE_SIZE; i++) {
>>                   sum += opd->payload[i];
>>               }
>>
>> which can be optimized by compiler (loop unrolling, use of SIMD
>> instructions,...).
> Part of the reason why the compiler can optimize that so well is
> probably related to the fact that it includes no overflow checks.

Maybe it makes sense to use an aggregate transition function which does not
check for overflow, and perform this check only in the final function?
NaN and Inf values will be preserved in any case...

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Why JIT speed improvement is so modest?

From
Konstantin Knizhnik
Date:

On 06.12.2019 19:52, Konstantin Knizhnik wrote:
>
>
> On 06.12.2019 18:53, Robert Haas wrote:
>> On Thu, Nov 28, 2019 at 2:08 AM Konstantin Knizhnik
>> <k.knizhnik@postgrespro.ru> wrote:
>>> calls float4_accum for each row of T, the same query in VOPS will call
>>> vops_float4_avg_accumulate for each tile which contains 64 elements.
>>> So vops_float4_avg_accumulate is called 64 times less than 
>>> float4_accum.
>>> And inside it contains straightforward loop:
>>>
>>>               for (i = 0; i < TILE_SIZE; i++) {
>>>                   sum += opd->payload[i];
>>>               }
>>>
>>> which can be optimized by compiler (loop unrolling, use of SIMD
>>> instructions,...).
>> Part of the reason why the compiler can optimize that so well is
>> probably related to the fact that it includes no overflow checks.
>
> Maybe it makes sense to use an aggregate transition function which
> does not check for overflow, and perform this check only in the final
> function?
> NaN and Inf values will be preserved in any case...
>
I have tried commenting out check_float8_val in float4_pl/float8_pl and got
no difference in performance at all.

But if I replace the query

select
     sum(l_quantity) as sum_qty,
     sum(l_extendedprice) as sum_base_price,
     sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
     sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
     sum(l_quantity) as avg_qty,
     sum(l_extendedprice) as avg_price,
     sum(l_discount) as avg_disc,
     count(*) as count_order
from lineitem_inmem;


with

select sum(l_quantity + l_extendedprice + l_discount + l_tax) from 
lineitem_inmem;


then the time is reduced from 3686 to 1748 msec.
So at least half of the time is spent in expression evaluation and
aggregate accumulation.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Why JIT speed improvement is so modest?

From
Andres Freund
Date:
Hi,

On 2019-12-06 19:52:15 +0300, Konstantin Knizhnik wrote:
> On 06.12.2019 18:53, Robert Haas wrote:
> > On Thu, Nov 28, 2019 at 2:08 AM Konstantin Knizhnik
> > <k.knizhnik@postgrespro.ru> wrote:
> > > calls float4_accum for each row of T, the same query in VOPS will call
> > > vops_float4_avg_accumulate for each tile which contains 64 elements.
> > > So vops_float4_avg_accumulate is called 64 times less than float4_accum.
> > > And inside it contains straightforward loop:
> > > 
> > >               for (i = 0; i < TILE_SIZE; i++) {
> > >                   sum += opd->payload[i];
> > >               }
> > > 
> > > which can be optimized by compiler (loop unrolling, use of SIMD
> > > instructions,...).

I still fail to see what this has to do with the subject.


> > Part of the reason why the compiler can optimize that so well is
> > probably related to the fact that it includes no overflow checks.
> 
> Maybe it makes sense to use an aggregate transition function which does not
> check for overflow, and perform this check only in the final function?
> NaN and Inf values will be preserved in any case...

I mean, I personally think it'd be OK to skip the overflow checks for
floating point operations; they're not all that useful in practice (if
not the opposite). But if you want correct overflow detection behaviour,
you cannot just check in the final function, as you cannot distinguish
the case where an infinity/NaN from the input has been incorporated into
the transition state from the case where it appeared due to overflow.
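A trivial illustration of that point (my own example, not anything from the
Postgres sources):

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double max = DBL_MAX;
        double from_inf_input = 0.0 + INFINITY;    /* Inf was present in the input */
        double from_overflow  = max + max;         /* Inf was produced by overflow */

        /* both end up as +Inf, so a check only in the final function cannot
         * distinguish "the input contained Inf" from "the sum overflowed" */
        printf("%d %d\n", isinf(from_inf_input) != 0, isinf(from_overflow) != 0);
        return 0;
    }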

Greetings,

Andres Freund