Thread: Why is the JIT speed improvement so modest?
Right now JIT provides about a 30% improvement on the TPC-H Q1 query:
https://www.citusdata.com/blog/2018/09/11/postgresql-11-just-in-time/

I wonder why, even on this query, which seems to be an ideal use case for JIT, we get such a modest improvement. I raised this question several years ago, but at that time JIT was assumed to be at an early stage of development, and performance aspects were less critical than the required infrastructure changes. But right now JIT seems to be stable enough and is switched on by default.

Vitesse DB reports an 8x speedup on Q1; the ISP-RAS JIT version provides a 3x speedup on Q1:
https://www.pgcon.org/2017/schedule/attachments/467_PGCon%202017-05-26%2015-00%20ISPRAS%20Dynamic%20Compilation%20of%20SQL%20Queries%20in%20PostgreSQL%20Using%20LLVM%20JIT.pdf

According to this presentation, Q1 spends 6% of its time in ExecQual and 75% in ExecAgg. VOPS provides a 10x improvement on Q1.

I had a hypothesis that the difference was caused by the way the aggregates are calculated: Postgres uses the Youngs-Cramer algorithm, while both the ISP-RAS JIT version and my VOPS just accumulate results in a variable of type double. I rewrote VOPS to use the same algorithm as Postgres, but VOPS is still about 10 times faster.

Results of Q1 on scale factor 10 TPC-H data on my desktop, with parallel execution enabled:

no-JIT:                          5640 msec
JIT:                             4590 msec
VOPS:                             452 msec
VOPS + Youngs-Cramer algorithm:   610 msec

Below are the tops of the profiles (functions with more than 1% of time):

JIT:

  10.98%  postgres  postgres  [.] float4_accum
   8.40%  postgres  postgres  [.] float8_accum
   7.51%  postgres  postgres  [.] HeapTupleSatisfiesVisibility
   5.92%  postgres  postgres  [.] ExecInterpExpr
   5.63%  postgres  postgres  [.] tts_minimal_getsomeattrs
   4.35%  postgres  postgres  [.] lookup_hash_entries
   3.72%  postgres  postgres  [.] TupleHashTableHash.isra.8
   2.93%  postgres  postgres  [.] tuplehash_insert
   2.70%  postgres  postgres  [.] heapgettup_pagemode
   2.24%  postgres  postgres  [.] check_float8_array
   2.23%  postgres  postgres  [.] hash_search_with_hash_value
   2.10%  postgres  postgres  [.] ExecScan
   1.90%  postgres  postgres  [.] hash_uint32
   1.57%  postgres  postgres  [.] tts_minimal_clear
   1.53%  postgres  postgres  [.] FunctionCall1Coll
   1.47%  postgres  postgres  [.] pg_detoast_datum
   1.39%  postgres  postgres  [.] heapgetpage
   1.37%  postgres  postgres  [.] TupleHashTableMatch.isra.9
   1.35%  postgres  postgres  [.] ExecStoreBufferHeapTuple
   1.06%  postgres  postgres  [.] LookupTupleHashEntry
   1.06%  postgres  postgres  [.] AggCheckCallContext

no-JIT:

  26.82%  postgres  postgres  [.] ExecInterpExpr
  15.26%  postgres  postgres  [.] tts_buffer_heap_getsomeattrs
   8.27%  postgres  postgres  [.] float4_accum
   7.51%  postgres  postgres  [.] float8_accum
   5.26%  postgres  postgres  [.] HeapTupleSatisfiesVisibility
   2.78%  postgres  postgres  [.] TupleHashTableHash.isra.8
   2.63%  postgres  postgres  [.] tts_minimal_getsomeattrs
   2.54%  postgres  postgres  [.] lookup_hash_entries
   2.05%  postgres  postgres  [.] tuplehash_insert
   1.97%  postgres  postgres  [.] heapgettup_pagemode
   1.72%  postgres  postgres  [.] hash_search_with_hash_value
   1.57%  postgres  postgres  [.] float48mul
   1.55%  postgres  postgres  [.] check_float8_array
   1.48%  postgres  postgres  [.] ExecScan
   1.26%  postgres  postgres  [.] hash_uint32
   1.04%  postgres  postgres  [.] tts_minimal_clear
   1.00%  postgres  postgres  [.] FunctionCall1Coll

VOPS:

  44.25%  postgres  vops.so   [.] vops_avg_state_accumulate
  11.76%  postgres  vops.so   [.] vops_float4_avg_accumulate
   6.14%  postgres  postgres  [.] ExecInterpExpr
   5.89%  postgres  vops.so   [.] vops_float4_sub_lconst
   4.89%  postgres  vops.so   [.] vops_float4_mul
   4.30%  postgres  vops.so   [.] vops_int4_le_rconst
   2.57%  postgres  vops.so   [.] vops_float4_add_lconst
   2.31%  postgres  vops.so   [.] vops_count_accumulate
   2.24%  postgres  postgres  [.] tts_buffer_heap_getsomeattrs
   1.97%  postgres  postgres  [.] heap_page_prune_opt
   1.72%  postgres  postgres  [.] HeapTupleSatisfiesVisibility
   1.67%  postgres  postgres  [.] AllocSetAlloc
   1.47%  postgres  postgres  [.] hash_search_with_hash_value

In theory, by eliminating interpretation overhead, JIT should provide performance comparable with a vectorized executor: in most programming languages, using a JIT compiler instead of a byte-code interpreter gives about a 10x speed improvement. Certainly a DBMS engine is very different from a traditional interpreter, and a lot of time is spent in tuple packing/unpacking (although JIT is also used there), in heap traversal, and so on. But it is still unclear to me why, if the ISP-RAS measurements were correct and we actually spend 75% of Q1's time in aggregation, JIT was not able to increase speed on Q1 several times over. The experiment with VOPS shows that the aggregation algorithm itself is not the bottleneck, and the profiles give no answer to this question either. Any ideas?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
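[For reference, a minimal C sketch of the two accumulation styles being compared above. The state layout and names are illustrative, not the actual PostgreSQL or VOPS source; float8_accum really keeps N, Sx and Sxx in a float8[3] transition array rather than a struct.]

typedef struct
{
    double  N;      /* row count */
    double  Sx;     /* sum of values */
    double  Sxx;    /* sum of squared deviations */
} YCState;

/* Youngs-Cramer transition, as float4_accum/float8_accum compute it
 * for AVG and STDDEV: several dependent updates per input row. */
static void
yc_accum(YCState *st, double x)
{
    st->N += 1.0;
    st->Sx += x;
    if (st->N > 1.0)
    {
        double  tmp = x * st->N - st->Sx;

        st->Sxx += tmp * tmp / (st->N * (st->N - 1.0));
    }
}

/* Naive accumulation, as in the ISP-RAS JIT version and the original
 * VOPS: one addition per input row. */
static void
naive_accum(double *sum, double x)
{
    *sum += x;
}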
On Mon, Nov 25, 2019 at 9:09 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> JIT was not able to increase speed on Q1 several times over. The
> experiment with VOPS shows that the aggregation algorithm itself is
> not the bottleneck, and the profiles give no answer to this question
> either. Any ideas?

Well, in the VOPS variant around 2/3 of the time is spent in routines that are obviously aggregation; in the JIT version, it's around 20%. So this suggests that the replacement execution engine is more invasive. I would also guess (!) that the VOPS engine optimizes fewer classes of query plan. ExecScan, for example, looks to be completely optimized out in VOPS but is still utilized in the JIT engine.

I experimented with Vitesse DB a couple of years back, and this is consistent with my recollection.

merlin
On 25.11.2019 18:24, Merlin Moncure wrote:
> Well, in the VOPS variant around 2/3 of the time is spent in routines
> that are obviously aggregation; in the JIT version, it's around 20%.
> So this suggests that the replacement execution engine is more
> invasive. I would also guess (!) that the VOPS engine optimizes fewer
> classes of query plan. ExecScan, for example, looks to be completely
> optimized out in VOPS but is still utilized in the JIT engine.

The difference in the fraction of time spent in aggregate calculation is not so large (2 times vs. 10 times). I suspected that a lot of time is spent in the relation traversal code, tuple unpacking and visibility checks. To check this hypothesis, I have implemented an in-memory table access method which stores tuples in unpacked form and doesn't perform any visibility checks at all. The results were not so exciting. I had to disable parallel execution (because it is not possible for tuples stored in backend-private memory). The results are the following:

lineitem:        13736 msec
inmem_lineitem:  10044 msec
vops_lineitem:    1945 msec

The profile of inmem_lineitem is the following:

  16.79%  postgres  postgres       [.] float4_accum
  12.86%  postgres  postgres       [.] float8_accum
   5.83%  postgres  postgres       [.] TupleHashTableHash.isra.8
   4.44%  postgres  postgres       [.] lookup_hash_entries
   3.37%  postgres  postgres       [.] check_float8_array
   3.11%  postgres  postgres       [.] tuplehash_insert
   2.91%  postgres  postgres       [.] hash_uint32
   2.83%  postgres  postgres       [.] ExecScan
   2.56%  postgres  postgres       [.] inmem_getnextslot
   2.22%  postgres  postgres       [.] FunctionCall1Coll
   2.14%  postgres  postgres       [.] LookupTupleHashEntry
   1.95%  postgres  postgres       [.] TupleHashTableMatch.isra.9
   1.76%  postgres  postgres       [.] pg_detoast_datum
   1.58%  postgres  postgres       [.] AggCheckCallContext
   1.57%  postgres  postgres       [.] tts_minimal_clear
   1.35%  postgres  perf-3054.map  [.] 0x00007f558db60010
   1.23%  postgres  postgres       [.] fetch_input_tuple
   1.15%  postgres  postgres       [.] SeqNext
   1.06%  postgres  postgres       [.] ExecAgg
   1.00%  postgres  postgres       [.] tts_minimal_store_tuple

So now the fraction of time spent in aggregation has increased to 30% (vs. 20% for lineitem and 42% for vops_lineitem). It looks like the main bottleneck now is hashagg: VOPS accesses the hash table about 10 times less often (because it accumulates values for a whole tile at once), and that explains the still-large difference between vops_lineitem and inmem_lineitem.

If we remove grouping and rewrite the Q1 query to compute grand aggregates:

select
    avg(l_quantity) as sum_qty,
    avg(l_extendedprice) as sum_base_price,
    avg(l_extendedprice*(1-l_discount)) as sum_disc_price,
    avg(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    inmem_lineitem
where
    l_shipdate <= '1998-12-01';

then the results are the following:

lineitem:        9805 msec
inmem_lineitem:  6257 msec
vops_lineitem:   1865 msec

and now the profile of inmem_lineitem is:

  25.27%  postgres  postgres  [.] float4_accum
  21.86%  postgres  postgres  [.] float8_accum
   5.49%  postgres  postgres  [.] check_float8_array
   4.57%  postgres  postgres  [.] ExecScan
   2.61%  postgres  postgres  [.] AggCheckCallContext
   2.30%  postgres  postgres  [.] pg_detoast_datum
   2.10%  postgres  postgres  [.] inmem_getnextslot
   1.81%  postgres  postgres  [.] SeqNext
   1.73%  postgres  postgres  [.] fetch_input_tuple
   1.61%  postgres  postgres  [.] ExecAgg
   1.23%  postgres  postgres  [.] MemoryContextReset

But there is still a more than 3x difference with VOPS! Something is wrong here...

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
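[For context, this is roughly how such a custom table access method plugs in under the PostgreSQL 12 tableam API. The skeleton below is a hedged sketch with made-up names and all callback bodies elided; it is not the code from the experiment, whose actual source is linked later in the thread.]

#include "postgres.h"
#include "fmgr.h"
#include "access/tableam.h"

PG_MODULE_MAGIC;

static const TableAmRoutine inmem_methods = {
    .type = T_TableAmRoutine,
    /* slot callbacks, scan_begin/scan_getnextslot (cf. the
     * inmem_getnextslot entries in the profiles above), tuple
     * insert/delete callbacks, etc. would be filled in here */
};

PG_FUNCTION_INFO_V1(inmem_tableam_handler);

Datum
inmem_tableam_handler(PG_FUNCTION_ARGS)
{
    PG_RETURN_POINTER(&inmem_methods);
}

/* SQL side, run once:
 *   CREATE ACCESS METHOD inmem TYPE TABLE HANDLER inmem_tableam_handler;
 *   CREATE TABLE inmem_lineitem (...) USING inmem;
 */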
On Wed, Nov 27, 2019 at 06:38:45PM +0300, Konstantin Knizhnik wrote:
> [...]
>
> But there is still a more than 3x difference with VOPS!
> Something is wrong here...

I have no idea what VOPS does, but IIRC one of the bottlenecks compared to various column stores is our iterative execution model, which makes it difficult/impossible to vectorize operations. That's likely why the accum functions are so high in the CPU profile.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
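[To illustrate the point about the iterative model, a plain-C sketch; the function names here are hypothetical, not executor code.]

/* Hypothetical row source for the sketch. */
extern double *next_tuple(void *scan);

/* Tuple-at-a-time: one opaque call per row.  The compiler cannot see
 * across next_tuple(), so there is no unrolling and no SIMD. */
double
sum_iterative(void *scan)
{
    double  sum = 0.0;
    double *val;

    while ((val = next_tuple(scan)) != NULL)
        sum += *val;
    return sum;
}

/* Batch-at-a-time (what tile-based engines like VOPS get): a tight
 * loop over a contiguous column slice that the compiler can unroll
 * and, with the right flags, vectorize. */
double
sum_batched(const double *col, int n)
{
    double  sum = 0.0;

    for (int i = 0; i < n; i++)
        sum += col[i];
    return sum;
}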
On 27.11.2019 19:05, Tomas Vondra wrote:
> On Wed, Nov 27, 2019 at 06:38:45PM +0300, Konstantin Knizhnik wrote:
>> [...]
>
> I have no idea what VOPS does, but IIRC one of the bottlenecks compared
> to various column stores is our iterative execution model, which makes
> it difficult/impossible to vectorize operations. That's likely why the
> accum functions are so high in the CPU profile.

VOPS is doing a very simple thing: it replaces scalar types with vector types (tiles) and defines all the standard operations for them. It also provides Postgres aggregates for these types. So while for a normal Postgres table the query

select sum(x) from T;

calls float4_accum for each row of T, the same query in VOPS will call vops_float4_avg_accumulate for each tile, which contains 64 elements. So vops_float4_avg_accumulate is called 64 times less often than float4_accum. And inside, it contains a straightforward loop:

for (i = 0; i < TILE_SIZE; i++) {
    sum += opd->payload[i];
}

which can be optimized by the compiler (loop unrolling, use of SIMD instructions, ...). So it is no wonder that VOPS is faster than the Postgres executor. But Postgres now contains JIT, and it is used in this case, so the interpretation overhead of the executor should be mostly eliminated by JIT. In theory, perfect JIT code should process the rows of a horizontal data model at the same speed as a vectorized executor processes the columns of a vertical data model.

The vertical model provides significant advantages when a query affects only a small fraction of the columns. But in the case of Q1 we are calculating 8 aggregates for just 4 columns, and inmem_lineitem is actually a projection of the original lineitem table containing only the columns needed for this query. So the amount of fetched data in this case is almost the same for the horizontal and vertical data models, and CPU cache effects should not play a significant role here either. That is why it is not quite clear to me why there is still a big difference (3 times) between VOPS and the in-memory table, yet not so large a difference between the normal and in-memory tables.

Concerning the large percentage of time spent in the accumulate functions, I do not agree with you: what this query is actually doing is just calculating aggregates, and the smaller the interpretation overhead, the larger the percentage of time we should spend in the aggregate functions. Maybe the whole infrastructure of Postgres aggregates adds too large an overhead (check_float8_array, function calls, ...), and in the case of VOPS this overhead is divided by 64.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
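[A hedged sketch of the tile-at-a-time accumulator described above, in plain C with an illustrative state layout; the real VOPS source differs in its details.]

#define TILE_SIZE 64

typedef struct
{
    float   payload[TILE_SIZE];     /* one tile = 64 column values */
} vops_float4_tile;

typedef struct
{
    double  sum;
    long    count;
} vops_avg_state_sketch;

/* Invoked once per tile, i.e. 64 times less often than float4_accum;
 * the inner loop is a trivially unrollable reduction. */
static void
tile_avg_accumulate(vops_avg_state_sketch *state,
                    const vops_float4_tile *opd)
{
    double  sum = 0.0;

    for (int i = 0; i < TILE_SIZE; i++)
        sum += opd->payload[i];
    state->sum += sum;
    state->count += TILE_SIZE;
}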
Hi,

You mean that if we don't add new compiler options, the compiler will do the loop unrolling and use SIMD automatically?

Besides the function calls, cache misses, etc.: for VOPS I think the call stack is also much shallower, but the JIT-optimized code still processes rows one by one.
On Thu, Nov 28, 2019 at 3:08 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> [...]
Guang-Nan He
On 28.11.2019 10:36, guangnan he wrote:
> You mean that if we don't add new compiler options, the compiler will
> do the loop unrolling and use SIMD automatically?

Yes, most modern compilers do this. GCC requires the -O3 option (-O2 is not enough), but clang does it even with -O2.

But Postgres uses the more sophisticated Youngs-Cramer algorithm for calculating the SUM/AVG aggregates, and there SIMD instructions do not help much. My original assumption was that the huge difference in speed between VOPS/the ISP-RAS JIT and vanilla JIT could be explained by this difference in accumulation algorithm. This is why I implemented the calculation of AVG in VOPS using the Youngs-Cramer algorithm. And it certainly affects performance: Q1 with SUM aggregates is executed by VOPS almost three times faster than with AVG aggregates (700 msec vs. 2000 msec). But even with the Youngs-Cramer algorithm, VOPS is 6 times faster than standard Postgres with JIT and 5 times faster than my in-memory storage.

> Besides the function calls, cache misses, etc.: for VOPS I think the
> call stack is also much shallower, but the JIT-optimized code still
> processes rows one by one.

If we do not take into account the overhead of heap traversal and tuple packing, then the amount of calculation doesn't depend on the data model, whether vertical or horizontal. By implementing an in-memory storage which just keeps unpacked tuples in an L2 list in the backend's private memory, and so spends no time on unpacking or visibility checks, I wanted to exclude this overhead and reach almost the same speed as VOPS. But that doesn't happen.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
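[As a concrete reference for the compiler-flags point above, a sketch one can experiment with. Behaviour varies by compiler version, and note that strict IEEE semantics forbid reassociating a floating-point reduction, so -ffast-math or -fassociative-math may additionally be needed before a sum loop like this is turned into SIMD.]

/* tile.c - inspect what the compiler does with the reduction:
 *
 *   gcc   -O2 -S -fopt-info-vec tile.c         (GCC: vectorizer off)
 *   gcc   -O3 -S -fopt-info-vec tile.c         (GCC: vectorizer on)
 *   clang -O2 -S -Rpass=loop-vectorize tile.c  (clang: on at -O2)
 */
double
tile_sum(const float *payload, int n)
{
    double  sum = 0.0;

    for (int i = 0; i < n; i++)
        sum += payload[i];
    return sum;
}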
Hi,

On 2019-11-25 18:09:29 +0300, Konstantin Knizhnik wrote:
> I wonder why, even on this query, which seems to be an ideal use case
> for JIT, we get such a modest improvement.

I think there's a number of causes:

1) There's bottlenecks elsewhere:

- The order of sequential scan memory accesses is bad:
  https://www.postgresql.org/message-id/20161030073655.rfa6nvbyk4w2kkpk%40alap3.anarazel.de

  In my experiments, fixing that yields larger JIT improvements, because less time is spent stalling due to cache misses during tuple deforming (needing the tuple's natts at the start prevents out-of-order execution from hiding the relevant latency).

- The transition function for floating point aggregates is pretty expensive. In particular, we compute the full Youngs-Cramer stuff for sum/avg, even though it isn't actually needed there. This has become measurably worse with
  https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e954a727f0c8872bf5203186ad0f5312f6183746
  In this case it's apparently complicated enough that the transition functions are too expensive to inline.

- float4/8_accum use arrays to store the transition state. That's noticeably more expensive than just accessing a struct, partially because more checks need to be done. We really should move most, if not all, aggregates that use array transition states to "internal"-type transition states, probably with some reusable helpers to make it easier to write serialization/deserialization functions, so we can continue to allow parallelism.

- The per-row overhead on lower levels of the query is significant. E.g. in your profile the HeapTupleSatisfiesVisibility() calls (you'd get largely rid of this by freezing), and the hashtable overhead is quite noticeable. JITing expression eval doesn't fix that.

...

2) The code generated for JIT isn't that good. In particular, the external memory references included in the generated code limit the optimization potential quite substantially. There's also quite some (not just JIT) improvement potential related to the aggregation code, simplifying the generated expressions.

   See https://www.postgresql.org/message-id/20191023163849.sosqbfs5yenocez3%40alap3.anarazel.de for my attempt at improving the situation. It does measurably improve the situation for Q1, while still leaving a lot of further improvements to be done. You'd be more than welcome to review some of that!

3) Plenty of crucial code is not JITed, even when expression related. Most crucial for Q1 is the fact that the hash computation for aggregates isn't JITed as a whole - when looking at hierarchical profiles, we spend about 1/3 of the whole query time within TupleHashTable*.

4) The currently required forming/deforming of tuples into minimal tuples when storing them in the hashagg table is *expensive*.

   We can address that partially by computing NOT NULL information for the tupledesc used for the hashtable (which will make JITed tuple deforming considerably faster, because it'll just be a reference to a hardcoded offset).

   We can also simplify the minimal tuple representation - historically it looks the way it does now because we needed minimal tuples to be largely compatible with heap tuples, but we don't anymore. Even just removing the weird offset math we do for minimal tuples would be beneficial, and I think we can do more than that.

> Vitesse DB reports an 8x speedup on Q1; the ISP-RAS JIT version
> provides a 3x speedup on Q1:

I think those measurements were done before a lot of generic improvements to aggregation speed were made. E.g. Q1 performance improved significantly due to the new expression evaluation engine, even without JIT. Because the previous tree-walking expression evaluation was so slow for many things, JITing it away obviously yielded bigger improvements than it does now.

> VOPS provides a 10x improvement on Q1.

My understanding of VOPS is that it ferries around more than one tuple at a time and avoids a lot of generic code paths, so that just doesn't seem a meaningful comparison.

> In theory, by eliminating interpretation overhead, JIT should provide
> performance comparable with a vectorized executor.

I don't think that's true at all. Vectorized execution, which I assume to mean dealing with more than one tuple at a time, is largely orthogonal to the way expressions are evaluated. The reason vectorized execution is good is that it drastically increases cache locality (by performing work that accesses related data, e.g. a buffer page, in a tight loop, without a lot of other work happening in between), that it increases the benefits of out-of-order execution (by removing dependencies, as e.g. predicates for multiple tuples can be computed without a separate dependency on the result of each predicate evaluation), etc.

JIT-compiled expression evaluation cannot get you these benefits.

> In most programming languages, using a JIT compiler instead of a
> byte-code interpreter gives about a 10x speed improvement.

But that's with low-level bytecode execution, whereas expression evaluation uses relatively coarse ops (sometimes called "super" opcodes).

> Below are the tops of the profiles (functions with more than 1% of time):
>
> JIT:

Note that just looking at a plain profile, without injecting information about the JITed code, will yield misleading results. Without the additional information, perf will not be able to group the sampled instructions of the JITed code by function, leading to them each being listed separately.

If you enable jit_profiling_support and measure with

    perf record -k 1 -o /tmp/perf.data -p 22950

(optionally with --call-graph lbr), you can then inject the information about the JITed code:

    perf inject -v --jit -i /tmp/perf.data -o /tmp/perf.jit.data

and look at the result of that with

    perf report -i /tmp/perf.jit.data

> 10.98%  postgres  postgres  [.] float4_accum
>  8.40%  postgres  postgres  [.] float8_accum
>  7.51%  postgres  postgres  [.] HeapTupleSatisfiesVisibility
>  5.92%  postgres  postgres  [.] ExecInterpExpr
>  5.63%  postgres  postgres  [.] tts_minimal_getsomeattrs

The fact that ExecInterpExpr and tts_minimal_getsomeattrs show up significantly suggests that you're running a slightly older build, without a few bugfixes. Could that be true?

Greetings,

Andres Freund
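[To make the array-vs-struct transition state point concrete: a hedged plain-C sketch with made-up names; real transition functions use the fmgr calling convention and palloc, omitted here.]

#include <stdlib.h>

/* A hypothetical "internal"-type transition state: one allocation,
 * direct field access, no per-row array-header validation. */
typedef struct FloatAccumState
{
    double  N;
    double  Sx;     /* sum of values */
    double  Sxx;    /* sum of squared deviations */
} FloatAccumState;

static FloatAccumState *
float_accum_internal(FloatAccumState *state, double newval)
{
    if (state == NULL)              /* first row: create the state */
    {
        state = calloc(1, sizeof(FloatAccumState));
        state->N = 1.0;
        state->Sx = newval;
        return state;
    }
    state->N += 1.0;
    state->Sx += newval;
    /* ... Youngs-Cramer update of Sxx, as sketched upthread ... */
    return state;
}

/* By contrast, with today's float8[3] array state, every call first
 * has to verify that the datum really is a non-null 3-element float8
 * array before the three doubles can even be addressed - that is the
 * check_float8_array() time visible in the profiles. */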
Hi,

Thank you for your reply and explanations. My comments are inline.

On 04.12.2019 22:43, Andres Freund wrote:
> 1) There's bottlenecks elsewhere:
>
> - The order of sequential scan memory accesses is bad [...]
>   In my experiments, fixing that yields larger JIT improvements,
>   because less time is spent stalling due to cache misses during
>   tuple deforming (needing the tuple's natts at the start prevents
>   out-of-order execution from hiding the relevant latency).

This is why I have implemented my own in-memory table access method: it stores tuples in unpacked format, so there should be no tuple-deforming overhead. By the way, if somebody is interested (mostly for experiments - I do not think that in its current state it has any practical use), my in-memory storage implementation is here:
https://github.com/postgrespro/postgresql.builtin_pool/tree/inmem_am

> - The transition function for floating point aggregates is pretty
>   expensive. In particular, we compute the full Youngs-Cramer stuff
>   for sum/avg, even though it isn't actually needed there. [...]
>
> - float4/8_accum use arrays to store the transition state. That's
>   noticeably more expensive than just accessing a struct, partially
>   because more checks need to be done. [...]

Yes, that is true: the profile shows that packing/unpacking the transition state takes a substantial amount of time. But the Youngs-Cramer stuff is not used for the SUM aggregate! SUM uses float4pl for accumulation and float4 as the transition type. When I replace AVG with SUM, the query execution time with the in-memory storage decreases from 6 seconds to 4 seconds. But with VOPS the improvement was even larger: 700 msec vs. 1865 msec, so the gap in performance is even larger. And the profile shows that the aggregate infrastructure overhead disappears:

   8.82%  postgres  postgres        [.] ExecScan
   4.60%  postgres  postgres        [.] fetch_input_tuple
   4.51%  postgres  postgres        [.] inmem_getnextslot
   3.48%  postgres  postgres        [.] SeqNext
   3.13%  postgres  postgres        [.] ExecAgg
   2.78%  postgres  postgres        [.] MemoryContextReset
   1.77%  postgres  perf-27660.map  [.] 0x00007fcb81ae9032
   1.52%  postgres  perf-27660.map  [.] 0x00007fcb81ae9a4a
   1.50%  postgres  perf-27660.map  [.] 0x00007fcb81ae9fa2
   1.49%  postgres  perf-27660.map  [.] 0x00007fcb81ae9dcd
   1.44%  postgres  perf-27660.map  [.] 0x00007fcb81aea205
   1.42%  postgres  perf-27660.map  [.] 0x00007fcb81ae9072
   1.31%  postgres  perf-27660.map  [.] 0x00007fcb81ae9a56
   1.22%  postgres  perf-27660.map  [.] 0x00007fcb81ae9df1
   1.22%  postgres  perf-27660.map  [.] 0x00007fcb81aea225
   1.21%  postgres  perf-27660.map  [.] 0x00007fcb81ae93e6
   1.21%  postgres  perf-27660.map  [.] 0x00007fcb81ae9fae
   1.19%  postgres  perf-27660.map  [.] 0x00007fcb81ae9c83
   1.12%  postgres  perf-27660.map  [.] 0x00007fcb81ae9e5b
   1.12%  postgres  perf-27660.map  [.] 0x00007fcb81ae9c5f
   1.05%  postgres  perf-27660.map  [.] 0x00007fcb81ae9010
   1.05%  postgres  perf-27660.map  [.] 0x00007fcb81ae987b

As far as I understand, the entries in the profile from the 7th position onward correspond to JITed code.

> - The per-row overhead on lower levels of the query is significant.
>   E.g. in your profile the HeapTupleSatisfiesVisibility() calls (you'd
>   get largely rid of this by freezing), and the hashtable overhead is
>   quite noticeable. JITing expression eval doesn't fix that.

Once again: my in-memory storage doesn't perform visibility checks. That was the primary idea of my experiment: try to minimize per-row storage overhead and check whether JIT can provide performance comparable with a vectorized engine. Unfortunately the answer is negative: the difference from VOPS is more than 3 times, while the difference between the standard table and the in-memory table is less than 1.5 times.

> 3) Plenty of crucial code is not JITed, even when expression related.
>    Most crucial for Q1 is the fact that the hash computation for
>    aggregates isn't JITed as a whole [...]
>
> 4) The currently required forming/deforming of tuples into minimal
>    tuples when storing them in the hashagg table is *expensive*. [...]

Yes, this is the first thing I noticed. VOPS calls the hash function only once per 64 rows - 64 times less often than with row storage. This is why VOPS is 6 times faster on Q1 than vanilla Postgres and 5 times faster than my in-memory storage, and this is why I removed grouping from Q1 and just calculated grand aggregates.

>> VOPS provides a 10x improvement on Q1.
>
> My understanding of VOPS is that it ferries around more than one tuple
> at a time and avoids a lot of generic code paths, so that just doesn't
> seem a meaningful comparison.

VOPS is just an example of a vectorized executor. It is possible to implement what VOPS is doing, with customized types, using custom nodes, as in Hubert Zhang's prototype:
https://www.postgresql.org/message-id/flat/CAB0yrenxJ3FcmnLs8JqpEG3tzSZ%3DOL1MZBUh3v6dgH%2Bo70GTFA%40mail.gmail.com#df50bbf3610dc2f42cb9b54423a22111

>> In theory, by eliminating interpretation overhead, JIT should provide
>> performance comparable with a vectorized executor.
>
> I don't think that's true at all. Vectorized execution, which I assume
> to mean dealing with more than one tuple at a time, is largely
> orthogonal to the way expressions are evaluated. [...]
>
> JIT-compiled expression evaluation cannot get you these benefits.

Yes, I know these arguments. But please look: in Q1, on the lineitem table projection, we have just 4 float4 attributes and calculate 7 aggregates over them. It seems to me that in this case the CPU cache will be used even more efficiently with horizontal calculation. At least, if you implement the corresponding query in C, the version working with an array of structs is almost two times faster than the version working with vertical arrays.

> Note that just looking at a plain profile, without injecting
> information about the JITed code, will yield misleading results. [...]

Something is not working properly in my case:

root@knizhnik:~# perf record -k 1 -o /tmp/perf.data -p 7407
^C[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.452 MB /tmp/perf.data (11410 samples) ]
root@knizhnik:~# perf inject -v --jit -i /tmp/perf.data -o /tmp/perf.jit.data
build id event received for [kernel.kallsyms]: b1ef0f6204a7ec3f508b9e1536f73521c7b4b41a
build id event received for /home/knizhnik/postgresql/dist/bin/postgres: 8ef1a41e80f043a56778e265f5badb67f1441b61
build id event received for [vdso]: b13824592e1e837368d92991b72a19437dc86a27
Looking at the vmlinux_path (8 entries long)
symsrc__init: cannot get elf header.
Using /proc/kcore for kernel object code
Using /proc/kallsyms for symbols
Using CPUID GenuineIntel-6-3C
root@knizhnik:~# perf report -i /tmp/perf.jit.data
   7.37%  postgres  postgres       [.] ExecScan
   7.23%  postgres  postgres       [.] inmem_getnextslot
   4.79%  postgres  postgres       [.] fetch_input_tuple
   4.07%  postgres  postgres       [.] SeqNext
   3.52%  postgres  postgres       [.] ExecAgg
   2.68%  postgres  postgres       [.] MemoryContextReset
   1.62%  postgres  perf-7407.map  [.] 0x00007f4591c95f02
   1.50%  postgres  perf-7407.map  [.] 0x00007f4591c95d2d
   ...

> The fact that ExecInterpExpr and tts_minimal_getsomeattrs show up
> significantly suggests that you're running a slightly older build,
> without a few bugfixes. Could that be true?

I forked my branch from your commit of November 27 (ca266a069a20c32a8f0a1df982a5a67d9483bcb3).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
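[A hedged sketch of the array-of-structs vs. vertical-arrays comparison mentioned above - illustrative C, not the actual benchmark code.]

typedef struct
{
    float   quantity;
    float   extendedprice;
    float   discount;
    float   tax;
} LineItemRow;

/* Horizontal (array of structs): all four attributes of a row sit on
 * the same cache line, so one pass over the rows feeds every aggregate. */
double
sum_disc_price_horizontal(const LineItemRow *rows, int n)
{
    double  sum = 0.0;

    for (int i = 0; i < n; i++)
        sum += rows[i].extendedprice * (1.0 - rows[i].discount);
    return sum;
}

/* Vertical (one array per column), as a column store lays data out. */
double
sum_disc_price_vertical(const float *extendedprice,
                        const float *discount, int n)
{
    double  sum = 0.0;

    for (int i = 0; i < n; i++)
        sum += extendedprice[i] * (1.0 - discount[i]);
    return sum;
}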
On Thu, Nov 28, 2019 at 2:08 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> calls float4_accum for each row of T, the same query in VOPS will call
> vops_float4_avg_accumulate for each tile, which contains 64 elements.
> So vops_float4_avg_accumulate is called 64 times less often than
> float4_accum. And inside, it contains a straightforward loop:
>
> for (i = 0; i < TILE_SIZE; i++) {
>     sum += opd->payload[i];
> }
>
> which can be optimized by the compiler (loop unrolling, use of SIMD
> instructions, ...).

Part of the reason why the compiler can optimize that so well is probably related to the fact that it includes no overflow checks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
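[For reference, a simplified sketch of the check in question, modeled on PostgreSQL's check_float8_val / the old CHECKFLOATVAL macro; the error handling is stubbed out here. The per-addition branch is what gets in the vectorizer's way.]

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static void
float_overflow_error(void)
{
    /* ereport(ERROR, ...) in PostgreSQL; a stub for this sketch */
    fprintf(stderr, "value out of range: overflow\n");
    exit(1);
}

/* float8pl-style addition: the result is checked after every add. */
static inline double
float8_pl_checked(double a, double b)
{
    double  result = a + b;

    /* Inf in the result is only an error if it did not come from an
     * Inf already present in the inputs. */
    if (isinf(result) && !isinf(a) && !isinf(b))
        float_overflow_error();
    return result;
}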
On 06.12.2019 18:53, Robert Haas wrote:
> Part of the reason why the compiler can optimize that so well is
> probably related to the fact that it includes no overflow checks.

Might it make sense to use an aggregate transition function which does not check for overflow, and to perform this check only in the final function? NaN and Inf values will be preserved in any case...

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 06.12.2019 19:52, Konstantin Knizhnik wrote:
> Might it make sense to use an aggregate transition function which does
> not check for overflow, and to perform this check only in the final
> function? NaN and Inf values will be preserved in any case...

I have tried commenting out check_float8_val in float4pl/float8pl and got no difference in performance at all. But if I replace the query

select
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    sum(l_quantity) as avg_qty,
    sum(l_extendedprice) as avg_price,
    sum(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem_inmem;

with

select sum(l_quantity + l_extendedprice + l_discount + l_tax)
from lineitem_inmem;

then the time is reduced from 3686 to 1748 msec. So we spend at least half of this time in expression evaluation and aggregate accumulation.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,

On 2019-12-06 19:52:15 +0300, Konstantin Knizhnik wrote:
> On 06.12.2019 18:53, Robert Haas wrote:
>> On Thu, Nov 28, 2019 at 2:08 AM Konstantin Knizhnik
>> <k.knizhnik@postgrespro.ru> wrote:
>>> And inside, it contains a straightforward loop:
>>>
>>> for (i = 0; i < TILE_SIZE; i++) {
>>>     sum += opd->payload[i];
>>> }
>>>
>>> which can be optimized by the compiler (loop unrolling, use of SIMD
>>> instructions, ...).

I still fail to see what this has to do with the subject.

>> Part of the reason why the compiler can optimize that so well is
>> probably related to the fact that it includes no overflow checks.
>
> Might it make sense to use an aggregate transition function which does
> not check for overflow, and to perform this check only in the final
> function? NaN and Inf values will be preserved in any case...

I mean, I personally think it'd be OK to skip the overflow checks for floating point operations - they're not all that useful in practice (if not the opposite). But if you want correct overflow-detection behaviour, you cannot just check in the final function: you cannot discern the case where infinity/NaN was incorporated into the transition state from the input from the case where it arose due to overflow, etc.

Greetings,

Andres Freund
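[A small illustration of this point in plain C: by finalization time, a legitimate Inf from the input and an Inf produced by overflow look identical in the transition state.]

#include <float.h>
#include <math.h>
#include <stdio.h>

int
main(void)
{
    /* Input legitimately contains +Inf: not an overflow. */
    double  s1 = 0.0;
    s1 += INFINITY;

    /* Two finite inputs whose sum genuinely overflows. */
    double  s2 = 0.0;
    s2 += DBL_MAX;
    s2 += DBL_MAX;

    /* Both states are +Inf: a check in the final function alone
     * cannot tell these cases apart. */
    printf("s1 = %f, s2 = %f\n", s1, s2);
    return 0;
}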