Re: Why JIT speed improvement is so modest? - Mailing list pgsql-hackers

From guangnan he
Subject Re: Why JIT speed improvement is so modest?
Date
Msg-id CAMXQC6Wy+bme0WxAKn+Wd87urVtDDEg=8Q9ZuxNYp7f53K+ROw@mail.gmail.com
In response to Re: Why JIT speed improvement is so modest?  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Responses Re: Why JIT speed improvement is so modest?  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List pgsql-hackers
Hi,
Do you mean that, even without adding new compiler options, the compiler will do the loop unrolling and use SIMD automatically?
Besides the function calls, cache misses, etc., I think the call stack is also much shallower for VOPS, but the JIT-optimized code still processes rows one by one.

Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote on Thu, Nov 28, 2019 at 3:08 PM:


On 27.11.2019 19:05, Tomas Vondra wrote:
> On Wed, Nov 27, 2019 at 06:38:45PM +0300, Konstantin Knizhnik wrote:
>>
>>
>> On 25.11.2019 18:24, Merlin Moncure wrote:
>>> On Mon, Nov 25, 2019 at 9:09 AM Konstantin Knizhnik
>>> <k.knizhnik@postgrespro.ru> wrote:
>>>> Why was JIT not able to significantly (several times) increase speed on
>>>> the Q1 query?
>>>> The experiment with VOPS shows that the aggregation algorithm itself
>>>> is not a bottleneck.
>>>> The profile also gives no answer to this question.
>>>> Any ideas?
>>> Well, in the VOPS variant around 2/3 of the time is spent in routines
>>> that are obviously aggregation.  In the JIT version, it's around 20%.
>>> So this suggests that the replacement execution engine is more
>>> invasive.  I would also guess (!) that the VOPS engine optimizes fewer
>>> classes of query plan.  ExecScan, for example, looks to be completely
>>> optimized out in VOPS but is still utilized in the JIT engine.
>>
>> The difference in the fraction of time spent in aggregate calculation is
>> not so large (2 times vs. 10 times).
>> I suspected that a lot of time is spent in relation traversal code,
>> tuple unpacking, and visibility checks.
>> To check this hypothesis I implemented an in-memory table access
>> method which stores tuples in unpacked form and
>> doesn't perform any visibility checks at all.
>> The results were not so exciting. I had to disable parallel execution
>> (because it is not possible for tuples stored in backend-private
>> memory).
>> The results are the following:
>>
>> lineitem:        13736 msec
>> inmem_lineitem:  10044 msec
>> vops_lineitem:    1945 msec
>>
>> The profile of inmem_lineitem is the following:
>>
>>   16.79%  postgres  postgres             [.] float4_accum
>>   12.86%  postgres  postgres             [.] float8_accum
>>    5.83%  postgres  postgres             [.] TupleHashTableHash.isra.8
>>    4.44%  postgres  postgres             [.] lookup_hash_entries
>>    3.37%  postgres  postgres             [.] check_float8_array
>>    3.11%  postgres  postgres             [.] tuplehash_insert
>>    2.91%  postgres  postgres             [.] hash_uint32
>>    2.83%  postgres  postgres             [.] ExecScan
>>    2.56%  postgres  postgres             [.] inmem_getnextslot
>>    2.22%  postgres  postgres             [.] FunctionCall1Coll
>>    2.14%  postgres  postgres             [.] LookupTupleHashEntry
>>    1.95%  postgres  postgres             [.] TupleHashTableMatch.isra.9
>>    1.76%  postgres  postgres             [.] pg_detoast_datum
>>    1.58%  postgres  postgres             [.] AggCheckCallContext
>>    1.57%  postgres  postgres             [.] tts_minimal_clear
>>    1.35%  postgres  perf-3054.map        [.] 0x00007f558db60010
>>    1.23%  postgres  postgres             [.] fetch_input_tuple
>>    1.15%  postgres  postgres             [.] SeqNext
>>    1.06%  postgres  postgres             [.] ExecAgg
>>    1.00%  postgres  postgres             [.] tts_minimal_store_tuple
>>
>> So now the fraction of time spent in aggregation has increased to 30% (vs.
>> 20% for lineitem and 42% for vops_lineitem).
>> It looks like the main bottleneck now is hashagg. VOPS accesses the hash
>> table about 10 times less often (because it accumulates values for the
>> whole tile).
>> And this explains the still large difference between vops_lineitem and
>> inmem_lineitem.
>>
>> If we remove grouping and rewrite the Q1 query as:
>> select
>>     avg(l_quantity) as sum_qty,
>>     avg(l_extendedprice) as sum_base_price,
>>     avg(l_extendedprice*(1-l_discount)) as sum_disc_price,
>>     avg(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
>>     avg(l_quantity) as avg_qty,
>>     avg(l_extendedprice) as avg_price,
>>     avg(l_discount) as avg_disc,
>>     count(*) as count_order
>> from
>>     inmem_lineitem
>> where
>>     l_shipdate <= '1998-12-01';
>>
>> then the results are the following:
>> lineitem:        9805 msec
>> inmem_lineitem:  6257 msec
>> vops_lineitem:   1865 msec
>>
>> and now the profile of inmem_lineitem is:
>>
>>   25.27%  postgres  postgres           [.] float4_accum
>>   21.86%  postgres  postgres           [.] float8_accum
>>    5.49%  postgres  postgres           [.] check_float8_array
>>    4.57%  postgres  postgres           [.] ExecScan
>>    2.61%  postgres  postgres           [.] AggCheckCallContext
>>    2.30%  postgres  postgres           [.] pg_detoast_datum
>>    2.10%  postgres  postgres           [.] inmem_getnextslot
>>    1.81%  postgres  postgres           [.] SeqNext
>>    1.73%  postgres  postgres           [.] fetch_input_tuple
>>    1.61%  postgres  postgres           [.] ExecAgg
>>    1.23%  postgres  postgres           [.] MemoryContextReset
>>
>> But there is still a more than 3 times difference with VOPS!
>> Something is wrong here...
>>
>
> I have no idea what VOPS does, but IIRC one of the bottlenecks compared
> to various column stores is our iterative execution model, which makes
> it difficult/impossible to vectorize operations. That's likely why the
> accum functions are so high in the CPU profile.
>
> regards
>

VOPS is doing a very simple thing: it replaces scalar types with vector
types (tiles) and defines all the standard operations for them.
It also provides Postgres aggregates for these types.
So while for a normal Postgres table the query

select sum(x) from T;

calls float4_accum for each row of T, the same query in VOPS will call
vops_float4_avg_accumulate once per tile, which contains 64 elements.
So vops_float4_avg_accumulate is called 64 times less often than float4_accum.
And inside it is a straightforward loop:

             for (i = 0; i < TILE_SIZE; i++) {
                 sum += opd->payload[i];
             }

which the compiler can optimize (loop unrolling, use of SIMD
instructions, ...).
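For what it's worth, whether extra compiler options are needed depends on the
operation: a minimal standalone sketch (the names here are illustrative, not
the actual VOPS code) compiles to unrolled SIMD code with gcc -O3 -ffast-math
or clang -O2 -ffast-math, but a plain -O2/-O3 build keeps this particular
reduction scalar, because reordering floating-point additions changes rounding
and the compiler will not reassociate them without permission:

    #include <stdio.h>

    #define TILE_SIZE 64

    /* Illustrative tile-reduction loop, modeled on the VOPS loop above. */
    static double
    tile_sum(const float *payload)
    {
        double  sum = 0.0;

        /* gcc -O3 -ffast-math (or clang -O2 -ffast-math) typically
         * vectorizes this reduction; without -ffast-math it stays
         * scalar.  Verify with -fopt-info-vec (gcc) or
         * -Rpass=loop-vectorize (clang). */
        for (int i = 0; i < TILE_SIZE; i++)
            sum += payload[i];
        return sum;
    }

    int
    main(void)
    {
        float   payload[TILE_SIZE];

        for (int i = 0; i < TILE_SIZE; i++)
            payload[i] = (float) i;
        printf("sum = %f\n", tile_sum(payload));
        return 0;
    }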
So it is no wonder that VOPS is faster than the Postgres executor.
But Postgres now contains JIT, and it is used in this case.
So the interpretation overhead of the executor should be mostly eliminated by JIT.
In theory, perfect JIT code should process rows of a horizontal data model
at the same speed as a vector executor processing columns of a vertical data
model.
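To make the two shapes concrete, here is a hedged sketch (all type and
function names are hypothetical, not Postgres or VOPS APIs) of row-at-a-time
accumulation through an indirect transition-function call versus
tile-at-a-time accumulation with the loop visible to the compiler:

    /* Hypothetical sketch; row_t, tile_t and accum_fn are illustrations. */
    typedef struct { float l_quantity; } row_t;
    typedef struct { float payload[64]; } tile_t;
    typedef double (*accum_fn) (double state, float value);

    /* Horizontal model: one indirect call per row, opaque to the compiler. */
    double
    sum_rows(accum_fn accum, const row_t *rows, int nrows)
    {
        double  state = 0.0;

        for (int i = 0; i < nrows; i++)
            state = accum(state, rows[i].l_quantity);
        return state;
    }

    /* Vertical model: one loop per tile of 64 values; the inner loop can
     * be unrolled and vectorized as discussed above. */
    double
    sum_tiles(const tile_t *tiles, int ntiles)
    {
        double  state = 0.0;

        for (int t = 0; t < ntiles; t++)
            for (int i = 0; i < 64; i++)
                state += tiles[t].payload[i];
        return state;
    }

A perfect JIT would turn the first shape into the second; the measurements
above suggest it only gets part of the way there.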
The vertical model provides significant advantages when a query touches only
a small fraction of the columns.
But in the case of Q1 we are calculating 8 aggregates over just 4 columns.
And inmem_lineitem is actually a projection of the original lineitem table
containing only the columns needed for this query.
So the amount of fetched data in this case is almost the same for the
horizontal and vertical data models.
Effects of CPU caches should also not play a significant role in this case.
That is why it is not quite clear to me why there is still a big
difference (3 times) between VOPS and the in-memory table, and not so large a
difference between the normal and in-memory tables.

Concerning the large percentage of time spent in the accumulate functions - I
do not agree with you. What this query is actually doing is just calculating
aggregates.
The smaller the interpretation overhead, the larger the percentage of time we
should spend in the aggregate functions.
Maybe the whole infrastructure of Postgres aggregates adds too much
overhead (check_float8_array, function calls, ...), and in the case of VOPS
this overhead is divided by 64.
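As a rough illustration of that last point, here is a paraphrased sketch of a
float8_accum-style transition (the real function in
src/backend/utils/adt/float.c also goes through the fmgr calling convention,
check_float8_array and aggregate-context checks, which is exactly the overhead
in question): the fixed per-call costs that Postgres pays once per row, VOPS
pays once per 64-value tile:

    /* Paraphrased transition state {N, sum(X), sum(X*X)}; names and the
     * placement of checks are illustrative, not the actual backend code. */
    typedef struct
    {
        double  n;
        double  sx;
        double  sxx;
    } accum_state;

    /* Per-row path: validation/call overhead is paid for every value. */
    void
    accum_row(accum_state *state, double newval)
    {
        /* stands in for check_float8_array() + function-call overhead */
        state->n += 1.0;
        state->sx += newval;
        state->sxx += newval * newval;
    }

    /* Per-tile path: the same fixed costs are paid once per 64 values,
     * so the per-value overhead is divided by 64. */
    void
    accum_tile(accum_state *state, const double *payload)
    {
        for (int i = 0; i < 64; i++)
        {
            state->sx += payload[i];
            state->sxx += payload[i] * payload[i];
        }
        state->n += 64.0;
    }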


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





--
Guang-Nan He
