Thread: JIT performance question

JIT performance question

From: Tobias Gierke

Hi,

I was playing around with PG 11.2 (i7-6700K with 16 GB RAM, on Ubuntu 18.04, 
compiled from source) and LLVM, trying a CPU-bound query that, in my 
simple mind, should benefit from JIT'ting but (almost) doesn't.

1.) Test table with 195 columns of type 'numeric':

CREATE TABLE test (data0 numeric,data1 numeric,data2 numeric,data3 
numeric,...,data192 numeric,data193 numeric,data194 numeric);

2.) Bulk-loaded (via COPY) 2 million rows of randomly generated data into 
this table (and ran VACUUM and ANALYZE afterwards)

3.) Disable parallel workers to just measure JIT performance via 'set 
max_parallel_workers = 0'

4.) Execute the query without JIT a couple of times to make sure the table 
is in memory (I had iostat running in the background to verify that no 
disk access was actually taking place):

test=# explain (analyze,buffers) SELECT SUM(data0) AS data0,SUM(data1) AS data1,SUM(data2) AS data2,...,SUM(data193) AS data193,SUM(data194) AS data194 FROM test;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
  Finalize Aggregate  (cost=815586.31..815586.32 rows=1 width=6240) (actual time=14304.058..14304.058 rows=1 loops=1)
    Buffers: shared hit=64 read=399936
    ->  Gather  (cost=815583.66..815583.87 rows=2 width=6240) (actual time=14303.925..14303.975 rows=1 loops=1)
          Workers Planned: 2
          Workers Launched: 0
          Buffers: shared hit=64 read=399936
          ->  Partial Aggregate  (cost=814583.66..814583.67 rows=1 width=6240) (actual time=14302.966..14302.966 rows=1 loops=1)
                Buffers: shared hit=64 read=399936
                ->  Parallel Seq Scan on test  (cost=0.00..408333.33 rows=833333 width=1170) (actual time=0.017..810.513 rows=2000000 loops=1)
                      Buffers: shared hit=64 read=399936
  Planning Time: 4.707 ms
  Execution Time: 14305.380 ms

5.) Now I turned on the JIT and repeated the same query a couple of 
times. This is what I got

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
  Finalize Aggregate  (cost=815586.31..815586.32 rows=1 width=6240) (actual time=15558.558..15558.558 rows=1 loops=1)
    Buffers: shared hit=128 read=399872
    ->  Gather  (cost=815583.66..815583.87 rows=2 width=6240) (actual time=15558.450..15558.499 rows=1 loops=1)
          Workers Planned: 2
          Workers Launched: 0
          Buffers: shared hit=128 read=399872
          ->  Partial Aggregate  (cost=814583.66..814583.67 rows=1 width=6240) (actual time=15557.541..15557.541 rows=1 loops=1)
                Buffers: shared hit=128 read=399872
                ->  Parallel Seq Scan on test  (cost=0.00..408333.33 rows=833333 width=1170) (actual time=0.020..941.925 rows=2000000 loops=1)
                      Buffers: shared hit=128 read=399872
  Planning Time: 11.230 ms
  JIT:
    Functions: 6
    Options: Inlining true, Optimization true, Expressions true, Deforming true
    Timing: Generation 15.707 ms, Inlining 4.688 ms, Optimization 652.021 ms, Emission 939.556 ms, Total 1611.973 ms
  Execution Time: 15576.516 ms
(16 rows)
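
For anyone wanting to reproduce this, steps 1 to 5 can be sketched in SQL 
roughly as follows. This is only a sketch under a few assumptions: the 
data was really loaded via COPY, so the INSERT ... SELECT below is a 
stand-in; the value range of the random data is made up (so table size 
and timings will differ); and 'SET jit' is just one way of switching JIT 
on and off for a session.

```sql
-- Steps 1 and 2: build the 195-column table and fill it with random values.
DO $$
DECLARE
    cols text;  -- "data0 numeric, data1 numeric, ..., data194 numeric"
    vals text;  -- one random-value expression per column
BEGIN
    SELECT string_agg(format('data%s numeric', i), ', ' ORDER BY i),
           -- Assumption: the value range and scale are arbitrary.
           string_agg('round((random() * 1000)::numeric, 2)', ', ')
      INTO cols, vals
      FROM generate_series(0, 194) AS g(i);

    EXECUTE format('CREATE TABLE test (%s)', cols);

    -- Stand-in for the COPY used originally: generate the rows server-side.
    EXECUTE format(
        'INSERT INTO test SELECT %s FROM generate_series(1, 2000000)', vals);
END
$$;

VACUUM ANALYZE test;

-- Step 3, as described above.
SET max_parallel_workers = 0;

-- Step 4: run the query with JIT disabled (repeat a couple of times).
SET jit = off;
-- EXPLAIN (ANALYZE, BUFFERS) SELECT SUM(data0) AS data0, ... FROM test;

-- Step 5: same query with JIT enabled.
SET jit = on;
-- EXPLAIN (ANALYZE, BUFFERS) SELECT SUM(data0) AS data0, ... FROM test;
```

The 'jit' setting only has an effect on a server built with LLVM support.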

So (ignoring the time for JIT'ting itself) this yields only a ~2-3% 
performance increase. Is this because my query is just too simple to 
actually benefit much, meaning the code path for the non-JIT case is 
already fairly optimal? Or does JIT'ting only have a large impact on the 
filter/WHERE part of the query, and not so much on aggregation / tuple 
deforming?
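
Spelled out, the estimate comes from subtracting the JIT "Total" time 
from the execution time of the JIT run and comparing that to the run 
without JIT. Treating the JIT total as overhead that can simply be 
subtracted is an assumption, and these are single runs, so take it as a 
rough figure:

```sql
SELECT 15576.516 - 1611.973
           AS jit_exec_minus_jit_ms,            -- 13964.543
       round(100 * (1 - (15576.516 - 1611.973) / 14305.380), 1)
           AS pct_less_time_ignoring_jit,       -- 2.4
       round(100 * (15576.516 / 14305.380 - 1), 1)
           AS pct_more_time_including_jit;      -- 8.9
```

So with the JIT overhead included, the JIT run is actually slower here.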

Thanks,
Tobias









Re: JIT performance question

From: Andres Freund

Hi,

On 2019-03-06 18:16:08 +0100, Tobias Gierke wrote:
> I was playing around with PG 11.2 (i7-6700K with 16 GB RAM, on Ubuntu 18.04,
> compiled from source) and LLVM, trying a CPU-bound query that, in my simple
> mind, should benefit from JIT'ting but (almost) doesn't.
> 
> 1.) Test table with 195 columns of type 'numeric':
> 
> CREATE TABLE test (data0 numeric,data1 numeric,data2 numeric,data3
> numeric,...,data192 numeric,data193 numeric,data194 numeric);
> 
> 2.) Bulk-loaded (via COPY) 2 million rows of randomly generated data into this
> table (and ran VACUUM and ANALYZE afterwards)
> 
> 3.) Disable parallel workers to just measure JIT performance via 'set
> max_parallel_workers = 0'

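FWIW, it's better to do that via max_parallel_workers_per_gather in most
cases, because creating a parallel plan and then not getting any workers
for it has its own consequences (your plans still contain a Gather node
with "Workers Planned: 2" and "Workers Launched: 0").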
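
A sketch of what that looks like in a session. With
max_parallel_workers_per_gather at 0 the planner should not produce a
Gather node at all, instead of planning for workers that never get
launched:

```sql
-- Stop the planner from choosing parallel plans in this session.
SET max_parallel_workers_per_gather = 0;

-- Put max_parallel_workers back to its configured value if it was changed.
RESET max_parallel_workers;

-- Re-run the test query; the plan should now be a plain Aggregate
-- on top of a Seq Scan.
-- EXPLAIN (ANALYZE, BUFFERS) SELECT SUM(data0) AS data0, ... FROM test;
```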


> 4.) Execute query without JIT a couple of times to make sure table is in
> memory (I had iostat running in the background to verify that actually no
> disk access was taking place):

There are definitely accesses outside of PG happening here :( (note the
"read=399936" in the Buffers lines). They're probably served from a cache
at the IO level, but without track_io_timing that's hard to confirm.
Presumably that's caused by the sequential scan ring buffers. I found
that forcing the pages to be read in using pg_prewarm gives more
measurable results.
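
Roughly like this, as a sketch. It assumes the pg_prewarm contrib module
is installed and that shared_buffers is large enough to hold the whole
table (the plans show about 400000 buffers, i.e. around 3 GB at the
default 8 kB block size):

```sql
-- Report time spent in read/write calls in EXPLAIN (ANALYZE, BUFFERS) output.
-- Changing this in a session requires superuser privileges.
SET track_io_timing = on;

-- pg_prewarm is a contrib extension; it has to be created once per database.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Read the table's main fork into shared buffers.
-- Returns the number of blocks that were prewarmed.
SELECT pg_prewarm('test');

-- Sanity check: compare the buffer pool size with the table size.
SHOW shared_buffers;
SELECT pg_size_pretty(pg_relation_size('test'));
```

After that, the Buffers lines should show mostly "shared hit" instead of
"read" if the table fits.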


> So (ignoring the time for JIT'ting itself) this yields only ~2-3%
> performance increase... is this because my query is just too simple to
> actually benefit a lot, meaning the code path for the 'un-JIT' case is
> already fairly optimal ? Or does JIT'ting actually only have a large impact
> on the filter/WHERE part of the query but not so much on aggregation / tuple
> deforming ?

It's hard to know precisely without running a profile of the
workload. My suspicion is that the bottleneck in this query is the use
of numeric, which has fairly slow operations, including aggregation. And
they're too complicated to be inlined.
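
Short of a real profile, a quick way to get a feeling for the per-type
cost is to sum the same values stored once as bigint and once as numeric.
This is only a rough sketch; the table name, column names and row count
are made up for illustration:

```sql
-- Hypothetical scratch table: the same integers as bigint and as numeric.
CREATE TEMP TABLE sum_cost AS
SELECT i::bigint AS b, i::numeric AS n
  FROM generate_series(1, 5000000) AS g(i);

ANALYZE sum_cost;

-- Compare the actual time of the Aggregate node between the two runs,
-- ideally with JIT both off and on.
EXPLAIN (ANALYZE) SELECT sum(b) FROM sum_cost;
EXPLAIN (ANALYZE) SELECT sum(n) FROM sum_cost;
```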

Generally there's definitely an advantage in JITing aggregation.

There are a lot of further improvements on the table with better JIT code
generation, I just haven't gotten around to implementing those :(

Greetings,

Andres Freund


Re: JIT performance question

From: Tobias Gierke

On 06.03.19 18:42, Andres Freund wrote:
>
> It's hard to know precisely without running a profile of the
> workload. My suspicion is that the bottleneck in this query is the use
> of numeric, which has fairly slow operations, including aggregation. And
> they're too complicated to be inlined.
>
> Generally there's definitely an advantage in JITing aggregation.
>
> There are a lot of further improvements on the table with better JIT code
> generation, I just haven't gotten around to implementing those :(

Thanks for the quick response! I think you're onto something with the 
numeric type. I replaced it with bigint and repeated my test, and now I 
get a nice ~40% speedup (I'm again intentionally ignoring the cost of 
JIT'ting here, as I assume a future PostgreSQL version will have some 
kind of caching for the generated code):

Without JIT:

                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=1395000.49..1395000.50 rows=1 width=6240) (actual time=6023.436..6023.436 rows=1 loops=1)
    Buffers: shared hit=256 read=399744
    I/O Timings: read=475.135
    ->  Seq Scan on test  (cost=0.00..420000.00 rows=2000000 width=1560) (actual time=0.035..862.424 rows=2000000 loops=1)
          Buffers: shared hit=256 read=399744
          I/O Timings: read=475.135
  Planning Time: 0.574 ms
  Execution Time: 6024.298 ms
(8 rows)


With JIT:

                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=1395000.49..1395000.50 rows=1 width=6240) (actual time=4840.064..4840.064 rows=1 loops=1)
    Buffers: shared hit=320 read=399680
    I/O Timings: read=493.679
    ->  Seq Scan on test  (cost=0.00..420000.00 rows=2000000 width=1560) (actual time=0.090..847.458 rows=2000000 loops=1)
          Buffers: shared hit=320 read=399680
          I/O Timings: read=493.679
  Planning Time: 1.414 ms
  JIT:
    Functions: 3
    Options: Inlining true, Optimization true, Expressions true, Deforming true
    Timing: Generation 19.747 ms, Inlining 10.281 ms, Optimization 222.619 ms, Emission 362.862 ms, Total 615.509 ms
  Execution Time: 4862.113 ms
(12 rows)
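
For reference, this is the arithmetic behind the ~40%, again subtracting 
the JIT "Total" from the execution time of the JIT run. That subtraction 
is an approximation, and these are single runs:

```sql
SELECT 4862.113 - 615.509
           AS jit_exec_minus_jit_ms,          -- 4246.604
       round(100 * (6024.298 / (4862.113 - 615.509) - 1), 1)
           AS pct_speedup_ignoring_jit,       -- 41.9
       round(100 * (1 - (4862.113 - 615.509) / 6024.298), 1)
           AS pct_less_time_ignoring_jit,     -- 29.5
       round(100 * (6024.298 / 4862.113 - 1), 1)
           AS pct_speedup_including_jit;      -- 23.9
```

So "40%" is the throughput gain with the JIT time excluded; expressed as 
a reduction in run time it is about 30%, and about 24% faster with the 
JIT time included.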

Cheers,
Tobias



Re: JIT performance question

From: Andres Freund

Hi,

On 2019-03-06 19:21:33 +0100, Tobias Gierke wrote:
> On 06.03.19 18:42, Andres Freund wrote:
> > 
> > It's hard to know precisely without running a profile of the
> > workload. My suspicion is that the bottleneck in this query is the use
> > of numeric, which has fairly slow operations, including aggregation. And
> > they're too complicated to be inlined.
> > 
> > Generally there's definitely an advantage in JITing aggregation.
> > 
> > There are a lot of further improvements on the table with better JIT code
> > generation, I just haven't gotten around to implementing those :(
> 
> Thanks for the quick response ! I think you're onto something with the
> numeric type. I replaced it with bigint and repeated my test and now I get a
> nice 40% speedup

Cool. It'd really be worthwhile for somebody to work on adding fast paths
to the numeric code...

Greetings,

Andres Freund