Thread: JIT performance question
Hi,

I was playing around with PG11.2 (i7-6700K with 16GB RAM, on Ubuntu 18.04, compiled from sources) and LLVM, trying a CPU-bound query that in my simple mind should benefit from JIT'ting but (almost) doesn't.

1.) Created a test table with 195 columns of type 'numeric':

CREATE TABLE test (data0 numeric,data1 numeric,data2 numeric,data3 numeric,...,data192 numeric,data193 numeric,data194 numeric);

2.) Bulk-loaded (via COPY) 2 million rows of randomly generated data into this table (and ran VACUUM & ANALYZE afterwards).

3.) Disabled parallel workers via 'set max_parallel_workers = 0' so that only JIT performance is measured.

4.) Executed the query without JIT a couple of times to make sure the table is in memory (I had iostat running in the background to verify that no disk access was actually taking place):

test=# explain (analyze,buffers) SELECT SUM(data0) AS data0,SUM(data1) AS data1,SUM(data2) AS data2,...,SUM(data193) AS data193,SUM(data194) AS data194 FROM test;

                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=815586.31..815586.32 rows=1 width=6240) (actual time=14304.058..14304.058 rows=1 loops=1)
   Buffers: shared hit=64 read=399936
   ->  Gather  (cost=815583.66..815583.87 rows=2 width=6240) (actual time=14303.925..14303.975 rows=1 loops=1)
         Workers Planned: 2
         Workers Launched: 0
         Buffers: shared hit=64 read=399936
         ->  Partial Aggregate  (cost=814583.66..814583.67 rows=1 width=6240) (actual time=14302.966..14302.966 rows=1 loops=1)
               Buffers: shared hit=64 read=399936
               ->  Parallel Seq Scan on test  (cost=0.00..408333.33 rows=833333 width=1170) (actual time=0.017..810.513 rows=2000000 loops=1)
                     Buffers: shared hit=64 read=399936
 Planning Time: 4.707 ms
 Execution Time: 14305.380 ms

5.) Turned on JIT and repeated the same query a couple of times.
This is what I got:

                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=815586.31..815586.32 rows=1 width=6240) (actual time=15558.558..15558.558 rows=1 loops=1)
   Buffers: shared hit=128 read=399872
   ->  Gather  (cost=815583.66..815583.87 rows=2 width=6240) (actual time=15558.450..15558.499 rows=1 loops=1)
         Workers Planned: 2
         Workers Launched: 0
         Buffers: shared hit=128 read=399872
         ->  Partial Aggregate  (cost=814583.66..814583.67 rows=1 width=6240) (actual time=15557.541..15557.541 rows=1 loops=1)
               Buffers: shared hit=128 read=399872
               ->  Parallel Seq Scan on test  (cost=0.00..408333.33 rows=833333 width=1170) (actual time=0.020..941.925 rows=2000000 loops=1)
                     Buffers: shared hit=128 read=399872
 Planning Time: 11.230 ms
 JIT:
   Functions: 6
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 15.707 ms, Inlining 4.688 ms, Optimization 652.021 ms, Emission 939.556 ms, Total 1611.973 ms
 Execution Time: 15576.516 ms
(16 rows)

So (ignoring the time for JIT'ting itself) this yields only a ~2-3% performance increase... Is this because my query is just too simple to actually benefit a lot, meaning the code path for the non-JIT case is already fairly optimal? Or does JIT'ting only have a large impact on the filter/WHERE part of the query, but not so much on aggregation / tuple deforming?

Thanks,
Tobias
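[Editor's note: the thread doesn't show how the 195-column table was actually built or populated. As a minimal sketch, both steps 1.) and 2.) can be generated instead of typed out by hand, using psql's \gexec; the INSERT here stands in for the original COPY-based bulk load:

```sql
-- Build the CREATE TABLE statement for data0..data194 and execute it.
SELECT format('CREATE TABLE test (%s)',
              string_agg(format('data%s numeric', i), ', ' ORDER BY i))
FROM generate_series(0, 194) AS i \gexec

-- Fill it with 2 million rows of random numerics (simplification of the
-- original COPY-based load).
SELECT format('INSERT INTO test SELECT %s FROM generate_series(1, 2000000)',
              string_agg('random()::numeric', ', '))
FROM generate_series(0, 194) AS i \gexec

VACUUM ANALYZE test;
```

\gexec (available since psql 9.6) runs each result row of the preceding query as a new SQL statement, so the column list never has to be written out. -Ed.]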
Hi,

On 2019-03-06 18:16:08 +0100, Tobias Gierke wrote:
> I was playing around with PG11.2 (i6700k with 16GB RAM, on Ubuntu 18.04,
> compiled from sources) and LLVM, trying a CPU-bound query that in my simple
> mind should benefit from JIT'ting but (almost) doesn't.
>
> 1.) Test table with 195 columns of type 'numeric':
>
> CREATE TABLE test (data0 numeric,data1 numeric,data2 numeric,data3
> numeric,...,data192 numeric,data193 numeric,data194 numeric);
>
> 2.) bulk-loaded (via COPY) 2 mio. rows of randomly generated data into this
> table (and ran vacuum & analyze afterwards)
>
> 3.) Disable parallel workers to just measure JIT performance via 'set
> max_parallel_workers = 0'

FWIW, it's better to do that via max_parallel_workers_per_gather in most
cases, because creating a parallel plan and then not using it will have its
own consequences.

> 4.) Execute query without JIT a couple of times to make sure table is in
> memory (I had iostat running in the background to verify that actually no
> disk access was taking place):

There are definitely accesses outside of PG happening here :(. Probably
cached at the IO level, but without track_io_timing that's hard to confirm.
Presumably that's caused by the sequential scan ringbuffers. I found that
forcing the pages to be read in using pg_prewarm gives more measurable
results.

> So (ignoring the time for JIT'ting itself) this yields only ~2-3%
> performance increase... is this because my query is just too simple to
> actually benefit a lot, meaning the code path for the 'un-JIT' case is
> already fairly optimal ? Or does JIT'ting actually only have a large impact
> on the filter/WHERE part of the query but not so much on aggregation / tuple
> deforming ?

It's hard to know precisely without running a profile of the workload. My
suspicion is that the bottleneck in this query is the use of numeric, which
has fairly slow operations, including aggregation. And they're too
complicated to be inlined.
Generally, there's definitely an advantage in JITing aggregation.

There are a lot of further improvements on the table with better JIT code
generation; I just haven't gotten around to implementing those :(

Greetings,

Andres Freund
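[Editor's note: the two measurement suggestions above translate into a short session setup; a minimal sketch, assuming the contrib pg_prewarm extension is installed:

```sql
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Force the whole table into shared_buffers instead of relying on
-- repeated sequential scans (which use small ring buffers).
SELECT pg_prewarm('test');

-- Keep the planner from producing a parallel plan at all, rather than
-- planning one and then launching no workers.
SET max_parallel_workers_per_gather = 0;

-- Make EXPLAIN (ANALYZE, BUFFERS) report actual read timings, to confirm
-- no real I/O is happening.
SET track_io_timing = on;
```

-Ed.]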
On 06.03.19 18:42, Andres Freund wrote:
>
> It's hard to know precisely without running a profile of the
> workload. My suspicion is that the bottleneck in this query is the use
> of numeric, which has fairly slow operations, including aggregation. And
> they're too complicated to be inlined.
>
> Generally there's definitely advantage in JITing aggregation.
>
> There's a lot of further improvements on the table with better JIT code
> generation, I just haven't gotten around implementing those :(

Thanks for the quick response! I think you're onto something with the
numeric type. I replaced it with bigint and repeated my test, and now I get
a nice 40% speedup (I'm again intentionally ignoring the cost of JIT'ting
here, as I assume a future PostgreSQL version will have some kind of caching
for the generated code):

Without JIT:

                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1395000.49..1395000.50 rows=1 width=6240) (actual time=6023.436..6023.436 rows=1 loops=1)
   Buffers: shared hit=256 read=399744
   I/O Timings: read=475.135
   ->  Seq Scan on test  (cost=0.00..420000.00 rows=2000000 width=1560) (actual time=0.035..862.424 rows=2000000 loops=1)
         Buffers: shared hit=256 read=399744
         I/O Timings: read=475.135
 Planning Time: 0.574 ms
 Execution Time: 6024.298 ms
(8 rows)

With JIT:

                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1395000.49..1395000.50 rows=1 width=6240) (actual time=4840.064..4840.064 rows=1 loops=1)
   Buffers: shared hit=320 read=399680
   I/O Timings: read=493.679
   ->  Seq Scan on test  (cost=0.00..420000.00 rows=2000000 width=1560) (actual time=0.090..847.458 rows=2000000 loops=1)
         Buffers: shared hit=320 read=399680
         I/O Timings: read=493.679
 Planning Time: 1.414 ms
 JIT:
   Functions: 3
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 19.747 ms, Inlining 10.281 ms, Optimization 222.619 ms, Emission 362.862 ms, Total 615.509 ms
 Execution Time: 4862.113 ms
(12 rows)

Cheers,
Tobias
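[Editor's note: the thread never shows the exact GUCs used to toggle JIT between runs. One common way to force JIT on for a test session regardless of plan cost (the threshold values below are assumptions, not what the poster necessarily used) is:

```sql
SET jit = on;
-- Zero the cost thresholds so even a single cheap query is JIT-compiled,
-- inlined, and optimized, matching the "Inlining true, Optimization true"
-- options in the plans above.
SET jit_above_cost = 0;
SET jit_inline_above_cost = 0;
SET jit_optimize_above_cost = 0;
```

`SET jit = off` then gives the non-JIT baseline in the same session. -Ed.]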
Hi,

On 2019-03-06 19:21:33 +0100, Tobias Gierke wrote:
> On 06.03.19 18:42, Andres Freund wrote:
> >
> > It's hard to know precisely without running a profile of the
> > workload. My suspicion is that the bottleneck in this query is the use
> > of numeric, which has fairly slow operations, including aggregation. And
> > they're too complicated to be inlined.
> >
> > Generally there's definitely advantage in JITing aggregation.
> >
> > There's a lot of further improvements on the table with better JIT code
> > generation, I just haven't gotten around implementing those :(
>
> Thanks for the quick response ! I think you're onto something with the
> numeric type. I replaced it with bigint and repeated my test and now I get a
> nice 40% speedup

Cool. It'd really be worthwhile for somebody to work on adding fastpaths to
the numeric code...

Greetings,

Andres Freund
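[Editor's note: the numeric-vs-bigint gap discussed above is visible even without JIT or a wide table; a quick microbenchmark sketch (timings will vary by machine):

```sql
\timing on
-- numeric aggregation: arbitrary-precision arithmetic on a varlena type
SELECT sum(g::numeric) FROM generate_series(1, 10000000) AS g;
-- bigint aggregation: plain integer arithmetic in the aggregate transition
SELECT sum(g::bigint)  FROM generate_series(1, 10000000) AS g;
```

The bigint version is typically several times faster, which lines up with the suspicion that numeric, not tuple deforming, dominated the original query. -Ed.]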