Re: Lazy JIT IR code generation to increase JIT speed with partitions - Mailing list pgsql-hackers

From Luc Vlaming
Subject Re: Lazy JIT IR code generation to increase JIT speed with partitions
Date
Msg-id a38f0178-690c-28fc-6960-306475da17a6@swarm64.com
Whole thread Raw
In response to Re: Lazy JIT IR code generation to increase JIT speed with partitions  (Luc Vlaming <luc@swarm64.com>)
Responses Re: Lazy JIT IR code generation to increase JIT speed with partitions
List pgsql-hackers
On 30-12-2020 14:23, Luc Vlaming wrote:
> On 30-12-2020 02:57, Andres Freund wrote:
>> Hi,
>>
>> Great to see work in this area!
>>
>> On 2020-12-28 09:44:26 +0100, Luc Vlaming wrote:
>>> I would like to propose a small patch to the JIT machinery which 
>>> makes the
>>> IR code generation lazy. The reason for postponing the generation of 
>>> the IR
>>> code is that with partitions we get an explosion in the number of JIT
>>> functions generated as many child tables are involved, each with 
>>> their own
>>> JITted functions, especially when e.g. partition-aware 
>>> joins/aggregates are
>>> enabled. However, only a fraction of those functions is actually 
>>> executed
>>> because the Parallel Append node distributes the workers among the 
>>> nodes.
>>> With the attached patch we get a lazy generation which makes that 
>>> this is no
>>> longer a problem.
>>
>> I unfortunately don't think this is quite good enough, because it'll
>> lead to emitting all functions separately, which can also lead to very
>> substantial increases of the required time (as emitting code is an
>> expensive step). Obviously that is only relevant in the cases where the
>> generated functions actually end up being used - which isn't the case in
>> your example.
>>
>> If you e.g. look at a query like
>>    SELECT blub, count(*),sum(zap) FROM foo WHERE blarg = 3 GROUP BY blub;
>> on a table without indexes, you would end up with functions for
>>
>> - WHERE clause (including deforming)
>> - projection (including deforming)
>> - grouping key
>> - aggregate transition
>> - aggregate result projection
>>
>> with your patch each of these would be emitted separately, instead of
>> one go. Which IIRC increases the required time by a significant amount,
>> especially if inlining is done (where each separate code generation ends
>> up with copies of the inlined code).
>>
>>
>> As far as I can see you've basically falsified the second part of this
>> comment (which you moved):
>>
>>> +
>>> +    /*
>>> +     * Don't immediately emit nor actually generate the function.
>>> +     * instead do so the first time the expression is actually 
>>> evaluated.
>>> +     * That allows to emit a lot of functions together, avoiding a 
>>> lot of
>>> +     * repeated llvm and memory remapping overhead. It also helps 
>>> with not
>>> +     * compiling functions that will never be evaluated, as can be 
>>> the case
>>> +     * if e.g. a parallel append node is distributing workers 
>>> between its
>>> +     * child nodes.
>>> +     */
>>
>>> -    /*
>>> -     * Don't immediately emit function, instead do so the first time 
>>> the
>>> -     * expression is actually evaluated. That allows to emit a lot of
>>> -     * functions together, avoiding a lot of repeated llvm and memory
>>> -     * remapping overhead.
>>> -     */
>>
>> Greetings,
>>
>> Andres Freund
>>
> 
> Hi,
> 
> Happy to help out, and thanks for the info and suggestions.
> Also, I should have first searched psql-hackers and the like, as I just 
> found out there is already discussions about this in [1] and [2].
> However I think the approach I took can be taken independently and then 
> other solutions could be added on top.
> 
> Assuming I understood all suggestions correctly, the ideas so far are:
> 1. add a LLVMAddMergeFunctionsPass so that duplicate code is removed and 
> not optimized several times (see [1]). Requires all code to be emitted 
> in the same module.
> 2. JIT only parts of the plan, based on cost (see [2]).
> 3. Cache compilation results to avoid recompilation. this would either 
> need a shm capable optimized IR cache or would not work with parallel 
> workers.
> 4. Lazily jitting (this patch)
> 
> An idea that might not have been presented in the mailing list yet(?):
> 5. Only JIT in nodes that process a certain amount of rows. Assuming 
> there is a constant overhead for JITting and the goal is to gain runtime.
> 
> Going forward I would first try to see if my current approach can work 
> out. The only idea that would be counterproductive to my solution would 
> be solution 1. Afterwards I'd like to continue with either solution 2, 
> 5, or 3 in the hopes that we can reduce JIT overhead to a minimum and 
> can therefore apply it more broadly.
> 
> To test out why and where the JIT performance decreased with my solution 
> I improved the test script and added various queries to model some of 
> the cases I think we should care about. I have not (yet) done big scale 
> benchmarks as these queries seemed to already show enough problems for 
> now. Now there are 4 queries which test JITting with/without partitions, 
> and with varying amounts of workers and rowcounts. I hope these are 
> indeed a somewhat representative set of queries.
> 
> As pointed out the current patch does create a degradation in 
> performance wrt queries that are not partitioned (basically q3 and q4). 
> After looking into those queries I noticed two things:
> - q3 is very noisy wrt JIT timings. This seems to be the result of 
> something wrt parallel workers starting up the JITting and creating very 
> high amounts of noise (e.g. inlining timings varying between 3.8s and 6.2s)
> - q4 seems very stable with JIT timings (after the first run).
> I'm wondering if this could mean that with parallel workers quite a lot 
> of time is spent on startup of the llvm machinery and this gets noisy 
> because of OS interaction and the like?
> 
> Either way I took q4 to try and fix the regression and noticed something 
> interesting, given the comment from Andres: the generation and inlining 
> time actually decreased, but the optimization and emission time 
> increased. After trying out various things in the llvm_optimize_module 
> function and googling a bit it seems that the 
> LLVMPassManagerBuilderUseInlinerWithThreshold adds some very expensive 
> passes. I tried to construct some queries where this would actually gain 
> us but couldnt (yet).
> 
> For v2 of the patch-set the first patch slightly changes how we optimize 
> the code, which removes the aforementioned degradations in the queries. 
> The second patch then makes that partitions work a lot better, but 
> interestingly now also q4 gets a lot faster but somehow q3 does not.
> 
> Because these findings contradict the suggestions/findings from Andres 
> I'm wondering what I'm missing. I would continue and do some TPC-H like 
> tests on top, but apart from that I'm not entirely sure where we are 
> supposed to gain most from the call to 
> LLVMPassManagerBuilderUseInlinerWithThreshold(). Reason is that from the 
> scenarios I now tested it seems that the pain is actually in the code 
> optimization and possibly rather specific passes and not necessarily in 
> how many modules are emitted.
> 
> If there are more / better queries / datasets / statistics I can run and 
> gather I would be glad to do so :) To me the current results seem 
> however fairly promising.
> 
> Looking forward to your thoughts & suggestions.
> 
> With regards,
> Luc
> Swarm64
> 
> ===================================
> Results from the test script on my machine:
> 
>   parameters: jit=on workers=5 jit-inline=0 jit-optimize=0
>   query1: HEAD       - 08.088901 #runs=5 #JIT=12014
>   query1: HEAD+01    - 06.369646 #runs=5 #JIT=12014
>   query1: HEAD+01+02 - 01.248596 #runs=5 #JIT=1044
>   query2: HEAD       - 17.628126 #runs=5 #JIT=24074
>   query2: HEAD+01    - 10.786114 #runs=5 #JIT=24074
>   query2: HEAD+01+02 - 01.262084 #runs=5 #JIT=1083
>   query3: HEAD       - 00.220141 #runs=5 #JIT=29
>   query3: HEAD+01    - 00.210917 #runs=5 #JIT=29
>   query3: HEAD+01+02 - 00.229575 #runs=5 #JIT=25
>   query4: HEAD       - 00.052305 #runs=100 #JIT=10
>   query4: HEAD+01    - 00.038319 #runs=100 #JIT=10
>   query4: HEAD+01+02 - 00.018533 #runs=100 #JIT=3
> 
>   parameters: jit=on workers=50 jit-inline=0 jit-optimize=0
>   query1: HEAD       - 14.922044 #runs=5 #JIT=102104
>   query1: HEAD+01    - 11.356347 #runs=5 #JIT=102104
>   query1: HEAD+01+02 - 00.641409 #runs=5 #JIT=1241
>   query2: HEAD       - 18.477133 #runs=5 #JIT=40122
>   query2: HEAD+01    - 11.028579 #runs=5 #JIT=40122
>   query2: HEAD+01+02 - 00.872588 #runs=5 #JIT=1087
>   query3: HEAD       - 00.235587 #runs=5 #JIT=209
>   query3: HEAD+01    - 00.219597 #runs=5 #JIT=209
>   query3: HEAD+01+02 - 00.233975 #runs=5 #JIT=127
>   query4: HEAD       - 00.052534 #runs=100 #JIT=10
>   query4: HEAD+01    - 00.038881 #runs=100 #JIT=10
>   query4: HEAD+01+02 - 00.018268 #runs=100 #JIT=3
> 
>   parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=0
>   query1: HEAD       - 12.696588 #runs=5 #JIT=102104
>   query1: HEAD+01    - 12.279387 #runs=5 #JIT=102104
>   query1: HEAD+01+02 - 00.512643 #runs=5 #JIT=1211
>   query2: HEAD       - 12.091824 #runs=5 #JIT=40122
>   query2: HEAD+01    - 11.543042 #runs=5 #JIT=40122
>   query2: HEAD+01+02 - 00.774382 #runs=5 #JIT=1088
>   query3: HEAD       - 00.122208 #runs=5 #JIT=209
>   query3: HEAD+01    - 00.114153 #runs=5 #JIT=209
>   query3: HEAD+01+02 - 00.139906 #runs=5 #JIT=131
>   query4: HEAD       - 00.033125 #runs=100 #JIT=10
>   query4: HEAD+01    - 00.029818 #runs=100 #JIT=10
>   query4: HEAD+01+02 - 00.015099 #runs=100 #JIT=3
> 
>   parameters: jit=on workers=50 jit-inline=0 jit-optimize=1e+06
>   query1: HEAD       - 02.760343 #runs=5 #JIT=102104
>   query1: HEAD+01    - 02.742944 #runs=5 #JIT=102104
>   query1: HEAD+01+02 - 00.460169 #runs=5 #JIT=1292
>   query2: HEAD       - 02.396965 #runs=5 #JIT=40122
>   query2: HEAD+01    - 02.394724 #runs=5 #JIT=40122
>   query2: HEAD+01+02 - 00.425303 #runs=5 #JIT=1089
>   query3: HEAD       - 00.186633 #runs=5 #JIT=209
>   query3: HEAD+01    - 00.189623 #runs=5 #JIT=209
>   query3: HEAD+01+02 - 00.193272 #runs=5 #JIT=125
>   query4: HEAD       - 00.013277 #runs=100 #JIT=10
>   query4: HEAD+01    - 00.012078 #runs=100 #JIT=10
>   query4: HEAD+01+02 - 00.004846 #runs=100 #JIT=3
> 
>   parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=1e+06
>   query1: HEAD       - 02.339973 #runs=5 #JIT=102104
>   query1: HEAD+01    - 02.333525 #runs=5 #JIT=102104
>   query1: HEAD+01+02 - 00.342824 #runs=5 #JIT=1243
>   query2: HEAD       - 02.268987 #runs=5 #JIT=40122
>   query2: HEAD+01    - 02.248729 #runs=5 #JIT=40122
>   query2: HEAD+01+02 - 00.306829 #runs=5 #JIT=1088
>   query3: HEAD       - 00.084531 #runs=5 #JIT=209
>   query3: HEAD+01    - 00.091616 #runs=5 #JIT=209
>   query3: HEAD+01+02 - 00.08668  #runs=5 #JIT=127
>   query4: HEAD       - 00.005371 #runs=100 #JIT=10
>   query4: HEAD+01    - 00.0053   #runs=100 #JIT=10
>   query4: HEAD+01+02 - 00.002422 #runs=100 #JIT=3
> 
> ===================================
> [1] 
> https://www.postgresql.org/message-id/flat/7736C40E-6DB5-4E7A-8FE3-4B2AB8E22793%40elevated-dev.com 
> 
> [2] 
> https://www.postgresql.org/message-id/flat/CAApHDvpQJqLrNOSi8P1JLM8YE2C%2BksKFpSdZg%3Dq6sTbtQ-v%3Daw%40mail.gmail.com

> 

Hi,

Did some TPCH testing today on a TPCH 100G to see regressions there.
Results (query/HEAD/patched/speedup)

1    9.49    9.25    1.03
3    11.87    11.65    1.02
4    23.74    21.24    1.12
5    11.66    11.07    1.05
6    7.82    7.72    1.01
7    12.1    11.23    1.08
8    12.99    11.2    1.16
9    71.2    68.05    1.05
10    17.72    17.31    1.02
11    4.75    4.16    1.14
12    10.47    10.27    1.02
13    38.23    38.71    0.99
14    8.69    8.5    1.02
15    12.63    12.6    1.00
19    8.56    8.37    1.02
22    10.34    9.25    1.12

Cheers,
Luc



pgsql-hackers by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: doc review for v14
Next
From: Simon Riggs
Date:
Subject: Re: Safety/validity of resetting permissions by updating system tables