Re: Lazy JIT IR code generation to increase JIT speed with partitions - Mailing list pgsql-hackers
From: Luc Vlaming
Subject: Re: Lazy JIT IR code generation to increase JIT speed with partitions
Date:
Msg-id: 1ee73e09-e3d8-22cf-3b0e-9a3f1ace0596@swarm64.com
In response to: Re: Lazy JIT IR code generation to increase JIT speed with partitions (Luc Vlaming <luc@swarm64.com>)
Responses: Re: Lazy JIT IR code generation to increase JIT speed with partitions
List: pgsql-hackers
On 18-01-2021 08:47, Luc Vlaming wrote:
> Hi everyone, Andres,
>
> On 03-01-2021 11:05, Luc Vlaming wrote:
>> On 30-12-2020 14:23, Luc Vlaming wrote:
>>> On 30-12-2020 02:57, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> Great to see work in this area!
>
> I would like this topic to somehow progress and was wondering what other
> benchmarks / tests would be needed to have some progress? I've so far
> provided benchmarks for small(ish) queries and some tpch numbers,
> assuming those would be enough.
>
>>>> On 2020-12-28 09:44:26 +0100, Luc Vlaming wrote:
>>>>> I would like to propose a small patch to the JIT machinery which
>>>>> makes the IR code generation lazy. The reason for postponing the
>>>>> generation of the IR code is that with partitions we get an
>>>>> explosion in the number of JIT functions generated as many child
>>>>> tables are involved, each with their own JITted functions,
>>>>> especially when e.g. partition-aware joins/aggregates are enabled.
>>>>> However, only a fraction of those functions is actually executed
>>>>> because the Parallel Append node distributes the workers among the
>>>>> nodes. With the attached patch we get a lazy generation which makes
>>>>> that this is no longer a problem.
>>>>
>>>> I unfortunately don't think this is quite good enough, because it'll
>>>> lead to emitting all functions separately, which can also lead to very
>>>> substantial increases of the required time (as emitting code is an
>>>> expensive step). Obviously that is only relevant in the cases where the
>>>> generated functions actually end up being used - which isn't the case
>>>> in your example.
>>>>
>>>> If you e.g. look at a query like
>>>>   SELECT blub, count(*),sum(zap) FROM foo WHERE blarg = 3 GROUP BY blub;
>>>> on a table without indexes, you would end up with functions for
>>>>
>>>> - WHERE clause (including deforming)
>>>> - projection (including deforming)
>>>> - grouping key
>>>> - aggregate transition
>>>> - aggregate result projection
>>>>
>>>> with your patch each of these would be emitted separately, instead of
>>>> one go. Which IIRC increases the required time by a significant amount,
>>>> especially if inlining is done (where each separate code generation
>>>> ends up with copies of the inlined code).
>>>>
>>>> As far as I can see you've basically falsified the second part of this
>>>> comment (which you moved):
>>>>
>>>>> +
>>>>> + /*
>>>>> + * Don't immediately emit nor actually generate the function.
>>>>> + * instead do so the first time the expression is actually evaluated.
>>>>> + * That allows to emit a lot of functions together, avoiding a lot of
>>>>> + * repeated llvm and memory remapping overhead. It also helps with not
>>>>> + * compiling functions that will never be evaluated, as can be the case
>>>>> + * if e.g. a parallel append node is distributing workers between its
>>>>> + * child nodes.
>>>>> + */
>>>>
>>>>> - /*
>>>>> - * Don't immediately emit function, instead do so the first time the
>>>>> - * expression is actually evaluated. That allows to emit a lot of
>>>>> - * functions together, avoiding a lot of repeated llvm and memory
>>>>> - * remapping overhead.
>>>>> - */
>>>>
>>>> Greetings,
>>>>
>>>> Andres Freund
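To make the deferred-emission idea in the comment above a bit more concrete, here is a rough sketch of the kind of "compile on first call" trampoline being discussed. This is an illustration only, not the actual patch: emit_pending_module() and lookup_compiled_function() are made-up helpers, and it assumes PostgreSQL's executor types (ExprState, ExprStateEvalFunc). All generated functions accumulate in one pending module, the first evaluation emits that module in one go, and functions that are never evaluated (e.g. for sibling partitions handled by other workers) are never compiled at all.

#include "postgres.h"
#include "nodes/execnodes.h"

/* Hypothetical bookkeeping for IR that has been generated but not yet emitted. */
struct PendingModule;

/* Hypothetical helpers standing in for the real emission machinery. */
extern void emit_pending_module(struct PendingModule *mod);
extern ExprStateEvalFunc lookup_compiled_function(struct PendingModule *mod,
                                                  const char *funcname);

typedef struct LazyJitCompiledExpr
{
    struct PendingModule *mod;      /* module holding all IR generated so far */
    const char *funcname;           /* symbol to resolve after emission */
    ExprStateEvalFunc compiled_fn;  /* NULL until the module has been emitted */
} LazyJitCompiledExpr;

/* Installed as ExprState->evalfunc instead of the compiled function. */
static Datum
lazy_jit_expr_eval(ExprState *state, ExprContext *econtext, bool *isnull)
{
    LazyJitCompiledExpr *lazy = (LazyJitCompiledExpr *) state->evalfunc_private;

    if (lazy->compiled_fn == NULL)
    {
        /* First call: emit every pending function in one go ... */
        emit_pending_module(lazy->mod);
        /* ... resolve the one this expression needs ... */
        lazy->compiled_fn = lookup_compiled_function(lazy->mod, lazy->funcname);
        /* ... and bypass the trampoline on subsequent calls. */
        state->evalfunc = lazy->compiled_fn;
    }
    return lazy->compiled_fn(state, econtext, isnull);
}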
>>> Hi,
>>>
>>> Happy to help out, and thanks for the info and suggestions.
>>> Also, I should have first searched psql-hackers and the like, as I
>>> just found out there is already discussions about this in [1] and [2].
>>> However I think the approach I took can be taken independently and
>>> then other solutions could be added on top.
>>>
>>> Assuming I understood all suggestions correctly, the ideas so far are:
>>> 1. add a LLVMAddMergeFunctionsPass so that duplicate code is removed
>>> and not optimized several times (see [1]). Requires all code to be
>>> emitted in the same module.
>>> 2. JIT only parts of the plan, based on cost (see [2]).
>>> 3. Cache compilation results to avoid recompilation. this would
>>> either need a shm capable optimized IR cache or would not work with
>>> parallel workers.
>>> 4. Lazily jitting (this patch)
>>>
>>> An idea that might not have been presented in the mailing list yet(?):
>>> 5. Only JIT in nodes that process a certain amount of rows. Assuming
>>> there is a constant overhead for JITting and the goal is to gain
>>> runtime.
>>>
>>> Going forward I would first try to see if my current approach can
>>> work out. The only idea that would be counterproductive to my
>>> solution would be solution 1. Afterwards I'd like to continue with
>>> either solution 2, 5, or 3 in the hopes that we can reduce JIT
>>> overhead to a minimum and can therefore apply it more broadly.
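For idea 1 above, and assuming LLVM's C API exposes LLVMAddMergeFunctionsPass as discussed in [1], the LLVM side would be little more than adding one pass once everything is emitted into the same module. A minimal, untested sketch:

#include <llvm-c/Core.h>
#include <llvm-c/Transforms/IPO.h>

/*
 * Sketch of idea 1: fold structurally identical functions (e.g. the
 * per-partition copies of the same expression) before the expensive
 * optimization passes see them.  Requires all generated code to live
 * in the same module.
 */
static void
merge_duplicate_functions(LLVMModuleRef mod)
{
    LLVMPassManagerRef pm = LLVMCreatePassManager();

    LLVMAddMergeFunctionsPass(pm);
    LLVMRunPassManager(pm, mod);
    LLVMDisposePassManager(pm);
}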
>>> To test out why and where the JIT performance decreased with my
>>> solution I improved the test script and added various queries to
>>> model some of the cases I think we should care about. I have not
>>> (yet) done big scale benchmarks as these queries seemed to already
>>> show enough problems for now. Now there are 4 queries which test
>>> JITting with/without partitions, and with varying amounts of workers
>>> and rowcounts. I hope these are indeed a somewhat representative set
>>> of queries.
>>>
>>> As pointed out the current patch does create a degradation in
>>> performance wrt queries that are not partitioned (basically q3 and
>>> q4). After looking into those queries I noticed two things:
>>> - q3 is very noisy wrt JIT timings. This seems to be the result of
>>>   something wrt parallel workers starting up the JITting and creating
>>>   very high amounts of noise (e.g. inlining timings varying between
>>>   3.8s and 6.2s)
>>> - q4 seems very stable with JIT timings (after the first run).
>>> I'm wondering if this could mean that with parallel workers quite a
>>> lot of time is spent on startup of the llvm machinery and this gets
>>> noisy because of OS interaction and the like?
>>>
>>> Either way I took q4 to try and fix the regression and noticed
>>> something interesting, given the comment from Andres: the generation
>>> and inlining time actually decreased, but the optimization and
>>> emission time increased. After trying out various things in the
>>> llvm_optimize_module function and googling a bit it seems that the
>>> LLVMPassManagerBuilderUseInlinerWithThreshold adds some very
>>> expensive passes. I tried to construct some queries where this would
>>> actually gain us but couldnt (yet).
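For context on the pipeline in question, below is roughly the shape of an optimize-module routine built on the legacy pass-manager C API. It is a simplified sketch, not the actual llvm_optimize_module(), and the threshold value is arbitrary; the point is that requesting the inliner is a single call, yet it makes the builder schedule the full inlining pipeline over the whole module.

#include <stdbool.h>
#include <llvm-c/Core.h>
#include <llvm-c/Transforms/PassManagerBuilder.h>

/*
 * Simplified sketch of an optimize-module routine.  The interesting part
 * is the conditional LLVMPassManagerBuilderUseInlinerWithThreshold()
 * call, which schedules the (expensive) inlining pipeline discussed above.
 */
static void
optimize_module(LLVMModuleRef mod, bool with_inliner)
{
    LLVMPassManagerBuilderRef pmb = LLVMPassManagerBuilderCreate();
    LLVMPassManagerRef mpm = LLVMCreatePassManager();

    LLVMPassManagerBuilderSetOptLevel(pmb, 3);
    if (with_inliner)
        LLVMPassManagerBuilderUseInlinerWithThreshold(pmb, 512);   /* arbitrary threshold */

    LLVMPassManagerBuilderPopulateModulePassManager(pmb, mpm);
    LLVMRunPassManager(mpm, mod);

    LLVMDisposePassManager(mpm);
    LLVMPassManagerBuilderDispose(pmb);
}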
>>> For v2 of the patch-set the first patch slightly changes how we
>>> optimize the code, which removes the aforementioned degradations in
>>> the queries. The second patch then makes that partitions work a lot
>>> better, but interestingly now also q4 gets a lot faster but somehow
>>> q3 does not.
>>>
>>> Because these findings contradict the suggestions/findings from
>>> Andres I'm wondering what I'm missing. I would continue and do some
>>> TPC-H like tests on top, but apart from that I'm not entirely sure
>>> where we are supposed to gain most from the call to
>>> LLVMPassManagerBuilderUseInlinerWithThreshold(). Reason is that from
>>> the scenarios I now tested it seems that the pain is actually in the
>>> code optimization and possibly rather specific passes and not
>>> necessarily in how many modules are emitted.
>>>
>>> If there are more / better queries / datasets / statistics I can run
>>> and gather I would be glad to do so :) To me the current results seem
>>> however fairly promising.
>>>
>>> Looking forward to your thoughts & suggestions.
>>>
>>> With regards,
>>> Luc
>>> Swarm64
>>>
>>> ===================================
>>> Results from the test script on my machine:
>>>
>>> parameters: jit=on workers=5 jit-inline=0 jit-optimize=0
>>> query1: HEAD       - 08.088901 #runs=5 #JIT=12014
>>> query1: HEAD+01    - 06.369646 #runs=5 #JIT=12014
>>> query1: HEAD+01+02 - 01.248596 #runs=5 #JIT=1044
>>> query2: HEAD       - 17.628126 #runs=5 #JIT=24074
>>> query2: HEAD+01    - 10.786114 #runs=5 #JIT=24074
>>> query2: HEAD+01+02 - 01.262084 #runs=5 #JIT=1083
>>> query3: HEAD       - 00.220141 #runs=5 #JIT=29
>>> query3: HEAD+01    - 00.210917 #runs=5 #JIT=29
>>> query3: HEAD+01+02 - 00.229575 #runs=5 #JIT=25
>>> query4: HEAD       - 00.052305 #runs=100 #JIT=10
>>> query4: HEAD+01    - 00.038319 #runs=100 #JIT=10
>>> query4: HEAD+01+02 - 00.018533 #runs=100 #JIT=3
>>>
>>> parameters: jit=on workers=50 jit-inline=0 jit-optimize=0
>>> query1: HEAD       - 14.922044 #runs=5 #JIT=102104
>>> query1: HEAD+01    - 11.356347 #runs=5 #JIT=102104
>>> query1: HEAD+01+02 - 00.641409 #runs=5 #JIT=1241
>>> query2: HEAD       - 18.477133 #runs=5 #JIT=40122
>>> query2: HEAD+01    - 11.028579 #runs=5 #JIT=40122
>>> query2: HEAD+01+02 - 00.872588 #runs=5 #JIT=1087
>>> query3: HEAD       - 00.235587 #runs=5 #JIT=209
>>> query3: HEAD+01    - 00.219597 #runs=5 #JIT=209
>>> query3: HEAD+01+02 - 00.233975 #runs=5 #JIT=127
>>> query4: HEAD       - 00.052534 #runs=100 #JIT=10
>>> query4: HEAD+01    - 00.038881 #runs=100 #JIT=10
>>> query4: HEAD+01+02 - 00.018268 #runs=100 #JIT=3
>>>
>>> parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=0
>>> query1: HEAD       - 12.696588 #runs=5 #JIT=102104
>>> query1: HEAD+01    - 12.279387 #runs=5 #JIT=102104
>>> query1: HEAD+01+02 - 00.512643 #runs=5 #JIT=1211
>>> query2: HEAD       - 12.091824 #runs=5 #JIT=40122
>>> query2: HEAD+01    - 11.543042 #runs=5 #JIT=40122
>>> query2: HEAD+01+02 - 00.774382 #runs=5 #JIT=1088
>>> query3: HEAD       - 00.122208 #runs=5 #JIT=209
>>> query3: HEAD+01    - 00.114153 #runs=5 #JIT=209
>>> query3: HEAD+01+02 - 00.139906 #runs=5 #JIT=131
>>> query4: HEAD       - 00.033125 #runs=100 #JIT=10
>>> query4: HEAD+01    - 00.029818 #runs=100 #JIT=10
>>> query4: HEAD+01+02 - 00.015099 #runs=100 #JIT=3
>>>
>>> parameters: jit=on workers=50 jit-inline=0 jit-optimize=1e+06
>>> query1: HEAD       - 02.760343 #runs=5 #JIT=102104
>>> query1: HEAD+01    - 02.742944 #runs=5 #JIT=102104
>>> query1: HEAD+01+02 - 00.460169 #runs=5 #JIT=1292
>>> query2: HEAD       - 02.396965 #runs=5 #JIT=40122
>>> query2: HEAD+01    - 02.394724 #runs=5 #JIT=40122
>>> query2: HEAD+01+02 - 00.425303 #runs=5 #JIT=1089
>>> query3: HEAD       - 00.186633 #runs=5 #JIT=209
>>> query3: HEAD+01    - 00.189623 #runs=5 #JIT=209
>>> query3: HEAD+01+02 - 00.193272 #runs=5 #JIT=125
>>> query4: HEAD       - 00.013277 #runs=100 #JIT=10
>>> query4: HEAD+01    - 00.012078 #runs=100 #JIT=10
>>> query4: HEAD+01+02 - 00.004846 #runs=100 #JIT=3
>>>
>>> parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=1e+06
>>> query1: HEAD       - 02.339973 #runs=5 #JIT=102104
>>> query1: HEAD+01    - 02.333525 #runs=5 #JIT=102104
>>> query1: HEAD+01+02 - 00.342824 #runs=5 #JIT=1243
>>> query2: HEAD       - 02.268987 #runs=5 #JIT=40122
>>> query2: HEAD+01    - 02.248729 #runs=5 #JIT=40122
>>> query2: HEAD+01+02 - 00.306829 #runs=5 #JIT=1088
>>> query3: HEAD       - 00.084531 #runs=5 #JIT=209
>>> query3: HEAD+01    - 00.091616 #runs=5 #JIT=209
>>> query3: HEAD+01+02 - 00.08668 #runs=5 #JIT=127
>>> query4: HEAD       - 00.005371 #runs=100 #JIT=10
>>> query4: HEAD+01    - 00.0053 #runs=100 #JIT=10
>>> query4: HEAD+01+02 - 00.002422 #runs=100 #JIT=3
>>>
>>> ===================================
>>> [1]
>>> https://www.postgresql.org/message-id/flat/7736C40E-6DB5-4E7A-8FE3-4B2AB8E22793%40elevated-dev.com
>>>
>>> [2]
>>> https://www.postgresql.org/message-id/flat/CAApHDvpQJqLrNOSi8P1JLM8YE2C%2BksKFpSdZg%3Dq6sTbtQ-v%3Daw%40mail.gmail.com
>>>
>>
>> Hi,
>>
>> Did some TPCH testing today on a TPCH 100G to see regressions there.
>> Results (query/HEAD/patched/speedup)
>>
>> 1   9.49   9.25   1.03
>> 3   11.87  11.65  1.02
>> 4   23.74  21.24  1.12
>> 5   11.66  11.07  1.05
>> 6   7.82   7.72   1.01
>> 7   12.1   11.23  1.08
>> 8   12.99  11.2   1.16
>> 9   71.2   68.05  1.05
>> 10  17.72  17.31  1.02
>> 11  4.75   4.16   1.14
>> 12  10.47  10.27  1.02
>> 13  38.23  38.71  0.99
>> 14  8.69   8.5    1.02
>> 15  12.63  12.6   1.00
>> 19  8.56   8.37   1.02
>> 22  10.34  9.25   1.12
>>
>> Cheers,
>> Luc
>>
>
> Kind regards,
> Luc
>

Providing a new set of rebased patches with a better description in the
hopes this helps reviewability. Also registering this to the next CF to
increase visibility.

Regards,
Luc