JIT compiling with LLVM v9.0 - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | JIT compiling with LLVM v9.0 |
Date | |
Msg-id | 20180124072038.jviav7h3fgkv7hto@alap3.anarazel.de |
In response to | [HACKERS] JIT compiling expressions/deform + inlining prototype v2.0 (Andres Freund <andres@anarazel.de>) |
Responses | Re: JIT compiling with LLVM v9.0, Re: JIT compiling with LLVM v9.1 |
List | pgsql-hackers |
Hi,

I've spent the last weeks working on my LLVM compilation patchset. In the course of that I *heavily* revised it. While still a good bit away from committable, it's IMO definitely not a prototype anymore. There are too many small changes, so I'm only going to list the major things. A good bit of that is new. The actual LLVM IR emission itself hasn't changed that drastically. Since I've not described them in detail before, I'll describe a few cases from scratch, even if things haven't fully changed.

== JIT Interface ==

To avoid emitting code in very small increments (which increases mmap/mremap rw vs. exec remapping and compile/optimization time), code generation doesn't happen for every single expression individually, but in batches.

The basic object to emit code via is a JIT context, created with:

  extern LLVMJitContext *llvm_create_context(bool optimize);

which in the case of expressions is stored on-demand in the EState. For other use cases that might not be the right location.

To emit LLVM IR (i.e. the portable code that LLVM then optimizes and generates native code for), one gets a module from that context with:

  extern LLVMModuleRef llvm_mutable_module(LLVMJitContext *context);

to which "arbitrary" numbers of functions can be added. In the case of expression evaluation, we get the module once for every expression, and emit one function for the expression itself, plus one for every applicable/referenced deform function.

As explained above, we do not want to emit code immediately from within ExecInitExpr()/ExecReadyExpr(). To facilitate that, readying a JITed expression sets the function to a callback, which gets the actual native function on the first actual call. That allows batching together the generation of all native functions that are defined before the first expression is evaluated - in a lot of queries that'll be all.
Said callback then calls

  extern void *llvm_get_function(LLVMJitContext *context, const char *funcname);

which'll emit code for the "in progress" mutable module if necessary, and then searches all generated functions for the name. The names are currently "evalexpr" and "deform", with a generation and counter suffix.

Currently, expressions which do not have access to an EState - basically all "parent"-less expressions - aren't JIT compiled. That could be changed, but so far I do not see a huge need.

== Error handling ==

There are two aspects to error handling.

Firstly, generated (LLVM IR) and emitted functions (mmap()ed segments) need to be cleaned up both after a successful query execution and after an error. I've settled on a fairly boring resowner-based mechanism. On errors, all expressions owned by a resowner are released; upon success, expressions are reassigned to the parent / released on commit (unless executor shutdown has cleaned them up, of course).

A second, less pretty and newly developed, aspect of error handling is OOM handling inside LLVM itself. The above resowner-based mechanism takes care of cleaning up emitted code upon ERROR, but there's also the chance that LLVM itself runs out of memory. LLVM by default does *not* use any C++ exceptions. Its allocations are primarily funneled through the standard "new" handlers, plus some direct use of malloc() and mmap(). For the former a 'new handler' exists (http://en.cppreference.com/w/cpp/memory/new/set_new_handler); for the latter LLVM provides callbacks that get called upon failure (unfortunately mmap() failures are treated as fatal rather than OOM errors).

What I've chosen to do, and I'd be interested to get some input about that, is to have two functions that LLVM-using code must use:

  extern void llvm_enter_fatal_on_oom(void);
  extern void llvm_leave_fatal_on_oom(void);

Before interacting with LLVM code (i.e.
emitting IR, or using the above functions), llvm_enter_fatal_on_oom() needs to be called.

When a libstdc++ new or LLVM error occurs, the handlers set up by the above functions trigger a FATAL error. We have to use FATAL rather than ERROR, as we *cannot* reliably throw ERROR inside a foreign library without risking corrupting its internal state. Users of the above sections do *not* have to use PG_TRY/CATCH blocks; the handlers are instead reset at the toplevel sigsetjmp() level.

Using a relatively small enter/leave-protected section of code, rather than setting up these handlers globally, avoids negative interactions with extensions that might use C++, like e.g. postgis. As LLVM code generation should never execute arbitrary code, just setting these handlers temporarily ought to suffice.

== LLVM Interface / patches ==

Unfortunately a bit of required LLVM functionality, particularly around error handling but also initialization, isn't currently fully exposed via LLVM's C API. A bit more *optional* API isn't exposed either. Instead of requiring a brand-new version of LLVM that has exposed this functionality, I decided it's better to have a small C++ wrapper that can provide it. Thanks to that new wrapper, significantly older LLVM versions can now be used (for now I've only runtime-tested 5.0 and master; 4.0 would be possible with a few ifdefs, and a bit older is probably doable as well). Given that LLVM is itself written in C++, an optional dependency on a C++ compiler for one file doesn't seem too bad.

== Inlining ==

One big advantage of JITing expressions is that it can significantly reduce the overhead of postgres' extensible function/operator mechanism, by inlining the bodies of called operators. This is the part of the code that I've worked on most. While I think JITing is an entirely viable project even without inlining committed, I felt that we definitely need to know how exactly we want to do inlining before merging other parts.
3 different implementations later, I'm fairly confident that I have a good concept, even though a few corners still need to be smoothed.

As quick background, LLVM works on the basis of a high-level "abstract" assembly representation (llvm.org/docs/LangRef.html). This can be generated in memory, stored in binary form (bitcode files ending in .bc), or in text representation (.ll files). The clang compiler always generates the in-memory representation, and the -emit-llvm flag tells it to write that out to disk rather than to .o files/binaries. This facility allows us to get the bitcode for all operators (e.g. int8eq, float8pl etc.) without maintaining two copies.

The way I've currently set it up is that, if --with-llvm is passed to configure, all backend files are also compiled to bitcode files. These bitcode files get installed into the server's $pkglibdir/bitcode/postgres/ under their original subfolder, e.g.

  ~/build/postgres/dev-assert/install/lib/bitcode/postgres/utils/adt/float.bc

Using existing LLVM functionality (for parallel LTO compilation), an index over these is additionally stored to $pkglibdir/bitcode/postgres.index.bc

When deciding to JIT for the first time, $pkglibdir/bitcode/ is scanned for all .index.bc files and a *combined* index over all these files is built in memory. The reason for doing so is that this allows "easy" inlining access for extensions - they can install code into $pkglibdir/bitcode/[extension]/, accompanied by $pkglibdir/bitcode/[extension].index.bc, just alongside the actual library.

The inlining implementation (I had to write my own, as LLVM's isn't suitable for a number of reasons) can then use the combined in-memory index to look up all 'extern' function references, judge their size, and then open just the file containing the implementation (i.e. the above float.bc). Currently there's a limit of 150 instructions for functions to be inlined; functions used by inlined functions have a budget of 0.5 * limit, and so on.
This gets rid of most operators in the queries I tested, although there are a few that resist inlining due to references to file-local static variables - but those largely don't seem to be performance relevant.

== Type Synchronization ==

For my current two main avenues of performance optimization due to JITing, expression eval and tuple deforming, it's obviously required that code generation knows about at least a few postgres types (tuple slots, heap tuples, expr context/state, etc). Initially I'd provided LLVM with those types by emitting them manually, like:

  {
      LLVMTypeRef members[15];

      members[ 0] = LLVMInt32Type(); /* type */
      members[ 1] = LLVMInt8Type();  /* isempty */
      members[ 2] = LLVMInt8Type();  /* shouldFree */
      members[ 3] = LLVMInt8Type();  /* shouldFreeMin */
      members[ 4] = LLVMInt8Type();  /* slow */
      members[ 5] = LLVMPointerType(StructHeapTupleData, 0); /* tuple */
      members[ 6] = LLVMPointerType(StructtupleDesc, 0); /* tupleDescriptor */
      members[ 7] = TypeMemoryContext; /* mcxt */
      members[ 8] = LLVMInt32Type(); /* buffer */
      members[ 9] = LLVMInt32Type(); /* nvalid */
      members[10] = LLVMPointerType(TypeSizeT, 0); /* values */
      members[11] = LLVMPointerType(LLVMInt8Type(), 0); /* nulls */
      members[12] = LLVMPointerType(StructMinimalTupleData, 0); /* mintuple */
      members[13] = StructHeapTupleData; /* minhdr */
      members[14] = LLVMInt64Type(); /* off */

      StructTupleTableSlot = LLVMStructCreateNamed(LLVMGetGlobalContext(),
                                                   "struct.TupleTableSlot");
      LLVMStructSetBody(StructTupleTableSlot, members, lengthof(members), false);
  }

and then using numeric offsets when emitting code, like:

  LLVMBuildStructGEP(builder, v_slot, 9, "")

to compute the address of the nvalid field of a slot at runtime. But that obviously duplicates a lot of information and is incredibly failure prone. Doesn't seem acceptable.

What I've now instead done is have one small file (llvmjit_types.c) which references each of the types required for JITing.
That file is translated to bitcode at compile time, and loaded when LLVM is initialized in a backend. That works very well to synchronize the type definitions; unfortunately it does *not* synchronize offsets, as the IR-level representation doesn't know field names. Instead I've added defines to the original struct definitions that provide access to the relevant offsets. E.g.

  #define FIELDNO_TUPLETABLESLOT_NVALID 9
  int         tts_nvalid;     /* # of valid values in tts_values */

While that still needs to be defined, it's only required for a relatively small number of fields, and it's bunched together with the struct definition, so it's easily kept synchronized.

A significant downside of this is that clang needs to be around to create that bitcode file, but that doesn't seem that bad as an optional *build*-time, *not* runtime, dependency. Not a perfect solution, but I don't quite see a better approach.

== Minimal cost based planning & config ==

Currently there are a number of GUCs that influence JITing:

- jit_above_cost = -1, 0-DBL_MAX - all queries with a higher total cost get JITed, *without* optimization (the expensive part), corresponding to -O0. This commonly already results in significant speedups if expression evaluation/deforming is a bottleneck (mostly by removing dynamic branches).
- jit_optimize_above_cost = -1, 0-DBL_MAX - all queries with a higher total cost get JITed, *with* optimization (the expensive part).
- jit_inline_above_cost = -1, 0-DBL_MAX - inlining is tried if the query has higher cost.

For all of these, -1 is a hard disable. There currently also exist:

- jit_expressions = 0/1
- jit_deform = 0/1
- jit_perform_inlining = 0/1

but I think they could just be removed in favor of the above.

Additionally there are a few debugging/other GUCs:

- jit_debugging_support = 0/1 - register generated functions with the debugger.
Unfortunately GDB's JIT integration scales O(#functions^2), albeit with a very small constant, so it cannot always be enabled :(
- jit_profiling_support = 0/1 - emit information so perf gets notified about JITed functions. As this logs data to disk that is not automatically cleaned up (otherwise it'd be useless), this definitely cannot be enabled by default.
- jit_dump_bitcode = 0/1 - log generated pre/post-optimization bitcode to disk. This is quite useful for development, so I'd want to keep it.
- jit_log_ir = 0/1 - dump generated IR to the logfile. I found this to be too verbose, and I think it should be yanked.

Do people feel these should be hidden behind #ifdefs, always present but prevented from being set to a meaningful value, or unrestricted?

== Remaining work ==

I'm planning to tackle these in the near future; they need to be handled before merging:

- Add a big readme
- Add docs
- Add / check LLVM 4.0 support
- Reconsider the location of JITing code (lib/ and heaptuple.c specifically)
- Split llvmjit_wrap.cpp into three files (error handling, inlining, temporary LLVM C API extensions)
- Split the bigger commit, improve commit messages
- Significant amounts of local code cleanup and comments
- Deduplicate code in expression emission for very related step types
- More consistent LLVM variable naming
- pgindent
- Condense timing information about JITing into fewer messages, and hide it behind a GUC
- Improve logging (mostly remove)

== Future Todo (some already in progress) ==

- JITed hash computation for nodeAgg & nodeHash. That's currently a major bottleneck.
- Increase the quality of generated code. There's a *lot* still left on the table. The generated code currently spills far too much into memory, and LLVM can only optimize that away to a limited degree. I've experimented some, and for TPCH Q01 it's possible to get at least another 1.8x due to that, with expression eval *still* being the bottleneck afterwards...
- Caching of the generated code, drastically reducing overhead and allowing JITing to be beneficial in OLTP cases. Currently the biggest obstacle to that is the number of specific memory locations referenced in the expression representation, but that definitely can be improved (a lot of it by the above point alone).
- A more elaborate planning model.
- The cloning of modules could be reduced to only cloning the required parts. As that's the most expensive part of inlining, and most of the time only a few functions are used, this should probably be done soon.

== Code ==

As the patchset is large (500kb) and I'm still quickly evolving it, I do not yet want to attach it. The git tree is at

  https://git.postgresql.org/git/users/andresfreund/postgres.git

in the jit branch:

  https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/jit

To build, --with-llvm has to be passed to configure, and llvm-config either needs to be in PATH or provided to make with LLVM_CONFIG. A C++ compiler and clang need to be available under common names or provided via CXX / CLANG respectively.

Regards,

Andres Freund